Why does the packet number in my audio queue input callback vary?

I use Audio Queue Services to record PCM audio data on Mac OS X. It works but the number of frames I get in my callback varies.
static void MyAQInputCallback(void *inUserData, AudioQueueRef inQueue, AudioQueueBufferRef inBuffer, const AudioTimeStamp *inStartTime, UInt32 inNumPackets, const AudioStreamPacketDescription *inPacketDesc)
On each call of my audio queue input callback I want to get 5 ms of audio data (240 frames, i.e. inNumPackets = 240, at 48 kHz).
This is the audio format I use:
AudioStreamBasicDescription recordFormat = {0};
memset(&recordFormat, 0, sizeof(recordFormat));
recordFormat.mFormatID = kAudioFormatLinearPCM;
recordFormat.mFormatFlags = kLinearPCMFormatFlagIsSignedInteger | kAudioFormatFlagsNativeEndian | kAudioFormatFlagIsPacked;
recordFormat.mBytesPerPacket = 4;
recordFormat.mFramesPerPacket = 1;
recordFormat.mBytesPerFrame = 4;
recordFormat.mChannelsPerFrame = 2;
recordFormat.mBitsPerChannel = 16;
I have two buffers of 960 bytes enqueued:
for (int i = 0; i < 2; ++i) {
    AudioQueueBufferRef buffer;
    AudioQueueAllocateBuffer(queue, 960, &buffer);
    AudioQueueEnqueueBuffer(queue, buffer, 0, NULL);
}
My problem: after every 204 callbacks with 240 frames (inNumPackets), the callback is called once with only 192 frames.
Why does that happen, and is there something I can do to get 240 frames consistently?

Audio Queues run on top of Audio Units. The Audio Unit buffers are very likely configured by the OS to be a power-of-two in size, and your returned Audio Queue buffers are chopped out of the larger Audio Unit buffers.
204 * 240 + 192 = 49152 frames, which is exactly 12 Audio Unit buffers of 4096 frames.
If you want fixed length buffers that are not a power-of-two, your best bet is to have the app re-buffer the incoming buffers (save up until you have enough data) to your desired length. A lock-free circular fifo/buffer might be suitable for this purpose.
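The re-buffering idea can be sketched as follows. This is a minimal, single-threaded illustration, not a lock-free implementation: the names FrameFifo, fifo_write, fifo_read_block and rebuffer are hypothetical, the capacity is an arbitrary assumption, and a real input callback running on an audio thread would need a FIFO that is safe to share with the consumer thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_FRAMES    240   /* desired fixed block: 5 ms at 48 kHz */
#define BYTES_PER_FRAME 4     /* 2 channels x 16 bit, as in the question */
#define FIFO_FRAMES     8192  /* capacity in frames (assumed, tune as needed) */

typedef struct {
    uint8_t data[FIFO_FRAMES * BYTES_PER_FRAME];
    size_t read_pos;  /* index of the oldest stored frame */
    size_t count;     /* number of frames currently stored */
} FrameFifo;

/* Append n frames. Returns 0 on success, -1 if they do not fit. */
static int fifo_write(FrameFifo *f, const uint8_t *src, size_t n) {
    assert(f != NULL && src != NULL);
    if (f->count + n > FIFO_FRAMES) return -1;
    size_t write_pos = (f->read_pos + f->count) % FIFO_FRAMES;
    size_t first = FIFO_FRAMES - write_pos;   /* frames before wrap-around */
    if (first > n) first = n;
    memcpy(f->data + write_pos * BYTES_PER_FRAME, src, first * BYTES_PER_FRAME);
    memcpy(f->data, src + first * BYTES_PER_FRAME, (n - first) * BYTES_PER_FRAME);
    f->count += n;
    return 0;
}

/* Pop exactly BLOCK_FRAMES frames into dst. Returns 1 if a block was produced. */
static int fifo_read_block(FrameFifo *f, uint8_t *dst) {
    assert(f != NULL && dst != NULL);
    if (f->count < BLOCK_FRAMES) return 0;
    size_t first = FIFO_FRAMES - f->read_pos; /* frames before wrap-around */
    if (first > BLOCK_FRAMES) first = BLOCK_FRAMES;
    memcpy(dst, f->data + f->read_pos * BYTES_PER_FRAME, first * BYTES_PER_FRAME);
    memcpy(dst + first * BYTES_PER_FRAME, f->data,
           (BLOCK_FRAMES - first) * BYTES_PER_FRAME);
    f->read_pos = (f->read_pos + BLOCK_FRAMES) % FIFO_FRAMES;
    f->count -= BLOCK_FRAMES;
    return 1;
}

/* What the input callback would do with each incoming buffer, whatever its
 * length. Returns the number of complete 240-frame blocks produced. */
static int rebuffer(FrameFifo *f, const uint8_t *in, size_t in_frames) {
    uint8_t block[BLOCK_FRAMES * BYTES_PER_FRAME];
    int blocks = 0;
    if (fifo_write(f, in, in_frames) != 0) return -1;
    while (fifo_read_block(f, block)) {
        blocks++;  /* hand `block` to the consumer here */
    }
    return blocks;
}
```

With this in place, a short 192-frame callback simply produces no block, and the next callback completes it, so the consumer only ever sees 240-frame blocks.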


Turn off sw_scale conversion to planar YUV 32 byte alignment requirements

I am experiencing artifacts on the right edge of scaled and converted images when converting into planar YUV pixel formats with sw_scale. I am reasonably sure (although I cannot find it anywhere in the documentation) that this is because sw_scale uses an optimization for 32 byte aligned lines in the destination. However, I would like to turn this off because I am using sw_scale for image composition, so even though the destination lines may be 32 byte aligned, the output image may not be.
Full output frame is 1280x720 yuv422p10le. (this is 32 byte aligned)
However into the top left corner I am scaling an image with an outwidth of 1280 / 3 = 426.
426 in this format is not 32 byte aligned, but I believe sw_scale sees that the output linesize is 32 byte aligned and overwrites the width of 426 putting garbage in the next 22 bytes of data thinking this is simply padding when in my case this is displayable area.
This is why I need to actually disable this optimization or somehow trick sw_scale into believing it does not apply while keeping intact the way the program works, which is otherwise fine.
I have tried adding extra padding to the destination lines so they are no longer 32 byte aligned; this did not help as far as I can tell.
Edit with code example. Rendering omitted for ease of use.
Also, here is a similar issue; unfortunately, as I stated there, their fix will not work for my use case. https://github.com/obsproject/obs-studio/pull/2836
Use the commented-out line of code to switch between an output width which is and isn't 32 byte aligned.
#include "libswscale/swscale.h"
#include "libavutil/imgutils.h"
#include "libavutil/pixelutils.h"
#include "libavutil/pixfmt.h"
#include "libavutil/pixdesc.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char **argv) {
    /// Set up a 1280x720 window, and an item with 1/3 width and height of the window.
    int window_width, window_height, item_width, item_height;
    window_width = 1280;
    window_height = 720;
    item_width = (window_width / 3);
    item_height = (window_height / 3);
    int item_out_width = item_width;
    /// This line sets the item width to be 32 byte aligned uncomment to see uncorrupted results
    /// Note %16 because outformat is 2 bytes per component
    //item_out_width -= (item_width % 16);
    enum AVPixelFormat outformat = AV_PIX_FMT_YUV422P10LE;
    enum AVPixelFormat informat = AV_PIX_FMT_UYVY422;
    int window_lines[4] = {0};
    av_image_fill_linesizes(window_lines, outformat, window_width);
    uint8_t *window_planes[4] = {0};
    window_planes[0] = calloc(1, window_lines[0] * window_height);
    window_planes[1] = calloc(1, window_lines[1] * window_height);
    window_planes[2] = calloc(1, window_lines[2] * window_height); /// Fill the window with all 0s, this is green in yuv.
    int item_lines[4] = {0};
    av_image_fill_linesizes(item_lines, informat, item_width);
    uint8_t *item_planes[4] = {0};
    item_planes[0] = malloc(item_lines[0] * item_height);
    memset(item_planes[0], 100, item_lines[0] * item_height);
    struct SwsContext *ctx;
    ctx = sws_getContext(item_width, item_height, informat,
                         item_out_width, item_height, outformat, SWS_FAST_BILINEAR, NULL, NULL, NULL);
    /// Check a block in the normal region
    printf("Pre scale normal region %d %d %d\n", (int)((uint16_t*)window_planes[0])[0], (int)((uint16_t*)window_planes[1])[0],
           (int)((uint16_t*)window_planes[2])[0]);
    /// Check a block in the corrupted region (should be all zeros) These values should be out of the converted region
    int corrupt_offset_y = (item_out_width + 3) * 2; ///(item_width + 3) * 2 bytes per component Y PLANE
    int corrupt_offset_uv = (item_out_width + 3); ///(item_width + 3) * (2 bytes per component rshift 1 for horiz scaling) U and V PLANES
    printf("Pre scale corrupted region %d %d %d\n", (int)(*((uint16_t*)(window_planes[0] + corrupt_offset_y))),
           (int)(*((uint16_t*)(window_planes[1] + corrupt_offset_uv))), (int)(*((uint16_t*)(window_planes[2] + corrupt_offset_uv))));
    sws_scale(ctx, (const uint8_t**)item_planes, item_lines, 0, item_height, window_planes, window_lines);
    /// Perform same tests after scaling
    printf("Post scale normal region %d %d %d\n", (int)((uint16_t*)window_planes[0])[0], (int)((uint16_t*)window_planes[1])[0],
           (int)((uint16_t*)window_planes[2])[0]);
    printf("Post scale corrupted region %d %d %d\n", (int)(*((uint16_t*)(window_planes[0] + corrupt_offset_y))),
           (int)(*((uint16_t*)(window_planes[1] + corrupt_offset_uv))), (int)(*((uint16_t*)(window_planes[2] + corrupt_offset_uv))));
    return 0;
}
Example Output:
//No alignment
Pre scale normal region 0 0 0
Pre scale corrupted region 0 0 0
Post scale normal region 400 400 400
Post scale corrupted region 512 36865 36865
//With alignment
Pre scale normal region 0 0 0
Pre scale corrupted region 0 0 0
Post scale normal region 400 400 400
Post scale corrupted region 0 0 0
"I believe sw_scale sees that the output linesize is 32 byte aligned and overwrites the width of 426 putting garbage in the next 22 bytes of data thinking this is simply padding when in my case this is displayable area."
That's actually correct, swscale does indeed do that; good analysis. There are two ways to get rid of this:
1. Disable all SIMD code using av_set_cpu_flags_mask(0).
2. Write the re-scaled 426xN image into a temporary buffer and then manually copy the pixels into the unpadded destination plane.
The reason ffmpeg/swscale overwrite the destination is for performance. If you don't care about runtime and want the simplest code, use the first solution. If you do want performance and don't mind slightly more complicated code, use the second solution.
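The manual copy in the second solution could look roughly like this. It is a sketch under assumptions: copy_plane_region is a hypothetical helper, and the temporary planes are assumed to have been allocated with padded line sizes so that swscale can safely write past the visible width there. Only the visible bytes of each row are copied into the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `rows` rows of `row_bytes` visible bytes each from a padded temporary
 * plane into a destination plane. Bytes beyond row_bytes in the destination
 * rows are left untouched. */
static void copy_plane_region(uint8_t *dst, int dst_linesize,
                              const uint8_t *src, int src_linesize,
                              int row_bytes, int rows) {
    assert(dst != NULL && src != NULL);
    assert(row_bytes <= dst_linesize && row_bytes <= src_linesize);
    for (int y = 0; y < rows; y++) {
        memcpy(dst + (size_t)y * (size_t)dst_linesize,
               src + (size_t)y * (size_t)src_linesize,
               (size_t)row_bytes);
    }
}
```

For the 426-pixel-wide yuv422p10le item from the question, row_bytes would be 426 * 2 = 852 for the Y plane and 213 * 2 = 426 for each chroma plane, with rows equal to the item height for all three planes. libavutil's av_image_copy_plane does the same kind of row-by-row copy and may be preferable in real code.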

Compute accurately the pts and dts of a video packet

Is there a way to compute AVPacket.pts and AVPacket.dts for an encoded packet when the packet doesn't contain a duration?
I tried computing them by starting the timestamp at 0 and then increasing it by the computed duration of each video frame. My computation of the duration is below:
if (video_ts < 0) video_ts = 0;
video_ts += (int64_t)last_duration;
compressed_video.pts = video_ts;
compressed_video.dts = video_ts;
last_duration = ((compressed_video.size * 8) / (double)out_videocc->bit_rate) * 1000;
It partly works, but it is not exact: the playback stutters.
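The duration above is derived from the packet size, and encoded packet sizes vary from frame to frame even when frames are displayed at a constant rate. A minimal sketch of a different technique follows, under the assumptions that the video has a constant frame rate and no B-frames (so dts can equal pts): derive the timestamp from a frame counter instead. The function frame_pts and its parameters are hypothetical; with FFmpeg itself, av_rescale_q performs this kind of time base conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Timestamp of frame number n, expressed in units of tb_num/tb_den seconds,
 * for a constant frame rate of fps_num/fps_den frames per second. */
static int64_t frame_pts(int64_t n, int fps_num, int fps_den,
                         int tb_num, int tb_den) {
    assert(n >= 0 && fps_num > 0 && fps_den > 0 && tb_num > 0 && tb_den > 0);
    /* seconds = n * fps_den / fps_num, ticks = seconds * tb_den / tb_num */
    return (n * fps_den * tb_den) / ((int64_t)fps_num * tb_num);
}
```

For example, at 25 fps with a time base of 1/1000, consecutive frames get timestamps 0, 40, 80 and so on, regardless of how large each encoded packet is.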

Why does every encoded frame's size increase after I force one frame to be a key frame with Intel QSV in FFmpeg?

I use Intel's QSV to encode H.264 video in FFmpeg. My AVCodecContext settings are as follows:
m_ctx->width = m_width;
m_ctx->height = m_height;
m_ctx->time_base = { 1, (int)fps };
m_ctx->qmin = 10;
m_ctx->qmax = 35;
m_ctx->gop_size = 3000;
m_ctx->max_b_frames = 0;
m_ctx->has_b_frames = false;
m_ctx->refs = 2;
m_ctx->slices = 0;
m_ctx->codec_id = m_encoder->id;
m_ctx->codec_type = AVMEDIA_TYPE_VIDEO;
m_ctx->pix_fmt = m_h264InputFormat;
m_ctx->compression_level = 4;
m_ctx->flags &= ~AV_CODEC_FLAG_CLOSED_GOP;
AVDictionary *param = nullptr;
av_dict_set(&param, "idr_interval", "0", 0);
av_dict_set(&param, "async_depth", "1", 0);
av_dict_set(&param, "forced_idr", "1", 0);
During encoding, I set the AVFrame's pict_type to AV_PICTURE_TYPE_I when a key frame is needed:
encodeFrame->pict_type = key_frame ? AV_PICTURE_TYPE_I : AV_PICTURE_TYPE_NONE;
avcodec_send_frame(m_ctx, encodeFrame);
avcodec_receive_packet(m_ctx, m_packet);
std::cerr<<"packet size is "<<m_packet->size<<",is key frame "<<key_frame<<std::endl;
The strange thing is that once I have set one frame to AV_PICTURE_TYPE_I, the size of every encoded frame after that key frame increases. If I change the H.264 encoder to x264, it is fine.
These are the packet sizes before I set "encodeFrame->pict_type = AV_PICTURE_TYPE_I":
packet size is 26839
packet size is 2766
packet size is 2794
packet size is 2193
packet size is 1820
packet size is 2542
packet size is 2024
packet size is 1692
packet size is 2095
packet size is 2550
packet size is 1685
packet size is 1800
packet size is 2276
packet size is 1813
packet size is 2206
packet size is 2745
packet size is 2334
packet size is 2623
packet size is 2055
After I set "encodeFrame->pict_type = AV_PICTURE_TYPE_I", the packet sizes are as follows:
packet size is 23720,is key frame 1
packet size is 23771,is key frame 0
packet size is 23738,is key frame 0
packet size is 23752,is key frame 0
packet size is 23771,is key frame 0
packet size is 23763,is key frame 0
packet size is 23715,is key frame 0
packet size is 23686,is key frame 0
packet size is 23829,is key frame 0
packet size is 23774,is key frame 0
packet size is 23850,is key frame 0
FFmpeg doesn't reset the FrameType field of mfxEncodeCtrl when encoding the next frame, so every frame after the forced key frame is also encoded as an IDR frame.

AudioUnit output buffer and input buffer

My question is: what should I do when I use real-time time stretching?
I understand that changing the rate changes the number of output samples.
For example, if I stretch the audio by a factor of 2.0, the output is twice as large.
So, what should I do if I create a reverb, a delay or a real-time time stretch?
For example, my input buffer is 1024 samples. Then I stretch the audio by a factor of 2.0. Now my buffer is 2048 samples.
With the Superpowered time-stretching code below, everything works as long as I do not change the rate. When I change the rate, the audio is distorted and the speed does not actually change.
return ^AUAudioUnitStatus(AudioUnitRenderActionFlags *actionFlags,
                          const AudioTimeStamp *timestamp,
                          AVAudioFrameCount frameCount,
                          NSInteger outputBusNumber,
                          AudioBufferList *outputBufferListPtr,
                          const AURenderEvent *realtimeEventListHead,
                          AURenderPullInputBlock pullInputBlock) {
    pullInputBlock(actionFlags, timestamp, frameCount, 0, renderABLCapture);
    Float32 *sampleDataInLeft = (Float32*) renderABLCapture->mBuffers[0].mData;
    Float32 *sampleDataInRight = (Float32*) renderABLCapture->mBuffers[1].mData;
    Float32 *sampleDataOutLeft = (Float32*)outputBufferListPtr->mBuffers[0].mData;
    Float32 *sampleDataOutRight = (Float32*)outputBufferListPtr->mBuffers[1].mData;
    SuperpoweredAudiobufferlistElement inputBuffer;
    inputBuffer.samplePosition = 0;
    inputBuffer.startSample = 0;
    inputBuffer.samplesUsed = 0;
    inputBuffer.endSample = frameCount;
    inputBuffer.buffers[0] = SuperpoweredAudiobufferPool::getBuffer(frameCount * 8 + 64);
    inputBuffer.buffers[1] = inputBuffer.buffers[2] = inputBuffer.buffers[3] = NULL;
    SuperpoweredInterleave(sampleDataInLeft, sampleDataInRight, (Float32*)inputBuffer.buffers[0], frameCount);
    timeStretch->setRateAndPitchShift(1.0f, -2);
    timeStretch->process(&inputBuffer, outputBuffers);
    if (outputBuffers->makeSlice(0, outputBuffers->sampleLength)) {
        int numSamples = 0;
        int samplesOffset = 0;
        while (true) {
            Float32 *timeStretchedAudio = (Float32 *)outputBuffers->nextSliceItem(&numSamples);
            if (!timeStretchedAudio) break;
            SuperpoweredDeInterleave(timeStretchedAudio, sampleDataOutLeft + samplesOffset, sampleDataOutRight + samplesOffset, numSamples);
            samplesOffset += numSamples;
        }
    }
    return noErr;
};
So, how can I write my Audio Unit render block when my input and output buffers have different sample counts (reverb, delay or time stretch)?
If your process creates more samples than provided by the audio callback input/output buffer size, you have to save those samples and play them later, by mixing in with subsequent output in a later audio unit callback if necessary.
Often circular buffers are used to decouple input, processing, and output sample rates or buffer sizes.
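Saving surplus samples and playing them in a later callback might be sketched like this. The names PendingOutput, pending_push and pending_pop are hypothetical, the capacity is an assumed bound, and the sketch handles one channel and one thread only; it shows the bookkeeping, not a complete render block.

```c
#include <assert.h>
#include <stddef.h>

#define PENDING_CAPACITY 16384  /* assumed upper bound on queued samples */

typedef struct {
    float samples[PENDING_CAPACITY];
    size_t head;   /* index of the oldest queued sample */
    size_t count;  /* number of queued samples */
} PendingOutput;

/* Queue whatever the processor produced, however many samples that is.
 * Returns the number of samples queued (fewer than n if the queue is full). */
static size_t pending_push(PendingOutput *p, const float *src, size_t n) {
    assert(p != NULL && src != NULL);
    size_t pushed = 0;
    while (pushed < n && p->count < PENDING_CAPACITY) {
        p->samples[(p->head + p->count) % PENDING_CAPACITY] = src[pushed++];
        p->count++;
    }
    return pushed;
}

/* Fill exactly frame_count output samples for this callback. Missing samples
 * are replaced by silence. Returns the number of real samples written. */
static size_t pending_pop(PendingOutput *p, float *dst, size_t frame_count) {
    assert(p != NULL && dst != NULL);
    size_t real = 0;
    for (size_t i = 0; i < frame_count; i++) {
        if (p->count > 0) {
            dst[i] = p->samples[p->head];
            p->head = (p->head + 1) % PENDING_CAPACITY;
            p->count--;
            real++;
        } else {
            dst[i] = 0.0f;  /* underrun: output silence */
        }
    }
    return real;
}
```

In a render block, the processed samples (for example 2048 of them) would go through pending_push, and the callback would then call pending_pop with its own frameCount (for example 1024), leaving the rest queued for the next callback.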

FFMPEG: Dumping YUV data into AVFrame structure

I'm trying to dump YUV420 data into the AVFrame structure of FFmpeg. From the link below:
http://ffmpeg.org/doxygen/trunk/structAVFrame.html, I can derive that I need to put my data into
The YUV data I'm trying to dump is YUV420 and the picture size is 416x240. So how do I dump/map this YUV data to the AVFrame structure's variables? I know that linesize represents the stride, i.e. I suppose the width of my picture. I have tried some combinations but do not get the output. I kindly request you to help me map the buffer. Thanks in advance.
AVFrame can be interpreted as an AVPicture to fill the data and linesize fields. The easiest way to fill these fields is to use the avpicture_fill function.
How you fill in the AVFrame's Y, U and V buffers depends on your input data and on what you want to do with the frame (do you want to write into the AVFrame and overwrite the initial data, or keep a copy?).
If the buffers are large enough (at least linesize[0] * height for Y data, linesize[1 or 2] * height/2 for U/V data), you can use the input buffers directly:
// Initialize the AVFrame
AVFrame* frame = avcodec_alloc_frame();
frame->width = width;
frame->height = height;
frame->format = AV_PIX_FMT_YUV420P;
// Initialize frame->linesize
avpicture_fill((AVPicture*)frame, NULL, frame->format, frame->width, frame->height);
// Set frame->data pointers manually
frame->data[0] = inputBufferY;
frame->data[1] = inputBufferU;
frame->data[2] = inputBufferV;
// Or if your Y, U, V buffers are contiguous and have the correct size, simply use:
// avpicture_fill((AVPicture*)frame, inputBufferYUV, frame->format, frame->width, frame->height);
If you want/need to manipulate a copy of input data, you need to compute the needed buffer size, and copy input data in it.
// Initialize the AVFrame
AVFrame* frame = avcodec_alloc_frame();
frame->width = width;
frame->height = height;
frame->format = AV_PIX_FMT_YUV420P;
// Allocate a buffer large enough for all data
int size = avpicture_get_size(frame->format, frame->width, frame->height);
uint8_t* buffer = (uint8_t*)av_malloc(size);
// Initialize frame->linesize and frame->data pointers
avpicture_fill((AVPicture*)frame, buffer, frame->format, frame->width, frame->height);
// Copy data from the 3 input buffers
memcpy(frame->data[0], inputBufferY, frame->linesize[0] * frame->height);
memcpy(frame->data[1], inputBufferU, frame->linesize[1] * frame->height / 2);
memcpy(frame->data[2], inputBufferV, frame->linesize[2] * frame->height / 2);
Once you are done with the AVFrame, do not forget to free it with av_frame_free (and any buffer allocated by av_malloc).
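FF_API int ff_get_format_plane_size(int fmt, int plane, int scanLine, int height)
{
    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
    if (desc)
    {
        int h = height;
        // Chroma planes have fewer rows when the format is vertically subsampled
        if (plane == 1 || plane == 2)
            h = FF_CEIL_RSHIFT(height, desc->log2_chroma_h);
        return h*scanLine;
    }
    return -1; // assumption: the original snippet does not show what is returned for an unknown format
}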
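As a worked example of the plane-size rule for the 416x240 YUV420 picture from the question, here is a small sketch without any FFmpeg dependency. It assumes 8-bit samples and tightly packed rows (line size 416 for Y and 208 for U and V); real line sizes reported by FFmpeg may be larger because of alignment. The helper names ceil_rshift and plane_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Right shift that rounds up, so odd heights still cover the last row. */
static int ceil_rshift(int value, int shift) {
    assert(value >= 0 && shift >= 0);
    return (value + (1 << shift) - 1) >> shift;
}

/* Bytes in one plane of an 8-bit planar YUV picture with the given line size.
 * Plane 0 is Y, planes 1 and 2 are U and V. log2_chroma_h is 1 for 4:2:0. */
static size_t plane_size(int plane, int linesize, int height, int log2_chroma_h) {
    int rows = (plane == 0) ? height : ceil_rshift(height, log2_chroma_h);
    return (size_t)linesize * (size_t)rows;
}
```

Under these assumptions the Y plane needs 416 * 240 = 99840 bytes and each chroma plane needs 208 * 120 = 24960 bytes, 149760 bytes in total.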
