How many buffers are enough in ALLOCATOR_PROPERTIES::cBuffers?

I'm working on a custom video transform filter derived from CTransformFilter. It doesn't do anything unusual in DirectShow terms such as extra internal buffering of media samples, queueing of output samples or dynamic format changes.
Graphs in GraphEdit containing two instances of my filter connected end to end (output of the first connected to input of the second) hang when play is pressed. The graph is definitely not hanging within the ::Transform method override. The second filter instance is not connected directly to a video renderer.
The problem doesn't happen if a colour converter is inserted between the two filters. If I increase the number of buffers requested (ALLOCATOR_PROPERTIES::cBuffers) from 1 to 3 then the problem goes away. The original DecideBufferSize override is below and is similar to lots of other sample DirectShow filter code.
What is a robust policy for setting the number of requested buffers in a DirectShow filter (transform or otherwise)? Is code that requests one buffer out of date for modern requirements? Is my problem too few buffers or is increasing the number of buffers masking a different problem?
HRESULT MyFilter::DecideBufferSize(IMemAllocator *pAlloc, ALLOCATOR_PROPERTIES *pProp)
{
    AM_MEDIA_TYPE mt;
    HRESULT hr = m_pOutput->ConnectionMediaType(&mt);
    if (FAILED(hr)) {
        return hr;
    }
    BITMAPINFOHEADER * const pbmi = GetBitmapInfoHeader(mt);
    pProp->cbBuffer = DIBSIZE(*pbmi);
    if (pProp->cbAlign == 0) {
        pProp->cbAlign = 1;
    }
    if (pProp->cBuffers == 0) {
        pProp->cBuffers = 3;
    }
    // Release the format block.
    FreeMediaType(mt);

    // Set allocator properties.
    ALLOCATOR_PROPERTIES Actual;
    hr = pAlloc->SetProperties(pProp, &Actual);
    if (FAILED(hr)) {
        return hr;
    }
    // Even when it succeeds, check the actual result.
    if (pProp->cbBuffer > Actual.cbBuffer) {
        return E_FAIL;
    }
    return S_OK;
}

There is no specific policy on the number of buffers, but you should be aware that a fixed buffer count is the mechanism that throttles streaming: when all buffers are in use, a request for another buffer blocks execution until one becomes available.
That is, if your code holds buffer references for some purpose, you should allocate enough buffers so that you don't deadlock yourself. For example, if you keep a reference to the last media sample internally (say, to be able to re-send it) and you still want to be able to deliver other media samples, then you need at least two buffers on the allocator.
The output pin is typically responsible for choosing and setting up the allocator, and the input pin might need to check and update the properties if/when it is notified which allocator is to be used. With in-place transformation filters, where the allocators are shared, you might want an additional check to make sure the requirements are met.
The DMO Wrapper Filter uses (at least sometimes) allocators with only one buffer and is still in good standing.
With audio you normally have more buffers, because you queue data for playback.
If you have a reference leak in your code and you don't release media sample pointers, your streaming might deadlock because of it.
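As a minimal illustration of that advice (a sketch only, not necessarily the fix for the asker's hang), a DecideBufferSize along these lines requests at least two buffers and then verifies both the size and the count the allocator actually granted:

HRESULT MyFilter::DecideBufferSize(IMemAllocator *pAlloc, ALLOCATOR_PROPERTIES *pProp)
{
    AM_MEDIA_TYPE mt;
    HRESULT hr = m_pOutput->ConnectionMediaType(&mt);
    if (FAILED(hr)) {
        return hr;
    }
    BITMAPINFOHEADER * const pbmi = GetBitmapInfoHeader(mt);
    pProp->cbBuffer = DIBSIZE(*pbmi);
    FreeMediaType(mt);

    if (pProp->cbAlign == 0) {
        pProp->cbAlign = 1;
    }
    // Ask for at least two buffers so one sample can be held (downstream or
    // internally) while another one is being filled.
    if (pProp->cBuffers < 2) {
        pProp->cBuffers = 2;
    }

    ALLOCATOR_PROPERTIES Actual;
    hr = pAlloc->SetProperties(pProp, &Actual);
    if (FAILED(hr)) {
        return hr;
    }
    // Check the granted count as well as the size; the allocator may give
    // fewer buffers than requested.
    if (Actual.cbBuffer < pProp->cbBuffer || Actual.cBuffers < pProp->cBuffers) {
        return E_FAIL;
    }
    return S_OK;
}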

Related

How to send multipart messages using libnl and generic netlink?

I'm trying to send a relatively big string (6 KB) through libnl and generic netlink; however, I'm receiving error -5 (NLE_NOMEM) from nla_put_string in the process. I've done a lot of research but I didn't find any information about these two questions:
What's the maximum string size supported by generic netlink and libnl nla_put_string function?
How to use the multipart mechanism of generic netlink to break this string into smaller parts, send them, and reassemble it on the kernel side?
If there is a place to study this subject, I'd appreciate a pointer.
How to use the multipart mechanism of generic netlink to break this string into smaller parts, send them, and reassemble it on the kernel side?
Netlink's Multipart feature "might" help you transmit an already fragmented string, but it won't help you with the actual string fragmentation; that's your job. Multipart is a means to transmit several small correlated objects through several packets, not one big object. In general, Netlink as a whole is designed on the assumption that any atomic piece of data you want to send fits in a single packet. I would agree with the notion that 6 KB worth of string is a bit of an oddball.
In actuality, Multipart is a rather ill-defined gimmick in my opinion. The problem is that the kernel doesn't actually handle it in any generic capacity; if you look at all the NLMSG_DONE usage instances, you will notice not only that it is very rarely read (most of them are writes), but also that it's not the Netlink code but rather some specific protocol doing it for some static (i.e. private) operation. In other words, the semantics of NLMSG_DONE are given by you, not by the kernel. Linux will not save you any work if you choose to use it.
On the other hand, libnl-genl-3 does appear to perform some automatic juggling with the Multipart flags (NLMSG_DONE and NLM_F_MULTI), but that only applies when you're sending something from Kernelspace to Userspace, and on top of that, even the library itself admits that it doesn't really work.
Also, NLMSG_DONE is supposed to be placed in the "type" Netlink header field, not in the "flags" field. This is baffling to me, because Generic Netlink stores the family identifier in type, so it doesn't look like there's a way to simultaneously tell Netlink that the message belongs to you, AND that it's supposed to end some data stream. Unless I'm missing something important, Multipart and Generic Netlink are incompatible with each other.
I would therefore recommend implementing your own message control if necessary and forget about Multipart.
What's the maximum string size supported by generic netlink and libnl nla_put_string function?
It's not a constant. nlmsg_alloc() reserves getpagesize() bytes per packet by default. You can tweak this default with nlmsg_set_default_size(), or, more to the point, you can override it with nlmsg_alloc_size().
Then you'd have to query the actual allocated size (because it's not guaranteed to be what you requested) and build from there. To get the available payload you'd have to subtract the Netlink header length, the Generic Header length and the Attribute Header lengths for any attributes you want to add. Also the user header length, if you have one. You would also have to align all these components because their sizeof is not necessarily their actual size (example).
All that said, the kernel will still reject packets which exceed the page size, so even if you specify a custom size you will still need to fragment your string.
So really, just forget all of the above. Just fragment the string to something like getpagesize() / 2 or whatever, and send it in separate chunks.
This is the general idea:
static void do_request(struct nl_sock *sk, int fam, char const *string)
{
    struct nl_msg *msg;

    msg = nlmsg_alloc();
    genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, fam,
                0, 0, DOC_EXMPL_C_ECHO, 1);
    nla_put_string(msg, DOC_EXMPL_A_MSG, string);
    nl_send_auto(sk, msg);
    nlmsg_free(msg);
}

int main(int argc, char **argv)
{
    struct nl_sock *sk;
    int fam;

    sk = nl_socket_alloc();
    genl_connect(sk);
    fam = genl_ctrl_resolve(sk, FAMILY_NAME);

    do_request(sk, fam, "I'm sending a string.");
    do_request(sk, fam, "Let's pretend I'm biiiiiig.");
    do_request(sk, fam, "Look at me, I'm so big.");
    do_request(sk, fam, "But I'm already fragmented, so it's ok.");

    nl_close(sk);
    nl_socket_free(sk);
    return 0;
}
I left a full sandbox in my Dropbox. See the README. (Tested in kernel 5.4.0-37-generic.)

How to best organize constant buffers

I'm having some trouble wrapping my head around how to organize the constant buffers in a very basic D3D11 engine I'm making.
My main question is: Where does the biggest performance hit take place? When using Map/Unmap to update buffer data or when binding the cbuffers themselves?
At the moment, I'm deciding between the following two implementations for a sort of "shader-wrapper" class:
Holding an array of 14 ID3D11Buffer*s
class VertexShader
{
    ...
public:
    void Bind(ID3D11DeviceContext* context)
    {
        // Bind all 14 buffers at once
        context->VSSetConstantBuffers(0, 14, &buffers[0]);
        context->VSSetShader(pVS, nullptr, 0);
    }
    // Set the data for a buffer in a particular slot
    void SetData(ID3D11DeviceContext* context, UINT slot, size_t size, const void* pData)
    {
        D3D11_MAPPED_SUBRESOURCE mappedBuffer = {};
        context->Map(buffers[slot], 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedBuffer);
        memcpy(mappedBuffer.pData, pData, size);
        context->Unmap(buffers[slot], 0);
    }
private:
    ID3D11Buffer* buffers[14];
    ID3D11VertexShader* pVS;
};
This approach would have the shader bind all the cbuffers in a single batch of 14. If the shader has cbuffers registered to b0, b1, b3 the array would look like -> [cb|cb|0|cb|0|0|0|0|0|0|0|0|0|0]
Constant Buffer wrapper that knows how to bind itself
class VertexShader
{
    ...
public:
    void Bind(ID3D11DeviceContext* context)
    {
        // all the buffers bind themselves
        for (auto& cb : bufferMap)
            cb.second->Bind(context);
        context->VSSetShader(pVS, nullptr, 0);
    }
    // Set the data for a buffer with a particular ID
    void SetData(const std::string& id, size_t size, const void* pData)
    {
        // table lookup into bufferMap, then Map/Unmap
    }
private:
    std::unordered_map<std::string, ConstantBuffer*> bufferMap;
    ID3D11VertexShader* pVS;
};
This approach would hold "ConstantBuffers" in a hash table, each one would know what slot it's bound to and how to bind itself to the pipeline. I would have to call VSSetConstantBuffers() individually for each cbuffer since the ID3D11Buffer*s wouldn't be contiguous anymore, but the organization is friendlier and has a bit less wasted space.
How would you typically organize the relationship between CBuffers, Shaders, SRVs, etc? Not looking for a do-all solution, but some general advice and things to read more about from people hopefully more experienced than I am
Also, if @Chuck Walbourn sees this: I'm a fan of your work and am using DXTK/WICTextureLoader for this project!
Thanks.
Constant Buffers were a major feature of Direct3D 10, so one of the best talks on the subject was given way back at Gamefest 2007:
Windows to Reality: Getting the Most out of Direct3D 10 Graphics in Your Games
See also Why Can Updating Constant Buffers be so painfully slow? (NVIDIA)
The original intention was for CBs to be organized by frequency of update: something like one CB for values set 'per level', another for 'per frame', another for 'per object', another 'per pass', etc. Therefore the assumption is that if you changed any part of a CB, you were going to upload the whole thing. Bandwidth between the CPU and GPU is the real bottleneck here.
For this approach to be effective, you basically need to set up all your shaders to use the same scheme. This can be difficult to manage, especially when so many modern material systems are art-driven.
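As a rough sketch of that frequency-of-update split (the names PerFrameCB, PerObjectCB and DrawScene are made up for illustration, not taken from the talk or from DirectXTK; both buffers are assumed to be D3D11_USAGE_DYNAMIC with CPU write access):

#include <d3d11.h>
#include <cstring>
#include <vector>

// Hypothetical layout: one small cbuffer per update frequency.
// Each struct must be a multiple of 16 bytes, matching the HLSL cbuffer.
struct PerFrameCB   // register(b0): written once per frame
{
    float view[16];
    float projection[16];
};
struct PerObjectCB  // register(b1): written once per draw call
{
    float world[16];
};

void DrawScene(ID3D11DeviceContext* ctx,
               ID3D11Buffer* perFrame, ID3D11Buffer* perObject,
               const PerFrameCB& frameData,
               const std::vector<PerObjectCB>& objects)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};

    // Upload the per-frame block once...
    ctx->Map(perFrame, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
    std::memcpy(mapped.pData, &frameData, sizeof(frameData));
    ctx->Unmap(perFrame, 0);

    ID3D11Buffer* cbs[2] = { perFrame, perObject };
    ctx->VSSetConstantBuffers(0, 2, cbs);

    // ...and only the small per-object block inside the draw loop.
    for (const PerObjectCB& obj : objects)
    {
        ctx->Map(perObject, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
        std::memcpy(mapped.pData, &obj, sizeof(obj));
        ctx->Unmap(perObject, 0);
        // ctx->DrawIndexed(...);  // issue the draw using b0 + b1
    }
}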
Another approach to CBs is to use them like a dynamic VB for particle submission: you fill one up with short-lived constants, submit work, and then reset the thing each frame. This is basically what people do for DirectX 12 in many cases. The problem is that without the ability to update only parts of CBs, it's too slow. The 'partial constant buffer updates and offsets' optional features in DirectX 11.1 were a way to make this work. That said, these features are not supported on Windows 7 and are 'optional' on newer versions of Windows, so you have to support two codepaths to use them.
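If you do go down the DirectX 11.1 route, a sketch of the feature check and the offset-based bind might look like this (BindObjectWindow is a hypothetical helper; offsets and counts for VSSetConstantBuffers1 are expressed in shader constants of 16 bytes each and must be multiples of 16, i.e. 256-byte-aligned windows):

#include <d3d11_1.h>

// Sketch: bind a 256-byte window into one large per-draw constant buffer,
// assuming the device supports the 11.1 constant-buffer offsetting feature.
bool BindObjectWindow(ID3D11Device* dev, ID3D11DeviceContext* ctx,
                      ID3D11Buffer* bigCB, UINT objectIndex)
{
    D3D11_FEATURE_DATA_D3D11_OPTIONS opts = {};
    if (FAILED(dev->CheckFeatureSupport(D3D11_FEATURE_D3D11_OPTIONS,
                                        &opts, sizeof(opts))) ||
        !opts.ConstantBufferOffsetting)
    {
        return false;  // fall back to one small CB per draw
    }

    ID3D11DeviceContext1* ctx1 = nullptr;
    if (FAILED(ctx->QueryInterface(IID_PPV_ARGS(&ctx1))))
        return false;

    // Offsets/sizes are in shader constants (16 bytes each) and must be
    // multiples of 16 constants, i.e. 256-byte aligned windows.
    UINT first = objectIndex * 16;  // object N starts at byte N * 256
    UINT num   = 16;                // expose 256 bytes to register b1
    ID3D11Buffer* cbs[1] = { bigCB };
    ctx1->VSSetConstantBuffers1(1, 1, cbs, &first, &num);
    ctx1->Release();
    return true;
}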
TL;DR: you can technically have a lot of CBs bound at once, but the key thing is to keep the individual size small for the ones that change often. Also assume any change to a CB is going to require updating the whole thing to the GPU every time you do change it.

How to properly use MIDIReadProc?

According to Apple's docs:
Because your MIDIReadProc callback is invoked from a separate thread, be aware of the synchronization issues when using data provided by this callback.
Does this mean, use @synchronized to do thread blocking for safety?
Or does this literally mean synchronization timing issues may happen?
I am currently trying to read a MIDI file and use a MIDIReadProc to trigger the note-on/note-off of a software synth based on MIDI events. I need this to be extremely reliable and perfectly in time. Right now, I am noticing that when I consume these MIDI events and write the audio to a buffer (all done from the MIDIReadProc), the timing is extremely sloppy and not sounding right at all. So I would like to know: what is the "proper" way to consume MIDI events from a MIDIReadProc?
Also, is a MIDIReadProc the only option for consuming MIDI events from a MIDI file?
Is there another option as far as setting up a virtual endpoint that could be directly consumed by my synthesizer? If so, how does that work exactly?
If you presume a function of this format to be the midiReadProc,
void midiReadProc(const MIDIPacketList *packetList,
                  void *readProcRefCon,
                  void *srcConnRefCon)
{
    MIDIPacket *packet = (MIDIPacket *)packetList->packet;
    int count = packetList->numPackets;

    for (int k = 0; k < count; k++) {
        Byte midiStatus  = packet->data[0];
        Byte midiChannel = midiStatus & 0x0F;
        Byte midiCommand = midiStatus >> 4;
        // parse MIDI messages, extract relevant information and pass it to the controller
        // controller must be visible from the midiReadProc
        packet = MIDIPacketNext(packet);
    }
}
the MIDI client has to be declared in the controller; interpreted MIDI events get stored into the controller from the MIDI callback and read by audioRenderCallback() on each audio render cycle. This way you can limit timing imprecision to the length of the audio buffer, which you can negotiate during AudioUnit setup to be as short as the system allows.
A controller can be an @interface myMidiSynthController : NSViewController you define, consisting of a matrix of MIDI channels and a pre-determined maximum polyphony per channel, among other relevant data such as interface elements, phase accumulators for each active voice, the AudioComponentInstance, etc. It would be wrong to resize the controller based on the midiReadProc() input. RAM is cheap nowadays.
I'm using such MIDI callbacks for processing live input from MIDI devices. Concerning playback of MIDI files: if you want to process streams or files of arbitrary complexity, you may also run into surprises. The MIDI standard itself has timing features, which work as well as MIDI hardware allows. Once you read an entire file into memory, you can translate your data into whatever you want and use your own code for controlling sound synthesis.
Please observe: do not use any code that would block the audio render thread (i.e. inside audioRenderCallback()), or do memory management on it.
You could use AVAudioEngine.musicSequence and prepare your audio unit graph, then use the MusicSequence API to load your GM file. That way you don't need to do the timing yourself. Note that I have not done this myself so far, but I understand that in theory it should work like this.
After you instantiate your synthesizer audio unit, you attach and connect it to the AVAudioEngine graph.
Does this mean, use @synchronized to do thread blocking for safety?
The opposite of what you've said is true: you should certainly not lock in a realtime thread. The @synchronized directive will block if the resource is already locked. You may consider using lock-free queues for realtime threads. See also Four common mistakes in audio development.
If you have to use CoreMIDI and MIDIReadProc, you can send MIDI commands to the synthesizer audio unit by calling MusicDeviceMIDIEvent right from your callback.
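A minimal sketch of that last suggestion (MyMIDIReadProc is a hypothetical callback; it assumes the synthesizer AudioUnit was passed as the refCon when the input port was created, and that the packets carry simple 3-byte channel-voice messages — real code must walk packet->data properly for multi-event and SysEx packets):

#include <CoreMIDI/CoreMIDI.h>
#include <AudioToolbox/AudioToolbox.h>

// Forward incoming MIDI bytes straight to a synthesizer AudioUnit.
// The AudioUnit is passed as readProcRefCon when the input port is created.
static void MyMIDIReadProc(const MIDIPacketList *pktList,
                           void *readProcRefCon, void *srcConnRefCon)
{
    AudioUnit synth = (AudioUnit)readProcRefCon;
    const MIDIPacket *packet = &pktList->packet[0];

    for (UInt32 i = 0; i < pktList->numPackets; i++) {
        Byte status = packet->data[0];
        Byte data1  = (packet->length > 1) ? packet->data[1] : 0;
        Byte data2  = (packet->length > 2) ? packet->data[2] : 0;
        // Schedule the event for the start of the next render cycle.
        MusicDeviceMIDIEvent(synth, status, data1, data2, 0);
        packet = MIDIPacketNext(packet);
    }
}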

Native way to get the feature report descriptor of HID device?

We have some HID devices (touch digitizers) that communicate with an internal R&D tool. This tool parses the raw feature reports from the devices to draw the touch reports along with some additional data that are present in the raw feature report but filtered out by the HID driver of Windows 7 (e.g., pressure data is not present in WM_TOUCH messages).
However, we have started working with some devices that may have different firmware variants, and thus do not share the same ordering or byte length of the fields, and I need to modify our R&D tool so that it adapts transparently to all the devices.
The devices come from the same manufacturer (ourselves) and share the same device info, so using these fields to differentiate between the different firmwares is not an option. What I would like to do is to get the HID feature report descriptor sent by the device and update dynamically our feature report parsing method based on this information.
However, I didn't manage to find the correct method to call in order to get this descriptor when browsing the Windows API. What I have found so far is the Raw Input page on MSDN, but I'm not sure what to do next. Can I find the required information in the RID_DEVICE_HID structure? Or do I need to call a completely different API?
Thanks in advance for your help!
OK, finally I've got something (almost completely) functional. As suggested by mcoill, I used the HidP_xxx() family of functions, but it needs a little bit of data preparation first.
I based my solution on this example code that targets USB joysticks and adapted it to touch digitizer devices.
If someone else also gets confused by the online doc, here are the required steps involved in the process:
registering the application for a Raw Input device at launch.
This is done by calling the function RegisterRawInputDevices(&Rid, 1, sizeof(Rid)), where Rid is a RAWINPUTDEVICE with the following properties set (in order to get a touch digitizer):
Rid.usUsage = 0x04;
Rid.usUsagePage = 0x0d;
Rid.dwFlags = RIDEV_INPUT_SINK;
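Pulled together, the registration might look like the following sketch (hwnd is a hypothetical window handle; note that RIDEV_INPUT_SINK requires hwndTarget to be set):
RAWINPUTDEVICE Rid = {};
Rid.usUsagePage = 0x0d;             // HID_USAGE_PAGE_DIGITIZER
Rid.usUsage     = 0x04;             // touch screen
Rid.dwFlags     = RIDEV_INPUT_SINK; // receive input even when not in the foreground
Rid.hwndTarget  = hwnd;             // required when RIDEV_INPUT_SINK is used
if (!RegisterRawInputDevices(&Rid, 1, sizeof(Rid))) {
    // Registration failed; call GetLastError() for details.
}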
registering a callback OnInput(LPARAM lParam) for WM_INPUT events, since the Rid device will generate this type of event;
the OnInput(LPARAM lParam) method will get the data from this event in two steps:
// Parse the raw input header to read its size.
UINT bufferSize;
GetRawInputData((HRAWINPUT)lParam, RID_INPUT, NULL, &bufferSize, sizeof(RAWINPUTHEADER));

// Allocate memory for the raw input data and retrieve it.
PRAWINPUT rawInput = (PRAWINPUT)HeapAlloc(GetProcessHeap(), 0, bufferSize);
GetRawInputData((HRAWINPUT)lParam, RID_INPUT, rawInput /* NOT NULL */, &bufferSize, sizeof(RAWINPUTHEADER));
it then calls a parsing method that creates the HIDP_PREPARSED_DATA structure required by the lookup functions:
// Again, read the data size, allocate then retrieve
GetRawInputDeviceInfo(rawInput->header.hDevice, RIDI_PREPARSEDDATA, NULL, &bufferSize);
PHIDP_PREPARSED_DATA preparsedData = (PHIDP_PREPARSED_DATA)HeapAlloc(heap, 0, bufferSize);
GetRawInputDeviceInfo(rawInput->header.hDevice, RIDI_PREPARSEDDATA, preparsedData, &bufferSize);
The preparsed data is split into capabilities:
// Create a structure that will hold the values.
HIDP_CAPS caps;
HidP_GetCaps(preparsedData, &caps);
USHORT capsLength = caps.NumberInputValueCaps;
PHIDP_VALUE_CAPS valueCaps = (PHIDP_VALUE_CAPS)HeapAlloc(heap, 0, capsLength * sizeof(HIDP_VALUE_CAPS));
HidP_GetValueCaps(HidP_Input, valueCaps, &capsLength, preparsedData);
And capabilities can be asked for their value:
// Read a sample value.
ULONG value;
HidP_GetUsageValue(HidP_Input, valueCaps[i].UsagePage, 0, valueCaps[i].Range.UsageMin, &value, preparsedData, (PCHAR)rawInput->data.hid.bRawData, rawInput->data.hid.dwSizeHid);
Wouldn't HidD_GetPreparsedData(...), HidP_GetValueCaps(HidP_Feature, ...) and their ilk give you enough information without having to get the raw feature report?
HIDClass Support Routines on MSDN
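For completeness, a sketch of that HidD_/HidP_ route (DumpFeatureCaps is a hypothetical helper; it assumes you already have a device HANDLE from CreateFile() on the device interface path — with Raw Input you would instead obtain the preparsed data via GetRawInputDeviceInfo(RIDI_PREPARSEDDATA) as shown above):

#include <windows.h>
#include <hidsdi.h>   // link with hid.lib
#include <hidpi.h>

// Inspect the feature-report layout of an already-opened HID device.
void DumpFeatureCaps(HANDLE hDevice)
{
    PHIDP_PREPARSED_DATA preparsed = NULL;
    if (!HidD_GetPreparsedData(hDevice, &preparsed))
        return;

    HIDP_CAPS caps;
    if (HidP_GetCaps(preparsed, &caps) == HIDP_STATUS_SUCCESS)
    {
        USHORT count = caps.NumberFeatureValueCaps;
        PHIDP_VALUE_CAPS valueCaps =
            (PHIDP_VALUE_CAPS)HeapAlloc(GetProcessHeap(), 0,
                                        count * sizeof(HIDP_VALUE_CAPS));
        if (valueCaps &&
            HidP_GetValueCaps(HidP_Feature, valueCaps, &count, preparsed) == HIDP_STATUS_SUCCESS)
        {
            for (USHORT i = 0; i < count; i++)
            {
                // valueCaps[i].ReportID, valueCaps[i].UsagePage,
                // valueCaps[i].Range.UsageMin and valueCaps[i].BitSize tell
                // you which field lives where, regardless of firmware variant.
            }
        }
        if (valueCaps)
            HeapFree(GetProcessHeap(), 0, valueCaps);
    }
    HidD_FreePreparsedData(preparsed);
}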

How to use AudioConverterFillComplexBuffer and its callback?

I need a step by step walkthrough on how to use AudioConverterFillComplexBuffer and its callback. No, don't tell me to read the Apple docs; I do everything they say and the conversion always fails. No, don't tell me to go look for examples of AudioConverterFillComplexBuffer and its callback in use; I've duplicated about a dozen such examples, both line for line and modified, and the conversion always fails. No, there isn't any problem with the input data. No, it isn't an endian issue. No, the problem isn't my version of OS X.
The problem is that I don't understand how AudioConverterFillComplexBuffer works, so I don't know what I'm doing wrong. And nothing out there is helping me understand, because it seems like nobody on Earth really understands how AudioConverterFillComplexBuffer works, either. From the people who actually use it (I spy cargo cult programming in their code) to even the authors of Learning Core Audio and/or Apple itself (http://stackoverflow.com/questions/13604612/core-audio-how-can-one-packet-one-byte-when-clearly-one-packet-4-bytes).
This isn't just a problem for me, it's a problem for anybody who wants to program high-performance audio on the Mac platform. Threadbare documentation that's apparently wrong and examples that don't work are no fun.
Once again, to be clear: I NEED A STEP BY STEP WALKTHROUGH ON HOW TO USE AudioConverterFillComplexBuffer plus its callback, and so does the entire Mac developer community.
This is a very old question but I think it is still relevant. I've spent a few days fighting this and have finally achieved a successful conversion. I'm certainly no expert but I'll outline my understanding of how it works. Note I'm using Swift, which I'm also just learning.
Here are the main function arguments:
inAudioConverter: AudioConverterRef: This one is simple enough, just pass in a previously created AudioConverterRef.
inInputDataProc: AudioConverterComplexInputDataProc: The very complex callback. We'll come back to this.
inInputDataProcUserData: UnsafeMutableRawPointer?: This is a reference to whatever data you may need to provide to the callback function. Important because even in Swift the callback can't inherit context. E.g. you may need to access an AudioFileID or keep track of the number of packets read so far.
ioOutputDataPacketSize: UnsafeMutablePointer<UInt32>: This one is a little misleading. The name implies it's the packet size but reading the documentation we learn it's the total number of packets expected for the output format. You can calculate this as outPacketCount = frameCount / outStreamDescription.mFramesPerPacket.
outOutputData: UnsafeMutablePointer<AudioBufferList>: This is an audio buffer list which you need to have already initialized with enough space to hold the expected output data. The size can be calculated as byteSize = outPacketCount * outMaxPacketSize.
outPacketDescription: UnsafeMutablePointer<AudioStreamPacketDescription>?: This is optional. If you need packet descriptions, pass in a block of memory the size of outPacketCount * sizeof(AudioStreamPacketDescription).
As the converter runs, it will repeatedly call the callback function to request more data to convert. The main job of the callback is simply to read the requested number of packets from the source data. The converter will then convert the packets to the output format and fill the output buffer. Here are the arguments for the callback:
inAudioConverter: AudioConverterRef: The audio converter again. You probably won't need to use this.
ioNumberDataPackets: UnsafeMutablePointer<UInt32>: The number of packets to read. After reading, you must set this to the number of packets actually read (which may be less than the number requested if we reached the end).
ioData: UnsafeMutablePointer<AudioBufferList>: An AudioBufferList which is already configured except for the actual data. You need to initialise ioData.mBuffers.mData with enough capacity to hold the expected number of packets, i.e. ioNumberDataPackets * inMaxPacketSize. Set the value of ioData.mBuffers.mDataByteSize to match.
outDataPacketDescription: UnsafeMutablePointer<UnsafeMutablePointer<AudioStreamPacketDescription>?>?: Depending on the formats used, the converter may need to keep track of packet descriptions. You need to initialise this with enough capacity to hold the expected number of packet descriptions.
inUserData: UnsafeMutableRawPointer?: The user data that you provided to the converter.
So, to start you need to:
Have sufficient information about your input and output data, namely the number of frames and maximum packet sizes.
Initialise an AudioBufferList with sufficient capacity to hold the output data.
Call AudioConverterFillComplexBuffer.
And on each run of the callback you need to:
Initialise ioData with sufficient capacity to store ioNumberDataPackets of source data.
Initialise outDataPacketDescription with sufficient capacity to store ioNumberDataPackets of AudioStreamPacketDescriptions.
Fill the buffer with source packets.
Write the packet descriptions.
Set ioNumberDataPackets to the number of packets actually read.
return noErr if successful.
Here's an example where I read the data from an AudioFileID:
var converter: AudioConverterRef?
// User data holds an AudioFileID, input max packet size, and a count of packets read
var uData = (fRef, maxPacketSize, UnsafeMutablePointer<Int64>.allocate(capacity: 1))

err = AudioConverterNew(&inStreamDesc, &outStreamDesc, &converter)
err = AudioConverterFillComplexBuffer(converter!, { _, ioNumberDataPackets, ioData, outDataPacketDescription, inUserData in
    let uData = inUserData!.load(as: (AudioFileID, UInt32, UnsafeMutablePointer<Int64>).self)
    ioData.pointee.mBuffers.mDataByteSize = uData.1
    ioData.pointee.mBuffers.mData = UnsafeMutableRawPointer.allocate(byteCount: Int(uData.1), alignment: 1)
    outDataPacketDescription?.pointee = UnsafeMutablePointer<AudioStreamPacketDescription>.allocate(capacity: Int(ioNumberDataPackets.pointee))
    let err = AudioFileReadPacketData(uData.0, false, &ioData.pointee.mBuffers.mDataByteSize, outDataPacketDescription?.pointee, uData.2.pointee, ioNumberDataPackets, ioData.pointee.mBuffers.mData)
    uData.2.pointee += Int64(ioNumberDataPackets.pointee)
    return err
}, &uData, &numPackets, &bufferList, nil)
Again, I'm no expert, this is just what I've learned by trial and error.
