Can you re-use buffers with Windows wave audio input? - windows

I'm using the Windows multimedia APIs to record and process wave audio (waveInOpen and friends). I'd like to use a small number of buffers in a round robin fashion.
I know that you're supposed to use waveInPrepareHeader before adding a buffer to the device, and that you're supposed to call waveInUnprepareHeader after the wave device has "returned the buffer to the application" and before you deallocate it.
My question is, do I have to unprepare and re-prepare in order to re-use a buffer? Or can I just add a previously used buffer back to the device?
Also, does it matter what thread I do this on? I'm using the callback function, which seems to be called on a worker thread that belongs to the audio system. Can I call waveInUnprepareHeader, waveInPrepareHeader, and waveInAddBuffer on that thread, during the callback?

Yes, my experience has been you need to call prepare and unprepare every time. From memory, it returns an error if you try to reuse the same one.
And you typically call the prepare and unprepare on whatever thread you are handling the callbacks on.

When you create the buffers, call waveInPrepareHeader. Then you can simply set the prepared flag before you call waveInAddBuffer on a buffer that was returned from the device.
pHdr->dwFlags = WHDR_PREPARED;
You can do this on the callback thread (or in the message handler).

Related

How to use av_frame_unref in ffmpeg?

Can I create one AVFrame and use it for decoding of all frames? So I call av_frame_alloc() once, decode all frames, and then call av_frame_unref() once. Or I should call av_frame_alloc / av_frame_unref for each frame?
When exactly I should call av_frame_unref, before decoding or after decoding?
A.
av_frame_alloc()
av_frame_unref()
(decoding...)
B. Or this variant:
av_frame_alloc()
(decoding...)
av_frame_unref()
You could only use one AVFrame struct for the entire decoding/encoding process, by calling av_frame_alloc() once.
av_frame_unref() is only needed when you decide to enable reference counting for your encoding/decoding context, called for each frame.
Use av_frame_free() to free the frame struct and all its buffers at the end of your encoding/decoding process.
See ffmpeg's official examples for how to use them:
demuxing_decoding
Reference counting is a generic process where a dynamically allocated source is shared (e.g. among multiple threads). To prevent freeing the source by a thread and put others offside position, this mechanism is used, often implemented with a simple atomic counter associated with the object.
The threads which accesses the source calls addref, that typically will increase the counter by 1 and when the thread is done using it calls unref and that decreases the counter (in case of ffmpeg these are av_frame_ref and av_frame_unref if I'm not mistaken).
This ensures that the source remains valid until the thread using it is done with it.
Eventually, when the counter reaches zero (no user left), the source is freed safely.

How to avoid data copy in NDIS filter driver

I am working on a NDIS filter driver which actually copies data from NET_BUFFERs to driver allocated buffers in the Send path and push these driver allocated buffers into a internal queue. Later on, the data is copied again from these driver allocated buffers in the queue to IRP buffers. I want to avoid this copy of data.
In Linux, we can create a clone of skbuff and the cloned skbuff can be queued for later use. Is there a similar option available in Windows as well? If there a way to clone the NET_BUFFER, we can simply avoid the first copy that is happening from NET_BUFFER to driver allocated memory buffers.
If there exists a way to achieve zero copy from the NetBufferLists to IRP buffers, then it would really be an ideal solution. It would be really helpful if someone can suggest a better solution to avoid the copies in the send path.
It's not clear to me why you need to copy the NB (NET_BUFFER) at all. If you plan to enqueue the NB for processing on a different thread, you can do that with the original NB — no need to copy anything.
The only reason here that you'd need to copy the payload is if you plan to hang onto the buffer for a while (say, more than 1000ms). At a high level, the payload associated with an NB belongs to the application. NDIS permits you to queue the NB, do some processing, drop it, modify it, etc. But (depending on socket options) the application may be stuck until its buffer is completed back to it. So you cannot hang onto the original NB or its payload indefinitely. If you're going to do something that takes a long time then you should allocate a deep copy of all the datastructures you need (the NBL, the NB, the MDL, and the payload buffer) and return the originals back to the application.
If you're stuffing the packet payload into an IRP so that a usermode process can contemplate the payload, then you really do need 1 copy. The reason is that kernel can't trust any usermode process to do anything within a particular time budget. Imagine, for example, that the system is going to hibernate. The kernel duly suspends all usermode processes, then waits for each device to go a low power state. But the network card can't go to low power, because the datapath won't pause because some packet is stuck in your filter driver, waiting for the (now suspended) usermode process to reply. Thus, you protect yourself by detaching the IO to usermode with the IO over the network device: make a copy.
But if all you're doing is shipping the packet off to another kernel device that does (say) encryption, then you can assume that the encryption device assures a reasonable time budget, so it may be safe to give the original packet payload to it.

How to properly use MIDIReadProc?

According to apple's docs it says:
Because your MIDIReadProc callback is invoked from a separate thread,
be aware of the synchronization issues when using data provided by
this callback.
Does this mean, use #synchronize to do thread blocking for safety?
Or does this literally mean synchronization timing issues may happen?
I am currently trying to read a midi file, and use a MIDIReadProc to trigger the note-on / note-off of a software synth based off of midi events. I need this to be extremely reliable and perfectly in-time. Right now, I am noticing that when I consume these midi events and write the audio to a buffer (all done from the MIDIReadProc), the timing is extremely sloppy and not sounding right at all. So I would like to know, what is the "proper" way to consume midi events from a MIDIReadProc?
Also, is a MIDIReadProc the only option for consuming midi events from a midi file?
Is there another option as far as setting up a virtual endpoint that could be directly consumed by my synthesizer? If so, how does that work exactly?
If you presume a function of this format to be the midiReadProc,
void midiReadProc(const MIDIPacketList *packetList,
void* readProcRefCon,
void* srcConnRefCon)
{
MIDIPacket *packet = (MIDIPacket*)packetList->packet;
int count = packetList->numPackets;
for (int k=0; k<count; k++) {
Byte midiStatus = packet->data[0];
Byte midiChannel= midiStatus & 0x0F;
Byte midiCommand = midiStatus >> 4;
//parse MIDI messages, extract relevant information and pass it to the controller
//controller must be visible from the midiReadProc
}
packet = MIDIPacketNext(packet);
}
the MIDI client has to be declared in the controller, interpreted MIDI events get stored into the controller from MIDI callback and read by the audioRenderCallback() on each audio render cycle. This way you can minimize timing imprecisions to the
length of the audio buffer, which you can negotiate during AudioUnit setup to be as short as the system allows for.
A controller can be a #interface myMidiSynthController : NSViewController you define, consisting of a matrix of MIDI channels and a pre-determined maximum-polyphony-per-channel, among other relevant data such as interface elements, phase accumulators for each active voice, AudioComponentInstance, etc... It would be wrong to resize the controller based on the midiReadProc() input. RAM is cheap nowadays.
I'm using such MIDI callbacks for processing live input from MIDI devices. Concerning playback of MIDI files, if you
want to process streams or files of arbitrary complexity, you may also run into surprises. MIDI standard itself
has timing features, which work as good as MIDI hardware allows for. Once you read an entire file into the memory, you can
translate your data into whatever you want and use your own code for controlling sound synthesis.
Please, observe not to use any code which would block the audio render thread (i.e. inside audioRenderCallback()), or would do memory management on it.
You could use AVAudioEngine.musicSequence and prepare your audio unit graph. Then use the MusicSequence API to load your GM file. Like this you don’t need to do the timing by yourself. Note I have not done this myself so far but I understand in theory it should work like this.
After you instantiate your synthesizer audio unit, you attach and connect it to the AVAudioEngine graph.
Does this mean, use #synchronize to do thread blocking for safety?
The opposite of what you’ve said is true: You should certainly not lock in a realtime thread. The #synchronized directive will lock if the resource is already locked. You may consider to use lock-free queues for realtime threads. See also Four common mistakes in audio development.
If you have to use CoreMIDI and MIDIReadProc, you can send MIDI commands to the synthesizer audio unit by calling MusicDeviceMIDIEvent right from your callback.

Linux Driver and API architecture for a data acquisition device

We're trying to write a driver/API for a custom data acquisition device, which captures several "channels" of data. For the sake of discussion, let's assume this is a several-channel video capture device. The device is connected to the system via an 8xPCIe Gen-1 link, which has a theoretical throughput of 16Gbps. Our actual data rate will be around 2.8Gbps (~350MB/sec).
Because of the data rate requirement, we think we have to be careful about the driver/API architecture. We've already implemented a descriptor based DMA mechanism and the associated driver. For example, we can start a DMA transaction for 256KB from the device and it completes successfully. However, in this implementation we're only capturing the data in the kernel driver, and then dropping it and we aren't streaming the data to the user-space at all. Essentially, this is just a small DMA test implementation.
We think we have to separate the problem into three sections: 1. Kernel driver 2. Userspace API 3. User Code
The acquisition device has a register in the PCIe address space which indicates whether there is data to read for any channel from the device. So, our kernel driver must poll for this bit-vector. When the kernel driver sees this bit set, it starts a DMA transaction. The user application however does not need to know about all these DMA transactions and data, until an entire chunk of data is ready (For example, assume that the device provides us with 16 lines of video data per transaction, but we need to notify the user only when the entire video frame is ready). We need to only transfer entire frames to the user application.
Here was our first attempt:
Our user-side API allows a user application to register a function callback for a "channel".
The user-side API has a "start" function, which can be called by the user application, which uses ioctl to send a start message to the kernel driver.
In the kernel driver, upon receiving the start message, we started a kernel thread, which continuously monitors the "data ready" bit-vector, and when it sees new data, copies it over to a driver-allocated (kmalloc) buffer. It keeps doing this until the size of the collected data reaches the "frame size".
At this point a custom linux SIGNAL (similar to SIGINT, SIGHUP, etc) is sent to the process which is running the driver. Our API catches this signal and then calls back the appropriate user callback function.
The user callback function calls a function in the API (transfer_data), which uses an ioctl call to send a userspace buffer address to the kernel, and the kernel completes the data transfer by doing a copy_to_user of the channel frame data to userspace.
All of the above is working OK, except that the performance is abysmal. We can only achieve about 2MB/sec of transfer rate. We need to completely re-write this and we're open to any suggestions or pointers to examples.
Other notes:
Unfortunately, we can not change anything in the hardware device. So we must poll for the "data-ready" bit and start DMA based on that bit.
Some people suggested to look at Infiniband drivers as a reference, but we're completely lost in that code.
You're probably way past this now, but if not here's my 2p.
It's hard to believe that your card can't generate interrupts when
it has transferred data. It's got a DMA engine, and it can handle
'descriptors', which are presumably elements of a scatter-gather
list. I'll assume that it can generate a PCIe 'interrupt'; YMMV.
Don't bother trawling the kernel for existing similar drivers. You
might get lucky, but I suspect not.
You need to write a blocking read, which you supply a large memory buffer to. The driver read op (a) gets gets a list of user pages for your user buffer and locks them in memory (get_user_pages); (b) creates a scatter list with pci_map_sg; (c) iterates through the list (for_each_sg); (d) for each entry writes the corresponding physical bus address and data length to the DMA controller as what I presume you're calling a 'descriptor'.
The card now has a list of descriptors which correspond to the physical bus addresses of your large user buffer. When data arrives at the card, it writes it directly into user space, into your user buffer, while your user-level read is still blocked. When it has finished the descriptor list, the card has to be able to interrupt, or it's useless. The driver responds to the interrupt and unblocks your user-level read.
And that's it. The details are nasty, of course, and poorly documented, but that should be the basic architecture. If you really haven't got interrupts you can set up a timer in the kernel to poll for completion of transfer, but if it is really a custom card you should get your money back.

Rs232 software flow control

I have a general question about Rs232 Software Flowcontrol (aka XOn/XOff)
The .Net implementation (and the nativ win32 api) bothe define a property called WriteTimeout / ReadTimeout, which is a time in ms after which a communication is considered to be overdue.
No my problem is this: If I send, lets say a 5 Byte string to the device I don't see any WriteTimeout, as expected. How is this implemented? Everything I find about Software flow control is that XOFF is to be set, when the recieve buffer is full; XOn when it is ready to recieve again.
But from the behavior I see, I would suspect, hat the device sends XON, after it has processed the 5-Byte information that I send, thus creating the information for windows to generate the corresponding events.
So when to send XON on a two-wire only RS232 implementation? Only if the buffer was full and to restart recieving; Or to signal, that we are "still ready" to receive after every chunk we processed?
How to implement?
Cheers & thx in advance!
Corelgott
Send an XON any time you are ready to receive data (your receive buffer is empty or nearly so). Send an XOFF any time you cannot accept more incoming data (your receive buffer is full or nearly so). The process is documented on the Wikipedia software flow control page.

Resources