Convert LPCM buffer to AAC for HTTP Live Streaming - macOS

I have an application that records audio from devices into a Float32 (LPCM) buffer.
However, according to the HTTP Live Streaming specification, LPCM needs to be encoded into an audio format such as AAC or MP3 before it can be used as a media segment for streaming. I have found some useful resources on how to convert an LPCM file to an AAC/MP3 file, but that is not exactly what I am looking for, since I want to convert a buffer rather than a file.
What are the main differences between converting an audio file and a raw audio buffer (LPCM, Float32)? Is the latter simpler?
My initial thought was to create a thread that would regularly fetch data from a ring buffer (where the raw audio is stored) and convert it to a valid audio format (either AAC or MP3).
Would it be more sensible to do the conversion immediately when the AudioBuffer is captured in the AURenderCallback, and thereby do away with the ring buffer?
Thanks for your help.

The Core Audio recording buffer length and the desired audio file length are rarely exactly the same. So it might be better to poll your circular/ring buffer (you know the sample rate, which tells you approximately how often it fills) to decouple the two rates, and convert the buffer (once it is sufficiently full) to a file at a later time. You can memory-map a raw audio file onto the buffer, but there may or may not be any performance difference between that and asynchronously writing a temp file.
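As a concrete starting point, here is a minimal sketch of that polling approach on macOS. It assumes interleaved Float32 input and a hypothetical rb_read() helper that drains your ring buffer; the AAC conversion is delegated to ExtAudioFile (which wraps an AudioConverter), writing an ADTS segment. Error handling is omitted.

```c
// Poll the ring buffer and convert the drained LPCM to an AAC (ADTS) segment.
// rb_read() is a hypothetical ring-buffer drain; everything else is Core Audio API.
#include <AudioToolbox/AudioToolbox.h>

extern UInt32 rb_read(Float32 *dst, UInt32 maxFrames);  // hypothetical: copies up to maxFrames frames

void write_aac_segment(CFURLRef segmentURL, Float64 sampleRate, UInt32 channels) {
    // Source format: interleaved native-endian Float32 LPCM, as produced by the render callback.
    AudioStreamBasicDescription pcm = {0};
    pcm.mSampleRate       = sampleRate;
    pcm.mFormatID         = kAudioFormatLinearPCM;
    pcm.mFormatFlags      = kAudioFormatFlagsNativeFloatPacked;
    pcm.mChannelsPerFrame = channels;
    pcm.mBitsPerChannel   = 32;
    pcm.mFramesPerPacket  = 1;
    pcm.mBytesPerFrame    = channels * sizeof(Float32);
    pcm.mBytesPerPacket   = pcm.mBytesPerFrame;

    // Destination format: AAC. Leaving the packet/byte fields zero lets
    // Core Audio fill in the codec-specific values.
    AudioStreamBasicDescription aac = {0};
    aac.mSampleRate       = sampleRate;
    aac.mFormatID         = kAudioFormatMPEG4AAC;
    aac.mChannelsPerFrame = channels;

    ExtAudioFileRef file = NULL;
    ExtAudioFileCreateWithURL(segmentURL, kAudioFileAAC_ADTSType, &aac, NULL,
                              kAudioFileFlags_EraseFile, &file);
    // Declare what we will hand the file; ExtAudioFile converts PCM -> AAC internally.
    ExtAudioFileSetProperty(file, kExtAudioFileProperty_ClientDataFormat,
                            sizeof(pcm), &pcm);

    enum { kFramesPerChunk = 4096, kMaxChannels = 2 };  // stereo or mono assumed here
    Float32 scratch[kFramesPerChunk * kMaxChannels];
    UInt32 frames;
    while ((frames = rb_read(scratch, kFramesPerChunk)) > 0) {
        AudioBufferList abl;
        abl.mNumberBuffers = 1;
        abl.mBuffers[0].mNumberChannels = channels;
        abl.mBuffers[0].mDataByteSize   = frames * pcm.mBytesPerFrame;
        abl.mBuffers[0].mData           = scratch;
        ExtAudioFileWrite(file, frames, &abl);
    }
    ExtAudioFileDispose(file);
}
```

The conversion itself is the same whether the source is a file or a buffer; the difference is only in how the PCM frames are fed in, which is why decoupling capture and encoding via the ring buffer is the more robust arrangement.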

Related

ffmpeg decoding slow without calling avformat_find_stream_info

I am decoding an H.264 RTP stream with ffmpeg on Android. I have found a strange problem: if I don't call avformat_find_stream_info, decoding a P-frame takes tens of milliseconds; by contrast, calling avformat_find_stream_info before decoding reduces P-frame decoding time to less than 1 ms on average. However, avformat_find_stream_info is itself time-consuming on network streams. Is there anything I can do to make decoding fast without calling avformat_find_stream_info?
When avformat_find_stream_info is called, it scans the given URL (or local file) to determine which valid streams it contains.
In other words, it decodes a few packets from the input, which is why you can then decode packets quickly with the AVCodecContext that avformat_find_stream_info has initialized.
I haven't tested it, but in the general case the stream cannot be decoded without calling avformat_find_stream_info, or else the context ends up being reinitialized every time a packet is decoded.
In any case, that is why avformat_find_stream_info consumes network traffic: as noted above, it pulls the first few packets.
If you really want to decode packets quickly without calling this function, you have to initialize the AVCodecContext yourself, manually.
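If you go that route, the setup is roughly the following. This is a sketch against the modern ffmpeg decode API; stream probing, extradata handling and error reporting are simplified assumptions.

```c
// Manually set up an H.264 decoder context instead of relying on
// avformat_find_stream_info to do it. Sketch only; error paths trimmed.
#include <libavcodec/avcodec.h>

AVCodecContext *open_h264_decoder(void) {
    const AVCodec *codec = avcodec_find_decoder(AV_CODEC_ID_H264);
    if (!codec)
        return NULL;

    AVCodecContext *ctx = avcodec_alloc_context3(codec);
    if (!ctx)
        return NULL;

    // For an RTP/Annex-B stream the SPS/PPS usually arrive in-band, so no
    // extradata is set here; for other inputs you would copy the stream's
    // codecpar->extradata into ctx before opening.
    if (avcodec_open2(ctx, codec, NULL) < 0) {
        avcodec_free_context(&ctx);
        return NULL;
    }
    return ctx;  // feed packets with avcodec_send_packet()/avcodec_receive_frame()
}
```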

av_interleaved_write_frame returns 0 but no data written

I use ffmpeg to stream encoded AAC data, and I call av_interleaved_write_frame() to write each frame.
The return value is 0, which means success according to the documentation:
Write a packet to an output media file ensuring correct interleaving.
The packet must contain one audio or video frame. If the packets are already correctly interleaved, the application should call av_write_frame() instead as it is slightly faster. It is also important to keep in mind that completely non-interleaved input will need huge amounts of memory to interleave with this, so it is preferable to interleave at the demuxer level.
Parameters
s media file handle
pkt The packet containing the data to be written. pkt->buf must be set to a valid AVBufferRef describing the packet data. Libavformat takes ownership of this reference and will unref it when it sees fit. The caller must not access the data through this reference after this function returns. This can be NULL (at any time, not just at the end), to flush the interleaving queues. Packet's stream_index field must be set to the index of the corresponding stream in s.streams. It is very strongly recommended that timing information (pts, dts, duration) is set to correct values.
Returns
0 on success, a negative AVERROR on error.
However, I find that no data is written.
What did I miss? How do I solve it?
av_interleaved_write_frame() must hold data in memory before it writes it out. Interleaving is the process of taking multiple streams (one audio stream and one video stream, for example) and serializing them in monotonic order. So, if you write an audio frame, it is kept in memory until you write a video frame that comes 'later'. Once a later video frame is written, the audio frame can be flushed. This way streams can be processed at different speeds or in different threads, and the output is still monotonic. If you are only writing one stream (one AAC stream, no video), then use av_write_frame() as suggested.
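For the single-stream case, the write path might look like the sketch below. It assumes out_ctx and the encoder time base were set up elsewhere (avformat_alloc_output_context2, avformat_new_stream, avio_open, avformat_write_header); the function names are illustrative.

```c
// Writing one AAC elementary stream: with a single stream there is nothing
// to interleave, so av_write_frame() can be used directly. Sketch only.
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>

int write_audio_packet(AVFormatContext *out_ctx, AVPacket *pkt,
                       AVRational enc_time_base) {
    AVStream *st = out_ctx->streams[0];
    // Timing must be valid and expressed in the output stream's time base,
    // otherwise the muxer may buffer or misplace the packet.
    av_packet_rescale_ts(pkt, enc_time_base, st->time_base);
    pkt->stream_index = st->index;
    return av_write_frame(out_ctx, pkt);
}

void finish_output(AVFormatContext *out_ctx) {
    // av_write_trailer() also flushes anything still sitting in the
    // interleaving queue, which is often why "no data" appears until it runs.
    av_write_trailer(out_ctx);
}
```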

RTP decoding issue on P-frames

I am streaming an RTSP stream from an IP camera. I have a parser which packages the data into frames based on the RTP payload type. The parser is able to process I-frames, since these contain the start-of-frame and end-of-frame packets as well as packets in between (this is the FU-A payload type).
These are combined to create a complete frame. The problem comes in when I try to construct P-frames: from the Wireshark dump, some of these appear to be fragmented (FU-A payload type) and contain the start-of-frame and end-of-frame packets, but no packets in between. Also, in some instances the camera sends strangely marked packets with a payload type of 1, which according to my understanding should be a complete frame.
Upon processing these two kinds of P-frames I then use ffmpeg to attempt to decode them, and I receive error messages like "top block unavailable for intra mode 4x4".
At first I thought this could be due to an old ffmpeg version, but I searched the web and recompiled ffmpeg, and the problem remained.
The I-frames appear fragmented and contain lots of packets; some P-frames have a start of frame (0x81) and end of frame (0x41) but no packets in between, and some just look corrupt, starting with 0x41 (it seems this should be the second byte), which gives a payload type of 1. I am a novice when it comes to these issues, but I have looked at the RTP documentation and cannot find an issue with how I handle the data.
Also, I stream from VLC and this seems fine, though it appears to halve the frame rate; I am not sure how they are able to reconstruct the frames.
Please could someone help?
It is common for I-frames to be fragmented since they are usually a lot bigger than P-frames, but P-frames can also be fragmented. There is nothing wrong with a P-frame that has been fragmented into two RTP packets, i.e. one with the FU header start bit set and the following one with the end bit set; there do not need to be packets in between. For example, if the MTU is 1500 and the NAL unit is 1600 bytes large, it will be fragmented into two RTP packets.
As for the packets "looking corrupt" starting with 0x41 without a prior packet starting with 0x81, you should examine the sequence number in the RTP header, as this will tell you straight away if packets are missing. If you are seeing packet loss, the first thing to try is to increase your socket receive buffer size.
Since VLC is able to play the stream, there is most likely an issue in the way you are reassembling the NAL units.
Also, in your question it is not always clear which byte you are referring to: I'm assuming that the 0x41 and 0x81 appear in the 2nd byte of the RTP payload, i.e. the FU header in the case where the NAL unit type of the first byte is FU-A.
Finally, note that "payload type" is the RTP payload type (RFC3550), not the NAL unit type defined in the H.264 standard.
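To make that concrete, here is a sketch of how the FU header and the RTP sequence number might be inspected. The RtpPacket struct is a hypothetical stand-in for whatever your parser produces, and parsing of the fixed RTP header (CSRC list, extensions) is not shown.

```c
// Inspect an H.264 RTP payload: detect loss via the sequence number and
// read the FU indicator / FU header of a fragmented (FU-A) NAL unit.
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint16_t seq;             // sequence number from the fixed RTP header
    const uint8_t *payload;   // first byte after the RTP header
    size_t payload_len;
} RtpPacket;                  // hypothetical struct filled by your parser

void inspect_packet(const RtpPacket *p, uint16_t *expected_seq) {
    if (p->seq != *expected_seq)
        fprintf(stderr, "packet loss: expected seq %u, got %u\n",
                *expected_seq, p->seq);
    *expected_seq = (uint16_t)(p->seq + 1);

    uint8_t nal_type = p->payload[0] & 0x1F;  // low 5 bits of the NAL/FU indicator
    if (nal_type == 28) {                     // 28 = FU-A fragmentation unit
        uint8_t fu_header = p->payload[1];    // this is where 0x81 / 0x41 show up
        int start = (fu_header & 0x80) != 0;  // first fragment of the NAL unit
        int end   = (fu_header & 0x40) != 0;  // last fragment of the NAL unit
        uint8_t orig_type = fu_header & 0x1F; // type of the fragmented NAL unit
        if (start) {
            // Start a new NAL unit: rebuild its header byte as
            // (payload[0] & 0xE0) | orig_type, then append fragment data.
        }
        // Append p->payload + 2 .. payload_len to the current NAL unit;
        // a start fragment immediately followed by an end fragment is valid.
        (void)end; (void)orig_type;
    } else {
        // Single NAL unit packet: the payload already begins with the NAL header.
    }
}
```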

digital audio output - what format is it in?

My MacBook has an optical digital audio output 3.5 mm plug (see here). I'm asking here on SO because I think this is a standard digital audio output plug; the description says I should use a Toslink cable with a Toslink mini-plug adapter or a fiber-optic cable.
I was wondering: What is the format of the audio data transferred over this cable? Is it a fixed format, e.g. 44.1kHz, 16bit integer, two-channel (standard PCM like from an audio CD)? Or what formats does it allow? For example, I would like to send 96kHz (or 48kHz), 32bit float (or 24bit integer), two-channel (or 6 channels) audio data over it. How is the data encoded? How does the receiver (the DA converter) know about the format? Is there some communication back from the receiver so that the receiver tells my computer what format it would prefer? Or how do I know the maximal sample rate and the maximal bit width of a sample?
How do I do that on the software side? Is it enough to tell Core Audio to use whatever format I like, and it will put that onto the cable unmodified? At least that is my goal. So basically my main questions are: what formats are supported, and how do I know that the raw audio data in my application gets onto the cable in exactly that format?
Digital audio interconnects like TOSLINK use the S/PDIF protocol. The channel layout and compression status is encoded in the stream, and the sample rate is implied by the speed at which the signal is sent (!). For uncompressed streams, S/PDIF transmits 24-bit (integer) PCM data. (Lower bit depths can be transmitted as well; S/PDIF just pads them out to 24 bits anyway.) Note that, due to bandwidth constraints, compression must be used if more than two channels are being transmitted.
From the software side, on OS X, most of the properties of a digital audio output are controlled by the settings of your audio output device.
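As one concrete example of what "the settings of your audio output device" means in code, the Core Audio HAL lets you enumerate the physical formats a device stream can be switched to. A sketch (error checking omitted; link against the CoreAudio framework):

```c
// List the physical formats offered by the first output stream of the
// default output device. Sketch only; real code must check return codes.
#include <CoreAudio/CoreAudio.h>
#include <stdio.h>

int main(void) {
    AudioObjectPropertyAddress addr = {
        kAudioHardwarePropertyDefaultOutputDevice,
        kAudioObjectPropertyScopeGlobal,
        kAudioObjectPropertyElementMaster
    };
    AudioDeviceID device = kAudioObjectUnknown;
    UInt32 size = sizeof(device);
    AudioObjectGetPropertyData(kAudioObjectSystemObject, &addr, 0, NULL,
                               &size, &device);

    // The device's output streams.
    addr.mSelector = kAudioDevicePropertyStreams;
    addr.mScope    = kAudioObjectPropertyScopeOutput;
    AudioObjectGetPropertyDataSize(device, &addr, 0, NULL, &size);
    AudioStreamID streams[size / sizeof(AudioStreamID)];
    AudioObjectGetPropertyData(device, &addr, 0, NULL, &size, streams);

    // Physical formats the first stream supports.
    addr.mSelector = kAudioStreamPropertyAvailablePhysicalFormats;
    addr.mScope    = kAudioObjectPropertyScopeGlobal;
    AudioObjectGetPropertyDataSize(streams[0], &addr, 0, NULL, &size);
    UInt32 count = size / sizeof(AudioStreamRangedDescription);
    AudioStreamRangedDescription formats[count];
    AudioObjectGetPropertyData(streams[0], &addr, 0, NULL, &size, formats);

    for (UInt32 i = 0; i < count; i++)
        printf("%.0f Hz, %u-bit, %u channel(s)\n",
               formats[i].mFormat.mSampleRate,
               (unsigned)formats[i].mFormat.mBitsPerChannel,
               (unsigned)formats[i].mFormat.mChannelsPerFrame);
    return 0;
}
```

Whatever physical format you select there is what actually goes onto the cable; the mixer-level client format you hand to Core Audio is converted to it.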

Parallelize encoding of audio-only segments in ffmpeg

We are looking to decrease the execution time of segmenting/encoding WAV to segmented AAC for HTTP Live Streaming, using ffmpeg to segment and generate an m3u8 playlist, by utilizing all the cores of our machine.
In one experiment, I had ffmpeg directly segment a WAV file into AAC with libfdk_aac; however, it took quite a long time to finish.
In the second experiment, I had ffmpeg segment the WAV file as is (still WAV), which was quite fast (< 1 second on our machines), then used GNU parallel to run ffmpeg again to encode the WAV segments to AAC, and manually edited the .m3u8 file without changing the segment durations. This performed much faster; however, "silence" gaps could be heard when streaming the output audio.
I initially tried the second scenario using MP3 and the result was much the same. I've read that LAME adds padding during encoding (http://scruss.com/blog/2012/02/21/generational-loss-in-mp3-re-encoding/); does this mean that libfdk_aac also adds padding during encoding?
Maybe this is related to this question: how can I encode and segment audio files without having gaps (or audio pops) between segments when I reconstruct them?
According to section 4 of the HLS specification:

A Transport Stream or audio elementary stream segment MUST be the continuation of the encoded media at the end of the segment with the previous sequence number, where values in a continuous series, such as timestamps and Continuity Counters, continue uninterrupted.

"Silence" gaps are 99.99% of the time related to wrong counters/discontinuities. Because you wrote that you manually changed the .m3u8 file without changing the durations, I deduce that you tried to cut the audio yourself. It can't be done.

An HLS stream can't be created in parallel because of these counters; they must follow a sequence [MPEG2-TS :-(]. You'd better get a faster processor.
