DirectShow push sources, syncing and timestamping

I have a filter graph that takes raw audio and video input and then uses the ASF Writer to encode them to a WMV file.
I've written two custom push source filters to provide the input to the graph. The audio filter just uses WASAPI in loopback mode to capture the audio and send the data downstream. The video filter takes raw RGB frames and sends them downstream.
For both the audio and video frames, I have the performance counter value for the time the frames were captured.
Question 1: If I want to properly timestamp the video and audio, do I need to create a custom reference clock that uses the performance counter or is there a better way for me to sync the two inputs, i.e. calculate the stream time?
The video input is captured from a Direct3D buffer somewhere else and I cannot guarantee the framerate, so it behaves like a live source. I always know the start time of a frame, of course, but how do I know the end time?
For instance, let's say the video filter ideally wants to run at 25 FPS, but due to latency and so on, frame 1 starts perfectly at the 1/25th mark but frame 2 starts later than the expected 2/25th mark. That means there's now a gap in the graph since the end time of frame 1 doesn't match the start time of frame 2.
Question 2: Will the downstream filters know what to do with the delay between frame 1 and 2, or do I manually have to decrease the length of frame 2?

One option is to omit time stamps altogether, but some filters may then fail to process the data. Another option is to use the System Reference Clock to generate time stamps; either way, this is preferable to using the performance counter directly as the time stamp source.
Yes, you need to time stamp both video and audio in order to keep them in sync; it is the only way to indicate that the data belongs to the same point in time.
Video samples do not need a duration: you can omit the stop time or set it equal to the start time. A gap between one frame's stop time and the next frame's start time has no consequences.
Downstream filters and renderers are free to choose whether to respect time stamps or not. With audio, of course, you will want a smooth stream without gaps in the time stamps.
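To illustrate the clock-based approach, here is a minimal sketch (not the poster's code) of a CSourceStream-derived pin stamping its samples with stream time taken from the graph clock. It assumes the DirectShow base classes and a hypothetical member m_rtRunStart that caches the start time the filter received in Run():

// Minimal sketch of time stamping in FillBuffer(); m_rtRunStart is a
// hypothetical member holding the tStart value passed to IMediaFilter::Run().
HRESULT CMyVideoPin::FillBuffer(IMediaSample *pSample)
{
    // ... copy the captured RGB frame into the sample's buffer here ...

    // Ask the graph's reference clock (by default the System Reference Clock)
    // for the current time.
    REFERENCE_TIME rtNow = 0;
    IReferenceClock *pClock = NULL;
    if (SUCCEEDED(m_pFilter->GetSyncSource(&pClock)) && pClock)
    {
        pClock->GetTime(&rtNow);
        pClock->Release();
    }

    // Stream time = clock time - start time passed to Run().
    REFERENCE_TIME rtStart = rtNow - m_rtRunStart;
    REFERENCE_TIME rtStop  = rtStart;   // stop time may simply equal start time
    pSample->SetTime(&rtStart, &rtStop);
    pSample->SetSyncPoint(TRUE);        // raw RGB frames are all key frames
    return S_OK;
}

The audio pin can stamp its buffers the same way, which keeps both streams on the same time base for the ASF Writer.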

Related

GOP size does not correlate with actual latency

As far as I know, GOP size should correlate with the observable video delay (latency). For example, if the GOP size is 2 seconds, the delay in the video should be close to two seconds, and so on, at least with CBR. But when I set the GOP size to 2, publish the stream to an ingest server, consume that stream and measure the latency, it is between 0.8 and 1.2 seconds, not the 2+ seconds I expected. Increasing the GOP size gives similar results: with a GOP of 4 the latency is around 2.5 seconds, not 4 seconds.
How I measure this latency: I stream a running stopwatch from a web camera to the ingest server using OBS, then calculate the difference between the stopwatch value and the value displayed in the stream consumed from the ingest. For better measurement accuracy, I take a photo with the stopwatch and the image from the ingest in one field of view.
Can you suggest why I get these results, and how valid is my assumption about the correlation between GOP size and video latency? Maybe an H.264 setting like "zerolatency" works some magic?
Thanks.
For streaming, each group of pictures is made up of IPPPP...: a key frame followed by some number of seconds' worth of P frames. In principle, an encoder need not incur a delay of any particular length. With constant bit rate streams, the delay arises because the encoder must sometimes re-encode some frames at a lower or higher bit rate to stay on target.
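If you want to experiment with where the latency actually comes from, a rough sketch of the relevant encoder settings via FFmpeg's API (the resolution, frame rate and bit rate here are only illustrative, not the poster's OBS configuration) could look like this:

extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/opt.h>
}

// Illustrative libx264 setup: gop_size controls key-frame spacing, while most
// of the encoder-side latency comes from lookahead, B-frame reordering and
// rate control rather than from the GOP length itself.
static AVCodecContext *make_encoder()
{
    const AVCodec *codec = avcodec_find_encoder_by_name("libx264");
    AVCodecContext *ctx  = avcodec_alloc_context3(codec);
    ctx->width        = 1280;
    ctx->height       = 720;
    ctx->time_base    = AVRational{1, 30};
    ctx->framerate    = AVRational{30, 1};
    ctx->pix_fmt      = AV_PIX_FMT_YUV420P;
    ctx->bit_rate     = 2500000;            // CBR-like target
    ctx->gop_size     = 2 * 30;             // "2-second GOP" at 30 fps
    ctx->max_b_frames = 0;                  // B-frames add reordering delay
    // "zerolatency" disables lookahead and threading delay in libx264.
    av_opt_set(ctx->priv_data, "tune", "zerolatency", 0);
    avcodec_open2(ctx, codec, NULL);
    return ctx;
}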

FFmpeg dash manifest '-window_size'

In the FFmpeg DASH documentation I don't understand the purpose of -window_size which is explained as:
Set the maximum number of segments kept in the manifest.
If my video is 30 seconds long, the GOP size is 4 seconds and the segment length is 4 seconds, what is the meaning and purpose of a parameter that controls the maximum number of segments kept in the manifest? When does this parameter need to be used, and how do you determine valid values?
I'm guessing that the stream is being loaded into memory and the number of segments in the manifest controls how much is kept in memory at one time but it's just a wild guess and I can't find any further explanation.
I am not live streaming in case it's relevant.
The window size is relevant if you stream live. In a live scenario a player could rewind, and the window size determines how far back it can go. Since you are not live streaming, it is not relevant for you.
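For completeness, the same option can also be passed to the dash muxer programmatically; here is a hedged sketch (stream setup is omitted and the option values are only examples):

extern "C" {
#include <libavformat/avformat.h>
#include <libavutil/dict.h>
}

// Sketch: keep at most 5 segments in a rolling live manifest. This only
// matters for a live presentation; for a 30-second VOD clip the whole
// timeline is described regardless.
static int open_dash_output(const char *path)
{
    AVFormatContext *oc = NULL;
    avformat_alloc_output_context2(&oc, NULL, "dash", path);
    // ... create and configure the audio/video streams here ...
    AVDictionary *opts = NULL;
    av_dict_set(&opts, "window_size", "5", 0);   // max segments kept in the manifest
    av_dict_set(&opts, "seg_duration", "4", 0);  // 4-second segments (example value)
    int ret = avformat_write_header(oc, &opts);
    av_dict_free(&opts);
    return ret;
}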

av_interleaved_write_frame return 0 but no data written

I use FFmpeg to stream encoded AAC data, and I call
av_interleaved_write_frame()
to write each frame.
The return value is 0, which means success according to the documentation:
Write a packet to an output media file ensuring correct interleaving.
The packet must contain one audio or video frame. If the packets are already correctly interleaved, the application should call av_write_frame() instead as it is slightly faster. It is also important to keep in mind that completely non-interleaved input will need huge amounts of memory to interleave with this, so it is preferable to interleave at the demuxer level.
Parameters
s media file handle
pkt The packet containing the data to be written. pkt->buf must be set to a valid AVBufferRef describing the packet data. Libavformat takes ownership of this reference and will unref it when it sees fit. The caller must not access the data through this reference after this function returns. This can be NULL (at any time, not just at the end), to flush the interleaving queues. Packet's stream_index field must be set to the index of the corresponding stream in s.streams. It is very strongly recommended that timing information (pts, dts, duration) is set to correct values.
Returns
0 on success, a negative AVERROR on error.
However, I found that no data was written.
What did I miss? How do I solve it?
av_interleaved_write_frame() may hold data in memory before it writes it out. Interleaving is the process of taking multiple streams (one audio stream and one video stream, for example) and serializing them in monotonic order. So if you write an audio frame, it is kept in memory until you write a video frame that comes 'later'. Once a later video frame is written, the audio frame can be flushed. This way streams can be processed at different speeds or in different threads, but the output is still monotonic. If you are only writing one stream (one AAC stream, no video), then use av_write_frame() as suggested.
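As a rough sketch (the function and variable names here are made up, not the poster's code), the usual pattern is to set the timing fields and the stream index before writing, and to flush the interleaving queues at the end:

extern "C" {
#include <libavformat/avformat.h>
}

// Rescale the encoder's pts/dts/duration to the stream time base, tag the
// packet with its stream, and hand it to the interleaving writer. The call
// may buffer the packet instead of writing it to the output immediately.
static int write_encoded_packet(AVFormatContext *fmt, AVStream *st,
                                AVRational enc_time_base, AVPacket *pkt)
{
    av_packet_rescale_ts(pkt, enc_time_base, st->time_base);
    pkt->stream_index = st->index;
    return av_interleaved_write_frame(fmt, pkt);
}

// When finished: a NULL packet flushes the interleaving queues, and
// av_write_trailer() finalizes the output.
static void finish_output(AVFormatContext *fmt)
{
    av_interleaved_write_frame(fmt, NULL);
    av_write_trailer(fmt);
}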

MME Audio Output Buffer Size

I am currently playing around with outputting FP32 samples via the old MME API (waveOutXxx functions). The problem I've bumped into is that if I provide a buffer length that does not evenly divide the sample rate, certain audible clicks appear in the audio stream; when recorded, it looks like some of the samples are lost (I'm generating a sine wave for the test). Currently I am using the "magic" value of 2205 samples per buffer for 44100 sample rate.
The question is, does anybody know the reason for these dropouts and if there is some magic formula that provides a way to compute the "proper" buffer size?
The safe alignment for data buffers is the nBlockAlign value of the WAVEFORMATEX structure.
Software must process a multiple of nBlockAlign bytes of data at a time. Data written to and read from a device must always start at the beginning of a block. For example, it is illegal to start playback of PCM data in the middle of a sample (that is, on a non-block-aligned boundary).
For PCM formats this is the number of bytes for a single sample across all channels. Non-PCM formats have their own alignments, often equal to the length of a format-specific block, e.g. 20 ms.
Back when waveOutXxx was the primary audio API, carrying over unaligned bytes would have been an unreasonable burden on the API and unneeded performance overhead. Nowadays this API is a compatibility layer on top of other audio APIs, and I suppose that unaligned trailing bytes are simply stripped so that the rest of the content still plays, rather than the whole buffer being rejected over what may be a small, non-fatal inaccuracy on the caller's part.
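As a small illustration (assuming 32-bit float samples, stereo, 44.1 kHz), the block alignment and a safely sized buffer could be computed like this:

#include <windows.h>
#include <mmreg.h>

// Round a requested buffer size down to a whole number of blocks, so that every
// WAVEHDR passed to waveOutWrite() starts and ends on a block boundary.
// 32-bit float stereo: one block (sample frame) is 2 channels * 4 bytes = 8 bytes.
static DWORD AlignedBufferBytes(DWORD requestedBytes)
{
    WAVEFORMATEX wfx = {};
    wfx.wFormatTag      = WAVE_FORMAT_IEEE_FLOAT;
    wfx.nChannels       = 2;
    wfx.nSamplesPerSec  = 44100;
    wfx.wBitsPerSample  = 32;
    wfx.nBlockAlign     = static_cast<WORD>(wfx.nChannels * wfx.wBitsPerSample / 8);
    wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.nBlockAlign;
    return requestedBytes - (requestedBytes % wfx.nBlockAlign);
}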
If you fill the audio buffer with sine samples and play it in a loop, it will very easily click unless the buffer holds a whole number of periods of the wave, as you said; the audible click is in fact a discontinuity in the waveform. A more advanced technique is to fill the buffer dynamically: set up a callback notification as the playback position advances and fill the buffer with the appropriate data at the appropriate offset (see the sketch below). I would use a larger buffer, since 2205 samples is rather short to receive an async notification, compute the data and write the buffer, all while playing, but it depends on CPU power.
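Here is a hedged sketch of that approach; the buffer sizes, the 440 Hz tone and the use of CALLBACK_EVENT rather than a callback function are my own choices for illustration. Two buffers are primed, and whichever one finishes playing is refilled with the continuing sine wave, so the phase stays continuous across buffer boundaries and no click is produced:

#include <windows.h>
#include <mmreg.h>
#include <mmsystem.h>
#include <cmath>
#include <vector>
#pragma comment(lib, "winmm.lib")

int main()
{
    const int sampleRate = 44100, channels = 2;
    WAVEFORMATEX wfx = {};
    wfx.wFormatTag      = WAVE_FORMAT_IEEE_FLOAT;
    wfx.nChannels       = channels;
    wfx.nSamplesPerSec  = sampleRate;
    wfx.wBitsPerSample  = 32;
    wfx.nBlockAlign     = static_cast<WORD>(channels * 32 / 8);
    wfx.nAvgBytesPerSec = sampleRate * wfx.nBlockAlign;

    HANDLE hDone = CreateEvent(NULL, FALSE, FALSE, NULL);   // signaled on WOM_DONE
    HWAVEOUT hwo = NULL;
    waveOutOpen(&hwo, WAVE_MAPPER, &wfx, (DWORD_PTR)hDone, 0, CALLBACK_EVENT);

    // Two ~100 ms buffers, each holding a whole number of sample frames.
    const int framesPerBuffer = sampleRate / 10;
    std::vector<float> data[2];
    WAVEHDR hdr[2] = {};
    double phase = 0.0;
    const double step = 2.0 * 3.14159265358979 * 440.0 / sampleRate;

    // Fill buffer i with the next chunk of the sine wave and queue it again.
    auto fill = [&](int i) {
        for (int f = 0; f < framesPerBuffer; ++f) {
            const float s = static_cast<float>(sin(phase));  // phase continues across buffers
            phase += step;
            for (int c = 0; c < channels; ++c)
                data[i][f * channels + c] = s;
        }
        waveOutWrite(hwo, &hdr[i], sizeof(WAVEHDR));
    };

    for (int i = 0; i < 2; ++i) {
        data[i].resize(framesPerBuffer * channels);
        hdr[i].lpData         = reinterpret_cast<LPSTR>(data[i].data());
        hdr[i].dwBufferLength = framesPerBuffer * wfx.nBlockAlign;
        waveOutPrepareHeader(hwo, &hdr[i], sizeof(WAVEHDR));
        fill(i);                                             // prime both buffers
    }

    for (;;) {                                               // refill whichever buffer completed
        WaitForSingleObject(hDone, INFINITE);
        for (int i = 0; i < 2; ++i)
            if (hdr[i].dwFlags & WHDR_DONE)
                fill(i);
    }
    // (cleanup with waveOutUnprepareHeader/waveOutClose omitted in this sketch)
}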

MPEG2 TS end-to-end delay

I need to calculate the end-to-end delay (between the encoder and the decoder) in an MPEG-2 TS based on time stamp information (PTS, PCR, DTS). Are those time stamps enough to calculate the delay?
These time stamps are inserted into the transport stream by the encoder and are used by the decoder, for example to sync audio and video frames and, more generally, to lock to the original clock so the video is displayed correctly.
The delay between an encoder and a decoder, on the other hand, is like asking what the delay is between transmitting data from the source and receiving it at the destination. This is not determined by the data itself (i.e. the transport stream and the time stamps within it) but by the network conditions.
