Concatenating mka files but keeping timestamp - ffmpeg

I am trying to mix a few .mka files from a Twilio Video conference recording with FFmpeg. I want a separate track for each participant, but I want to keep the overall timestamps from the original files.
Concrete example: I have these three files:
0PA1896e43f4ca0edf17d8dbfc0bab95a52.mka
1PA2a640f11bc13af2c29397800f058cb05.mka
2PA9fa5b32edc016f6f5b9669bb9b308d97.mka
These files are all tracks of one participant in the call, who joined at different times (leaving the meeting and re-entering results in a new file).
I want to mix those files into a single file while keeping the timestamp at which each was recorded.
FFProbe shows the start of each of these files:
0PA1896e43f4ca0edf17d8dbfc0bab95a52.mka - Duration: 00:00:17.87, start: 1.360000, bitrate: 78 kb/s
1PA2a640f11bc13af2c29397800f058cb05.mka - Duration: 00:00:22.76, start: 22.521000, bitrate: 78 kb/s
2PA9fa5b32edc016f6f5b9669bb9b308d97.mka - Duration: 00:00:20.36, start: 48.944000, bitrate: 78 kb/s
So the output should start with silence until the first file begins at 1.360000, then the second file should be appended starting at 22.521000 and the third at 48.944000. This would result in a single file containing all three recordings, with silence wherever nothing was recorded. Practically, I want a delay at the start.
Imagine I'm adding a 4th recording that starts at minute 2; between recordings 3 and 4 there would be a gap of silence.
Or imagine a call with 3 participants where the 3rd one joins only at minute 5. The first 5 minutes should be silence so I can pass the 3rd participant's track to the transcription API and still get the correct timestamps.
The reason I want it this way is that I want to transcribe the audio to text and need the exact timestamp at which each piece of text can be heard.

You would use the aresample and amix filters for this. The aresample filter is applied to each input in order to insert silence up to its starting timestamp. These processed streams are then mixed together with the amix filter.
I'm going to call the inputs 0.mka, 1.mka and 2.mka.
ffmpeg -copyts -i 0.mka -i 1.mka -i 2.mka -filter_complex "[0]aresample=async=1:first_pts=0[a0];[1]aresample=async=1:first_pts=0[a1];[2]aresample=async=1:first_pts=0[a2];[a0][a1][a2]amix=inputs=3" out.mka
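If the number of track files varies per participant, a small shell sketch along these lines can build the same filtergraph dynamically. This is a minimal sketch, assuming bash and that all of one participant's .mka files sit in the current directory; the filenames and output name are illustrative.
#!/usr/bin/env bash
# Build one aresample branch per input file, then mix them all with amix.
inputs=()
filters=""
labels=""
i=0
for f in *.mka; do
  inputs+=(-i "$f")
  filters+="[$i]aresample=async=1:first_pts=0[a$i];"   # pad with silence up to each file's start
  labels+="[a$i]"
  i=$((i+1))
done
ffmpeg -copyts "${inputs[@]}" -filter_complex "${filters}${labels}amix=inputs=$i" out.mka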

Related

HTML5 H264 video sometimes not displaying

Given this stream from an RTSP camera which produces an H264 stream:
Input #0, rtsp, from 'rtsp://admin:admin@192.168.0.15:554':
Metadata:
title : LIVE555 Streaming Media v2017.10.28
comment : LIVE555 Streaming Media v2017.10.28
Duration: N/A, start: 0.881956, bitrate: N/A
Stream #0:0: Video: h264 (Main), yuv420p(progressive), 1600x900, 25 fps, 25 tbr, 90k tbn, 50 tbc
I want to run ffmpeg and pipe its output to an HTML5 video element with MSE.
Everything is fine and smooth as long as I run this ffmpeg command (piping is removed!):
$ ffmpeg -i 'rtsp://admin:admin@192.168.0.15:554' -c:v copy -an -movflags frag_keyframe+empty_moov -f mp4
However, it takes a bit of time at the beginning.
I realized that the function avformat_find_stream_info adds about 15-20 seconds of delay on my system. Here are the docs.
Now I have also realized that if I add -probesize 32, avformat_find_stream_info returns almost immediately, but it causes some warnings:
$ ffmpeg -probesize 32 -i 'rtsp://admin:admin@192.168.0.15:554' -c:v copy -an -movflags frag_keyframe+empty_moov -f mp4
[rtsp @ 0x1b2b300] Stream #0: not enough frames to estimate rate; consider increasing probesize
[rtsp @ 0x1b2b300] decoding for stream 0 failed
Input #0, rtsp, from 'rtsp://admin:admin@192.168.0.15:554':
Metadata:
title : LIVE555 Streaming Media v2017.10.28
comment : LIVE555 Streaming Media v2017.10.28
Duration: N/A, start: 0.000000, bitrate: N/A
Stream #0:0: Video: h264 (Main), yuv420p(progressive), 1600x900, 25 tbr, 90k tbn, 50 tbc
If I dump this stream out (into a file, test.mp4), all media players can play it perfectly.
However, if I pipe this output into the HTML5 video element with MSE, the stream is sometimes displayed correctly and sometimes just isn't. No warnings or error messages are printed in the browser console.
From the second output I can see the fps is missing. I tried to set it manually, but did not succeed (it seemed I could not change it manually).
How can I avoid avformat_find_stream_info and still have HTML5 MSE playback if I know everything about the stream beforehand?
Update
According to @szatmary's comments and answers, I searched for an H264 bitstream parser.
This is what I found. I also saved the mp4 file, which is not playable by the HTML5 video element but does play in VLC, and dropped it into this analyser.
Here is a screenshot of my analysis:
Some facts here:
until #66 there is no type 7 (SPS) unit in the stream.
#62 is the last PPS before the first SPS arrives.
there are a lot of PPS units even before #62.
the bitstream ends at #103.
played in VLC, the stream is 20 seconds long.
I have several things to clear up:
do the #62 and #66 SPS/PPS units (or whatever they are) hold metadata only for the frames that follow, or can they also refer to previous frames?
VLC plays 20 seconds; is it possible that it scans the whole file first and then plays the frames from #1 based on the #62 and #66 info? If VLC received the file as a stream, it might then play only a few seconds (#66 - #103).
most important: what shall I do with the bitstream parser to make the HTML5 video element play this data? Shall I drop all the units before #62? Or before #66?
Now I'm really lost in this topic. I have created a video with FFMPEG, but this time I allowed it to finish its avformat_find_stream_info function.
I saved the video with the same method as previously. VLC now plays 18 seconds (this is okay; I have a 1000-frame limit in the ffmpeg command).
However, let's now look at the bitstream information:
Now the PPS and SPS are at #130 and #133 respectively. This resulted in a stream which is 2 seconds shorter than before (I guess).
Now I have learned that in a correctly parsed H264 stream there can still be a lot of units before the first SPS/PPS.
So I would fine-tune my question above: what shall I do with the bitstream parser to make the HTML5 video element play this data?
Also, the bitstream parser I found is not a good fit because it uses a binary wrapper, so it cannot be run purely on the client side.
I'm looking at mp4box now.
How can I avoid avformat_find_stream_info and still have HTML5 MSE playback if I know everything about the stream beforehand?
You don't know everything about the stream beforehand. You don't know the resolution, the bitrate, the level, the profile, or the constraint flags. You don't know the scaling list values, the VUI data, or whether CABAC is used.
The player needs all these things to play the video, and they are not known until the player, or ffmpeg, sees the first SPS/PPS in the stream. By limiting the analyze duration you are telling ffmpeg to give up looking for them, so it can't be guaranteed to produce a valid stream. It may work sometimes and not other times, and it largely depends on which frame of the RTSP stream you start on.
A possible solution would be to add more keyframes to the source video if you can; this will send the SPS/PPS more frequently. If you don't control the source stream, you just have to wait until an SPS/PPS shows up in the stream.
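If you do control the encoding side, a hedged sketch of re-encoding with more frequent keyframes (and therefore more frequent SPS/PPS) might look like the command below. The GOP length of 25 (one keyframe per second at 25 fps) is an assumption, not a value from the question, and the pipe target is omitted as in the original command:
$ ffmpeg -i 'rtsp://admin:admin@192.168.0.15:554' -c:v libx264 -g 25 -keyint_min 25 -an -movflags frag_keyframe+empty_moov -f mp4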

RTMP server can't stream video (only audio)

I'm implementing an RTMP server right now, and everything's been working except for video streaming. I can stream audio with no problems (using OBS to stream) and play it back via VLC. The problem is that VLC plays the audio, but no video. What I'm doing right now is forwarding every audio and video message I receive from OBS: I grab the original payload (audio/video data) and put it in a Type 0 chunk, since I've seen pretty much every implementation do this. I don't know if I'm missing some sort of processing that should be done on the video data.
If I try to play it back with ffmpeg (saving the RTMP stream to an FLV file), I get this output:
[NULL @ 000001eb053ed440] missing picture in access unit with size 5209
[AVBSFContext @ 000001eb053ecbc0] No start code is found.
rtmp://192.168.1.2/app/publish: could not find codec parameters
Input #0, flv, from 'rtmp://192.168.1.2/app/publish':
Duration: N/A, start: 0.000000, bitrate: N/A
Stream #0:0: Data: none
Stream #0:1: Video: h264, none, 1k tbn
Output #0, flv, to 'av.flv':
Output file #0 does not contain any stream
It says missing picture in access unit with size 5209, No start code is found, and could not find codec parameters. What am I missing here? I know I'm forwarding the payload exactly as I received it on my server; I even did a hash check on the video payload I'm receiving and the one I'm sending, and they are exactly the same. Any help would be greatly appreciated.
Fixed by following @szatmary's suggestion: resending the sequence headers to every playback client before sending any audio/video messages.

Scene detection and concat makes my video longer (FFMPEG)

I'm encoding videos by scenes. At the moment I have two solutions for doing so. The first one uses a Python application which gives me a list of frames that mark scene boundaries, like this:
285
378
553
1145
...
The first scene runs from frame 1 to 285, the second from 285 to 378, and so on. So I made a bash script which encodes all these scenes. Basically it takes the current and previous frame numbers, converts them to times and finally runs the ffmpeg command:
begin=$(awk 'BEGIN{ print "'$previous'"/"'24'" }')
end=$(awk 'BEGIN{ print "'$current'"/"'24'" }')
time=$(awk 'BEGIN{ print "'$end'"-"'$begin'" }')
ffmpeg -i $video -r 24 -c:v libx265 -f mp4 -c:a aac -strict experimental -b:v 1.5M -ss $begin -t $time "output$count.mp4" -nostdin
This works perfectly. The second method uses ffmpeg itself. I run this command and it gives me a list of times, like this:
15.75
23.0417
56.0833
71.2917
...
Again I made a bash script that encodes all these times. In this case I don't have to convert frames to times because what I get are already times:
time=$(awk 'BEGIN{ print "'$current'"-"'$previous'" }')
ffmpeg -i $video -r 24 -c:v libx265 -f mp4 -c:a aac -strict experimental -b:v 1.5M -ss $previous -t $time "output$count.mp4" -nostdin
With all that explained, here comes the problem. Once all the scenes are encoded I need to concatenate them, and to do that I create a list with the video file names and then run the ffmpeg concat command.
list.txt
file 'output1.mp4'
file 'output2.mp4'
file 'output3.mp4'
file 'output4.mp4'
command:
ffmpeg -f concat -i list.txt -c copy big_buck_bunny.mp4
The problem is that the concatenated video is longer than the original by 2.11 seconds. The original lasts 596.45 seconds and the encoded one lasts 598.56. I added up every segment's duration and got 598.56, so I think the problem is in the encoding process. Both videos have the same number of frames. My goal is to get metrics about the encoding process; when I run VQMT to get the PSNR and SSIM I get weird results, which I think is due to this problem.
By the way, I'm using the big_buck_bunny video.
The difference is probably due to the copy codec. In the latter case, you tell ffmpeg to copy the segments, but it can't do that exactly at your input times.
It first has to find the previous I-frame (a frame that can be decoded without reference to any previous frame) and start from there.
To get what you need, you have to either re-encode the video (like you did in the two earlier examples) or change the times so that the cuts fall on I-frames.
To make sure I'm getting your issue correctly:
You have a source video (encoded at a variable frame rate, close to 18 fps).
You want to split the source video via ffmpeg, forcing the frame rate to 24 fps.
Then you want to concat each segment.
I think the issue is mainly that you have some discrepancy in the timing (if I divide the frame indexes by the times you've given, I get between 16 and 18 fps). When you convert them in step 2, the output video segments will be 24 fps. ffmpeg does not resample on the time axis, so if you force a video rate, the video will speed up or slow down.
There is also the issue of consistency for the stream:
Typically, a video stream must start with an I-frame, so when splitting, FFMPEG has to locate the previous I-frame (when using the copy codec), and this changes the duration of the segment.
When you are concatenating, you can also have a consistency issue: if the segment you are concatenating ends with an I-frame and the next one starts with an I-frame, it's possible FFMPEG drops one of them, although I don't remember what the current behaviour is.
So, to solve your issue, if I were you, I would avoid step 2 (it's bad for quality anyway). That is, I would use ffmpeg to split the segments of interest based on frame number (that's the only value that's not approximate in your scheme) into PNG or PPM frames (or to a pipe if you don't care about keeping them), and then concat all the frames by encoding them in the last step with the expected rate set to totalVideoTime / totalFrameCount. A sketch of this two-step approach is shown below.
You'll get a smaller and higher quality final video.
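A minimal sketch of that approach, assuming the second scene runs from frame 285 to frame 377 as in the list above and that a 24 fps output is wanted (frame numbers, frame rate and filenames are illustrative, and audio is ignored for brevity):
# Step 1: dump the frames of one scene losslessly to PNG
ffmpeg -i big_buck_bunny.mp4 -vf "select=between(n\,285\,377)" -vsync 0 scene02_%05d.png
# Step 2: encode the dumped frames at the desired rate
ffmpeg -framerate 24 -i scene02_%05d.png -c:v libx265 -b:v 1.5M scene02.mp4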
If you can't do what I said for whatever reason, at least for the concat input, you should use the ffconcat format:
ffconcat version 1.0
file segment1
duration 12.2
file segment2
duration 10.3
This will give you the expected duration, because each segment will be cut if it's longer than the specified duration.
For selecting by frame number (instead of by time, since time is hard to get right with variable frame rate video), you should use the select filter, like this:
-vf "select=between(n\,start_frame_num\,end_frame_num),setpts=PTS-STARTPTS"
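For instance, a hedged complete command using the frame numbers from the list above (285 to 377 for the second scene), with the encoder settings carried over from the question and audio left out for brevity:
ffmpeg -i big_buck_bunny.mp4 -vf "select=between(n\,285\,377),setpts=PTS-STARTPTS" -an -c:v libx265 -b:v 1.5M output2.mp4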
I suggest checking the input and output frame rates and making sure they match; a mismatch could be a source of the discrepancy.
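A quick way to check that might be an ffprobe call like the following (big_buck_bunny.mp4 stands in for whichever file you are inspecting):
ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate,avg_frame_rate -of default=noprint_wrappers=1 big_buck_bunny.mp4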

How to split accurately a LONG GOP video (h264/XDCAM...) with FFMPEG?

My goal is to split an XDCAM or an H264 video, frame-accurately, with ffmpeg.
I guess that the problem comes from its long GOP structure, but I'm looking for a way to split the video without re-encoding it.
I apply an offset to encode only a specific section of the video (let's say from the 10th second to the end of the media).
Any ideas?
Please refer to the ffmpeg documentation.
You will find an option, -frames. That option can be used to specify, for a given output stream (in the following, v:0 is the first video stream), the number of frames to record. It can be combined with other options to start somewhere in the input file (time offset, etc.).
ffmpeg -i input.ts -frames:v:0 100 -vcodec copy test.ts
That command demuxes and remuxes only the first 100 frames of the video (no re-encoding).
As said, you can combine it with a jump. Using '-ss offset' (as an input option) you can specify a "frame accurate" position, i.e. frame 14 after 1 min 10 s = 0:1:10:14. That option should be used before the input, like below.
ffmpeg -ss 00:00:10.0 -i input.ts -frames:v:0 100 -vcodec copy test.ts
ffmpeg discards the first 10 seconds and passes 100 frames to the muxer.
I'm not sure if it's possible to do in one pass with ffmpeg, but it can be done in two or three:
1st pass: dump the raw frames to a file.
2nd pass: find the closed GOP (XDCAM) / IDR frame (H264) whose index is <= the index of the frame you want to start at (a possible ffprobe command for this step is sketched after this list).
If the indexes are equal you can start muxing; otherwise you need to decode the sequence from that closed GOP / IDR frame up to the next closed GOP / IDR frame and re-encode it starting at the frame you want.
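A hedged sketch of how the keyframe positions for the second step could be listed with ffprobe (keyframe-flagged packets usually, but not always, correspond to IDR frames / closed GOPs; input.ts is the file from the commands above):
ffprobe -v error -select_streams v:0 -show_packets -show_entries packet=pts_time,flags -of csv=p=0 input.ts | grep K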

Extracting metadata from incomplete video files

Can anyone tell me where metadata is stored in common video file formats, and whether it is located towards the start of the file or scattered throughout?
I'm working with a remote object store containing a lot of video files, and I want to extract metadata, in particular the video duration and dimensions, from those files without streaming the entire file contents to the local machine.
I'm hoping that this metadata will be stored in the first X bytes of files, and so I can just fetch a byte range starting at the beginning instead of the whole file, passing this partial file data to ffprobe.
For testing purposes I created a 22MB MP4 file, and used the following command to supply only the first 1MB of data to ffprobe:
head -c1024K '2013-07-04 12.20.07.mp4' | ffprobe -
It prints:
avprobe version 0.8.6-4:0.8.6-0ubuntu0.12.04.1, Copyright (c) 2007-2013 the Libav developers
built on Apr 2 2013 17:02:36 with gcc 4.6.3
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x1a6b7a0] stream 0, offset 0x10beab: partial file
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'pipe:':
Metadata:
major_brand : isom
minor_version : 0
compatible_brands: isom3gp4
creation_time : 1947-07-04 11:20:07
Duration: 00:00:09.84, start: 0.000000, bitrate: N/A
Stream #0.0(eng): Video: h264 (High), yuv420p, 1920x1080, 20028 kb/s, PAR 65536:65536 DAR 16:9, 29.99 fps, 30 tbr, 90k tbn, 180k tbc
Metadata:
creation_time : 1947-07-04 11:20:07
Stream #0.1(eng): Audio: aac, 48000 Hz, stereo, s16, 189 kb/s
Metadata:
creation_time : 1947-07-04 11:20:07
So I see the first 1MB was enough to extract the video duration (9.84 seconds) and video dimensions (1920x1080), even though ffprobe printed a warning about detecting a partial file. If I supply less than 1MB, it fails completely.
Would this approach work for other common video file formats to reliably extract metadata, or do any common formats scatter metadata throughout the file?
I'm aware of the concept of container formats and that various codecs may be used to represent the audio/video data inside those containers. I'm not familiar with the details, though. So I guess the question may apply to common combinations of containers + codecs? Thanks in advance.
Okay, to answer my own question after a lot of digging through the specs for MP4, 3GP and AVI...
AVI
Metadata is at the start of AVI files, according to the AVI file format specification.
Video duration is not stored verbatim in AVI files, but is calculated (in microseconds) as dwMicroSecPerFrame x dwTotalFrames.
Reading between the lines of the spec, it seems that many items of metadata can be read directly from fixed offsets within AVI files without any parsing at all. But the spec does not state these offsets explicitly, so using this rule of thumb could be risky.
Offset 32: dwMicroSecPerFrame, offset 48: dwTotalFrames, offset 64: dwWidth, offset 68: dwHeight.
So for AVI, it is possible to extract this metadata with only the first X bytes of the file.
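For example, a hedged one-liner using those offsets on a little-endian machine (GNU od assumed; the caveat above about the offsets being a rule of thumb applies):
# dwWidth at offset 64 and dwHeight at offset 68, both little-endian 32-bit
od -A n -t u4 -j 64 -N 8 video.avi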
MP4, 3GP (3GPP), 3G2 (3GPP2)
All of these file formats are based on the ISO base media file format known as ISO/IEC 14496-12 (MPEG-4 Part 12).
This format allows metadata to be stored anywhere in the file, but in practice it will be either at the start or the end, because the raw captured audio/video data is saved contiguously in the middle. (An exception, however, would be "fragmented" MP4 files, which are rare.)
Only files with the metadata stored at the start can be played via progressive download, but it is up to the capture device or decoder to support this.
AFAICT this means that to extract metadata from these files, only the first X bytes of the file would be required; from that information it could be determined whether the last X bytes are also needed. Bytes in the middle would not be required.
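For the remote object store case, a hedged sketch of that fetch over HTTP, mirroring the head test above (the URL is illustrative; most object stores honour Range requests):
curl -s --range 0-1048575 'https://example.com/videos/test.mp4' | ffprobe -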
