I know it's possible with FFMPEG, but what can I do if I have a partial file (missing the beginning and the end)? Is it possible to extract some frames from it?
The command
ffmpeg -ss 00:00:25 -t 00:00:00.04 -i YOURMOVIE.MP4 -r 25.0 YOURIMAGE%04d.jpg
will extract frames
beginning at second 25 [-ss 00:00:25]
stopping after 0.04 seconds [-t 00:00:00.04]
reading from input file YOURMOVIE.MP4
using 25.0 frames per second, i.e. one frame every 1/25 of a second [-r 25.0]
as JPEG images named YOURIMAGE%04d.jpg, where %04d is a 4-digit auto-incrementing number with leading zeros
Check your movie's frame rate before applying the [-r] option; the same applies to [-t], unless you want to extract frames at a custom rate.
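If you're not sure what the frame rate is, ffprobe can report it; a minimal check (first video stream assumed) would be something like:
ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate -of default=noprint_wrappers=1 YOURMOVIE.MP4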
I've never tried this with a truncated (corrupted?) input file, though. Worth a try.
This could be VERY difficult. The MP4 file format includes a 'moov' atom which has pointers to the audio and video 'samples'. If the fragment of the mp4 file you have does not have the moov atom, your job would be much more complicated. You'd have to develop logic to examine the 'mdat' atom (which contains all the audio and video samples) and use educated guesses to find the audio and video boundaries.
Even worse, without the moov atom, you won't have the SPS and PPS needed to decode the slices. You'd have to synthesize replacements; if you know the codec used to create the MP4, then you might be able to copy the SPS and PPS from a similarly encoded file; if not, it could be a painful process of trial and error, because the syntax of the slices (the H.264 encoded pictures) is dependent upon values specified in the SPS and PPS.
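As a rough first check of whether your fragment still carries a moov atom (partial.mp4 here is a placeholder name), you can probe it and look for the telltale error, or simply scan for the box name:
ffprobe -v error partial.mp4 2>&1 | grep -i moov
# crude: a hit only means the four bytes 'moov' occur somewhere, not that the box is intact
grep -c --binary-files=text moov partial.mp4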
I am making a datamoshing program in C++, and I need to find a way to remove one frame from a video (specifically, the p-frame right after a sequence jump) without re-encoding the video. I am currently using h.264 but would like to be able to do this with VP9 and AV1 as well.
I have one way of going about it, but it doesn't work for one frustrating reason (mentioned later). I can turn the original video into two intermediate videos - one with just the i-frame before the sequence jump, and one with the p-frame that was two frames later. I then create a concat.txt file with the following contents:
file video.mkv
file video1.mkv
And run ffmpeg -y -f concat -i concat.txt -c copy output.mp4. This produces the expected output, although it is of course not as efficient as I would like, since it requires creating intermediate files and reading the .txt file from disk (performance is very important in this project).
But worse yet, I couldn't generate the intermediate videos with ffmpeg; I had to use Avidemux. I tried all sorts of variations on ffmpeg -y -ss 00:00:00 -i video.mp4 -t 0.04 -codec copy video.mkv, but that command really bugs out with videos 1-2 frames long, while it works fine for longer videos. My best guess is that there is some internal check to ensure the output video is not corrupt (which, unfortunately, is exactly what I want it to be!).
Maybe there's a way to do it this way that gets around that problem, or better yet, a more elegant solution to the problem in the first place.
Thanks!
If you know the PTS or data offset or packet index of the target frame, then you can use the noise bitstream filter. This is codec-agnostic.
ffmpeg -copyts -i input -c copy -enc_time_base -1 -bsf:v:0 "noise=drop=eq(pos\,11291)" out
This will drop the packet from the first video stream stored at offset 11291 in the input file. See other available variables at http://www.ffmpeg.org/ffmpeg-bitstream-filters.html#noise
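To look up the pos, PTS, or index of the packet you want to drop, one option is ffprobe's packet listing (first video stream assumed; "input" is a placeholder and the criterion for picking the frame is up to you):
ffprobe -v error -select_streams v:0 -show_packets -show_entries packet=pts,pos,flags -of csv input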
I have an input consisting of a sequence of WebP images concatenated (for various reasons) into a single file. I have full control over the single-file format and can potentially reformat it as a container (IVF, etc.) if a proper one exists.
I would like ffmpeg to consume this input and time each individual frame properly (say the first is displayed for 5 seconds, the next for 3 seconds, then 7, 12, etc.) and output a video (mp4).
My current approach is using image2pipe or webp_pipe followed by a list of loop filters, but I am curious whether there are any solid alternatives, potentially a simple format/container I could use, in order to reduce or completely avoid the ffmpeg filter instructions, as there might be hundreds or more in total.
ffmpeg -filter_complex "...movie=input.webps:f=webp_pipe,loop=10:1:20,loop=10:1:10..." -y out.mp4
I am aware of the concat demuxer, but having a separate file for each input image is not an option in my case.
I have tried the IVF format, which works OK for VP8 frames but doesn't seem to accept WebP. An alternative would be welcome, but far too many formats exist for me to study each one, so help would be appreciated.
I have a set of bare MP3 files. Bare as in I removed all tags (no ID3, no Xing, no Info) from those files.
Just before sending one of these files to the client, I want to add an Info tag. All of my files are CBR so we will use an Info tag (no Xing).
Right now I get the first 4 bytes of the existing MP3 to determine the version (MPEG-1, Layer III), bitrate, frequency, stereo mode, etc., and thus the size of one frame. I create the tag that way, reusing these 4 bytes for the Info tag and determining the size of the frame.
For those wondering, these 4 bytes may look like this:
FF FB 78 04
To me it felt like you are expected to use the exact same first 4 bytes in the Info tag as are found in the other audio frames of the MP3, but ffmpeg sticks in an Info tag with a hard-coded header (wrong bitrate, wrong frequency, etc.).
My question is: is ffmpeg really doing it right? (LAME doesn't do that.) Could I do the same, skipping the load of the first 4 bytes, and still have the great majority of players out there play my files as expected?
Note: since I read these 4 bytes over the network, it would definitely save time and some bandwidth not to have to load them with a HEAD request. Those are resources I could use for the GET requests instead...
The reason for the difference is that with certain configurations, the size of a frame is less than 192 bytes. In that case, the full Info/Xing tag will not fit (and from what I can see, the four optional fields are always included, so an Info/Xing tag is always full-size even when it doesn't need to be).
So, for example, if you have a single channel with 44.1kHz data at 32kbps, the MP3 frame is 117 or 118 bytes. This is less than what is necessary to save the Info/Xing tag.
What LAME does in that situation is forfeit the Info/Xing tag. It's not going to be seen anywhere in the file.
On the other hand, what FFMPEG does is create a frame with a higher bitrate. So instead of 32kbps, it will try 48kbps and then 64kbps. Once it finds a configuration that offers a frame large enough to hold the Info/Xing tag, it stops. (I have not looked at the code, so I do not know how FFMPEG actually finds a large enough frame, but on my end I just incremented the bitrate index field by one until the frame size was >= 192, and that works.)
You can replicate this by first creating (or converting to) a 44.1kHz WAVE file, then converting it to MP3 with ffmpeg at a 32kbps bitrate, and observing that the Info/Xing tag ends up with a different bitrate.
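For example, something along these lines should reproduce it (file names are placeholders and libmp3lame is assumed as the encoder); the ID3v2 header is disabled so the Info frame sits right at the start of the output:
ffmpeg -i input.wav -ac 1 -ar 44100 -c:a libmp3lame -b:a 32k -id3v2_version 0 out.mp3
# the first 4 bytes shown here are the header of the Info frame
xxd -l 16 out.mp3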
I'm encoding videos by scenes. At the moment I have two solutions for doing so. The first one is a Python application that gives me a list of frames representing scene changes, like this:
285
378
553
1145
...
The first scene goes from frame 1 to 285, the second from 285 to 378, and so on. So I made a bash script that encodes all these scenes. Basically it takes the current and previous frame numbers, converts them to times, and finally runs the ffmpeg command:
# convert frame numbers to seconds (assuming 24 fps) and compute the segment length
begin=$(awk "BEGIN { print $previous / 24 }")
end=$(awk "BEGIN { print $current / 24 }")
time=$(awk "BEGIN { print $end - $begin }")
# encode the scene starting at $begin with duration $time
ffmpeg -i "$video" -r 24 -c:v libx265 -f mp4 -c:a aac -strict experimental -b:v 1.5M -ss "$begin" -t "$time" "output$count.mp4" -nostdin
This works perfectly. The second method uses ffmpeg itself. I run a command that gives me a list of times, like this (a sketch of a typical scene-detection invocation follows the list):
15.75
23.0417
56.0833
71.2917
...
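(For context, a typical ffmpeg scene-change detection invocation looks roughly like this, with 0.4 as an arbitrary threshold; the times come out of the showinfo log lines:)
ffmpeg -i "$video" -filter:v "select='gt(scene,0.4)',showinfo" -f null - 2>&1 | grep -o 'pts_time:[0-9.]*'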
Again I made a bash script that encodes all these segments. In this case I don't have to convert frames to times because what I get are already times:
# compute the segment length from consecutive scene-change times
time=$(awk "BEGIN { print $current - $previous }")
ffmpeg -i "$video" -r 24 -c:v libx265 -f mp4 -c:a aac -strict experimental -b:v 1.5M -ss "$previous" -t "$time" "output$count.mp4" -nostdin
With all that explained, here comes the problem. Once all the scenes are encoded I need to concatenate them, and for that I create a list with the video names and then run the ffmpeg command.
list.txt
file 'output1.mp4'
file 'output2.mp4'
file 'output3.mp4'
file 'output4.mp4'
command:
ffmpeg -f concat -i list.txt -c copy big_buck_bunny.mp4
The problem is that the concatenated video is longer than the original by 2.11 seconds. The original lasts 596.45 seconds and the encoded one lasts 598.56. I added up every segment's duration and got 598.56, so I think the problem is in the encoding process. Both videos have the same number of frames. My goal is to get metrics about the encoding process; when I run VQMT to get PSNR and SSIM I get weird results, which I think is due to this problem.
By the way, I'm using the big_buck_bunny video.
The difference is probably due to the copy codec. In the latter case, you tell ffmpeg to copy the segments, but it can't do that exactly at your input times.
It first has to find the previous I-frame (a frame that can be decoded without reference to any previous frame) and start from there.
To get what you need, you either have to re-encode the video (as you did in the two earlier examples) or change the times so that the cuts fall on I-frames.
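To find where the I-frames are, so your cut times can land on them, one way is to list only keyframe timestamps (first video stream assumed, input.mp4 standing in for your source):
ffprobe -v error -select_streams v:0 -skip_frame nokey -show_frames -show_entries frame=pts_time -of csv=p=0 input.mp4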
To make sure I'm understanding your issue correctly:
You have a source video (encoded at a variable frame rate, close to 18 fps).
You want to split the source video with ffmpeg, forcing the frame rate to 24 fps.
Then you want to concatenate the segments.
I think the issue is mainly a discrepancy in the timing (if I divide the frame indexes by the times you've given, I get between 16 fps and 18 fps). When you convert the segments in step 2, the output segments will be at 24 fps. ffmpeg does not resample along the time axis, so if you force a frame rate, the video will speed up or slow down.
There is also the issue of consistency for the stream:
Typically, a video stream must start with an I-frame, so when splitting with the copy codec, FFMPEG has to locate the previous I-frame, and this changes the duration of the segment.
When you are concatenating, you could also have a consistency issue: if the segment you are concatenating ends with an I-frame and the next one starts with an I-frame, it's possible FFMPEG drops one of them (although I don't remember the current behavior).
So, to solve your issue, if I were you I would avoid step 2 (it's bad for quality anyway). That is, I would use ffmpeg to split the segments of interest based on frame number (the only value that isn't approximate in your scheme) into PNG or PPM frames (or to a pipe if you don't care about keeping them), and then concatenate all the frames by encoding them in a final step with the expected rate (that is, totalFrameCount / totalVideoTime).
You'll get a smaller and higher quality final video.
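A rough sketch of that approach, using your scene 2 (frames 285 to 377) as the example and placeholder names throughout:
# dump one scene losslessly, selecting by frame number
ffmpeg -i "$video" -vf "select='between(n,285,377)',setpts=PTS-STARTPTS" -vsync 0 scene2_%05d.png
# ...repeat per scene, collect and renumber the frames, then encode them all in one pass;
# 17.5 below stands in for totalFrameCount / totalVideoTime
ffmpeg -framerate 17.5 -i allframes_%05d.png -c:v libx265 -b:v 1.5M final.mp4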
If you can't do what I said for whatever reason, at least for the concat input, you should use the ffconcat format:
ffconcat version 1.0
file segment1
duration 12.2
file segment2
duration 10.3
This will give you the expected durations by cutting each segment short if it's longer.
For selecting by frame number (instead of by time, since time is hard to get right on a variable frame rate video), you should use the select filter, like this:
-vf "select='between(n\,start_frame_num\,end_frame_num)',setpts=PTS-STARTPTS"
I suggest checking the input and output frame rates and making sure they match. That could be a source of the discrepancy.
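For example (original.mp4 standing in for the source file; the output name is the one from the concat command above):
ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate,avg_frame_rate -of default=noprint_wrappers=1 original.mp4
ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate,avg_frame_rate -of default=noprint_wrappers=1 big_buck_bunny.mp4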
I'm ingesting an RTMP stream and converting it to a fragmented MP4 file in JavaScript. It took a week of work, but I'm almost finished with this task. I'm generating a valid ftyp atom, moov atom, and moof atom, and the first frame of the video actually plays (with audio) before it goes into infinite buffering, with no errors listed in chrome://media-internals.
Plugging the video into ffprobe, I get an error similar to:
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x558559198080] Failed to add index entry
Last message repeated 368 times
[h264 @ 0x55855919b300] Invalid NAL unit size (-619501801 > 966).
[h264 @ 0x55855919b300] Error splitting the input into NAL units.
This led me on a massive hunt for data-alignment issues or invalid byte offsets in my tfhd and trun atoms; however, no matter where I looked or how I sliced the data, I couldn't find any problems in the moof atom.
I then took the original FLV file and converted it to an MP4 in ffmpeg with the following command:
ffmpeg -i ~/Videos/rtmp/big_buck_bunny.flv -c copy -ss 5 -t 10 -movflags frag_keyframe+empty_moov+faststart test.mp4
I opened both the MP4 I was creating and the MP4 output by ffmpeg in an atom parsing tool and compared the two:
The first thing that jumped out at me was that the ffmpeg-generated file has multiple video samples per moof. Specifically, every moof started with one keyframe, then contained all difference frames up to the next keyframe (which was used as the start of the following moof atom).
Contrast this with how I'm generating my MP4: I create a moof atom every time an FLV VIDEODATA packet arrives. This means my moof may not contain a keyframe (and usually doesn't).
Could this be why I'm having trouble? Or is there something else I'm missing?
The video files in question can be downloaded here:
ffmpeg-generated file: test.mp4
manually-generated file: invalid-nal-size.mp4
Another issue I noticed was ffmpeg's prolific use of base_data_offset in the tfhd atom. However, when I tried tracking the total number of bytes appended and setting the base_data_offset myself, I got an error in Chrome along the lines of "MSE doesn't support base_data_offset". Per the ISO/IEC 14496-12 spec:
If not provided, the base-data-offset for the first track in the movie fragment is the position of the first byte of the enclosing Movie Fragment Box, and for second and subsequent track fragments, the default is the end of the data defined by the preceding fragment.
This wording leads me to believe that the data_offset in the first trun atom should be equal to the size of the moof atom, and the data_offset in the second trun atom should be 0 (0 bytes from the end of the data defined by the preceding fragment). However, when I tried this I got an error that the video data couldn't be parsed. What did lead to data that could be parsed was the length of the moof atom plus the total length of the first track (as if the base offset for the second track were the first byte of the enclosing moof box, the same as for the first track).
No, the moof does not need to start with a key frame. The file you are generating produces invalid NALU size errors because it has invalid NAL sizes. Every NAL (in the mdat) must have its size prepended to it. Looking at your file, the first 4 bytes after the mdat are 0x21180C68, which is WAY too large to be a valid size.
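One quick sanity check (MDAT_OFFSET is a placeholder for wherever the mdat box starts in your file; the +8 skips the box size and type fields) is to dump the bytes right after the mdat header and compare the 4-byte big-endian value with the length of the first NAL you actually wrote:
xxd -s $((MDAT_OFFSET + 8)) -l 8 invalid-nal-size.mp4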