Forcing custom H.264 intra-frames (keyframes) at encode-time? - ffmpeg

I have a video sequence that I'd like to skip to specific frames at playback-time (my player is implemented using AVPlayer in iOS, but that's incidental). Since these frames will fall at unpredictable intervals, I can't use the standard "keyframe every N frames/seconds" functionality present in most video encoders. I do, however, know the target frames in advance.
In order to do this skipping as efficiently as possible, I need to force the target frames to be I-frames at encode time. Ideally in some kind of GUI which would let me scrub to a frame, mark it as a keyframe, and then (re)encode my video.
If such a tool isn't available, I have the feeling this could probably be done by rolling a custom encoder with libavcodec, but I'd rather use a higher-level (and preferably scriptable) tool to do the job if a GUI isn't possible. Is this the kind of task ffmpeg or mencoder can be bent to?
Does anybody have a technique for doing this? Also, it's entirely possible that this is an impossible task because of some fundamental ignorance I have of the H.264 codec. If so, please do put me right.

ffmpeg has a -force_key_frames option that accepts a series of arbitrary timestamps as well as other ways to specify the frames. From the documentation:
-force_key_frames 0:05:00,...
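For example, a full command might look like this (a sketch; the input name and timestamps are placeholders, and -c:a copy simply passes the audio through unchanged):
ffmpeg -i input.mov -force_key_frames 0:00:12,0:01:45,0:03:20 -c:v libx264 -c:a copy output.mp4
The encoder may still insert additional keyframes of its own (e.g. at scene cuts); the forced timestamps are guaranteed, the rest are up to it.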

Answered my own question: it's possible to set custom compression keyframes in Apple Compressor.
Compression markers are also known as manual compression markers. These are markers you can add to a Final Cut Pro sequence (or in the Compressor Preview window) to indicate when Compressor should generate an MPEG I-frame during compression.
Source.

Could you not use chapter markers to jump between sections? Not an ideal solution but a lot easier to achieve.
You can use this software:
http://www.applesolutions.com/bantha/MH.html
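If you'd rather script it, ffmpeg itself can also write chapter markers from a plain-text FFMETADATA file (a sketch; chapters.txt, the timestamps, and the title are placeholders you'd replace with your own). The file looks like this:
;FFMETADATA1
[CHAPTER]
TIMEBASE=1/1000
START=0
END=90000
title=Section 1
and is applied with:
ffmpeg -i input.mov -i chapters.txt -map_metadata 1 -codec copy output.mov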

Related

Why does it take forever just to add audio to an mp4?

I am currently using Kdenlive, but have also used ffmpeg when I have the simple task of adding audio to a video that does not yet have audio. Since it is just a matter of putting the video file together with the audio, it seems like it ought to be simple. Is there something about encoding MP4s that means it must take a lot of processing to complete?
I have good hardware (i7 6700k and gtx 1080), but kdenlive currently estimates 2.5 hours to complete adding audio to a 10 minute video.
Without more info (encoder, settings, video width x height, instructions to reproduce the behavior, etc.) we can only guess. It's probably re-encoding the video instead of only muxing it. Encoding is CPU-intensive and takes a long time. That said, 2.5 hours for a 10-minute video seems excessive, but there is not enough info in the question to say why it takes this long.
If you want to add audio with ffmpeg see How to add a new audio into a video using ffmpeg? This will allow you to mux the video (and optionally the audio) without encoding it: like a copy and paste.
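As a rough sketch of that approach (file names are placeholders):
ffmpeg -i video.mp4 -i audio.m4a -map 0:v:0 -map 1:a:0 -c copy -shortest output.mp4
-c copy leaves both streams untouched, so this finishes in seconds; if your audio is in a codec the container doesn't accept, swap in -c:v copy -c:a aac to re-encode only the audio.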

FFMPEG API -- How much do stream parameters change frame-to-frame?

I'm trying to extract raw streams from devices and files using ffmpeg. I notice the crucial frame information (Video: width, height, pixel format, color space, Audio: sample format) is stored both in the AVCodecContext and in the AVFrame. This means I can access it prior to the stream playing and I can access it for every frame.
How much do I need to account for these values changing frame-to-frame? I found https://ffmpeg.org/doxygen/trunk/demuxing__decoding_8c_source.html#l00081 which indicates that at least width, height, and pixel format may change frame to frame.
Will the color space and sample format also change frame to frame?
Will these changes be temporary (a single frame) or lasting (a significant block of frames) and is there any way to predict for this stream which behavior will occur?
Is there a way to find the most descriptive attributes that this stream is capable of producing, so that I can scale all the lower-quality frames up, but not offer a result that is needlessly higher-quality than the source, even if this is a device or a network stream where I cannot scan all the frames in advance?
The fundamental question is: how do I reconcile the flexibility of this API with the restriction that raw streams (my output) have no way of specifying a change of stream attributes mid-stream? I imagine I will need to either predict the most descriptive attributes to give the stream, or start a new output stream whenever the attributes change. Which choice to make depends on whether these values change rapidly or stay relatively stable.
So, to add to what @szatmary says, the typical use case for stream parameter changes is adaptive streaming:
Imagine you're watching YouTube on a laptop with various methods of internet connectivity, and suddenly bandwidth decreases. Your stream will automatically switch to a lower-bandwidth variant. FFmpeg (which is used by Chrome) needs to support this.
Alternatively, imagine a similar scenario in an RTC video chat.
The reason FFmpeg does what it does is that the API is essentially trying to cater to the common denominator. Videos shot on a phone won't ever change resolution. Neither will most videos exported from video editing software. Even videos from youtube-dl will typically not switch resolution; that is a client-side decision, and youtube-dl simply won't make it. So what should you do? I'd just use the stream information from the first frame(s) and rescale all subsequent frames to that resolution. This will work for 99.99% of cases. Whether you want to accommodate the remaining 0.01% depends on what type of videos you think people will upload and whether resolution changes make any sense in that context.
Does colorspace change? It could (theoretically) in software that mixes screen recordings with video fragments, but it's highly unlikely in practice. Sample format changes about as often as video resolution: quite often in the adaptive scenario, but whether you care depends on your service and the types of videos you expect to get.
Usually not often, or ever. However, this depends on the codec and on options chosen at encode time. I pass the decoded frames through swscale just in case.
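If you can scan the input ahead of time, ffprobe will dump these parameters per frame so you can see whether they ever change (a sketch; input.mp4 is a placeholder):
ffprobe -v error -select_streams v:0 -show_entries frame=width,height,pix_fmt,color_space -of csv=p=0 input.mp4
For a live device or network stream you obviously can't scan ahead, so there the first-frame-plus-rescale strategy above is the practical choice.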

Does simple rescaling from 1080p to a frame height of 720 lead to 720p?

I want to convert a 1080p video to 720p, and eventually to lower resolutions as well.
I have been using ffmpeg for all my video processing activities so far, and would simply approach this task using the following command:
ffmpeg -i tos.mov -vf scale=-1:720 tos_0x720.mov
I understand that this will rescale my video to a new frame size having 720 pixels set as a fixed height and the width dynamically calculated.
What I am not sure about are the implications regarding the quality factors of the video when using ffmpeg this way.
Is it valid to assume that running this command will output a perfect HD 720p quality video?
What would be a benefit of using dedicated video conversion software to accomplish my goal compared to running the above command?
You can choose which scaling algorithm to use by setting the flags option in the scale filter. Some algorithms work better for up-scaling (bilinear) while others are better for down-sampling (bicubic, lanczos). Some are better for sharp graphics, others for gradual changes, some are faster and some are slower.
I think the default value of flags for downscaling is bicubic, while some people recommend lanczos.
To set the flag use:
-vf scale=-1:720:flags=lanczos
Commercial video conversion software uses the same algorithms. For example, Adobe Premiere uses variable-radius bicubic for its Maximum Render Quality option. Such tools might help you choose one algorithm or another depending on what you're after (speed vs. quality), and they may provide tweaks to reduce artifacts resulting from scaling.
There's a lot of literature covering the different algorithms.
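Putting that together, a typical downscale command might look like this (a sketch; the CRF and preset values are starting points I'd suggest, not anything mandated by the question):
ffmpeg -i tos.mov -vf scale=-2:720:flags=lanczos -c:v libx264 -crf 18 -preset slow -c:a copy tos_0x720.mov
scale=-2:720 rounds the computed width to an even number (libx264 refuses odd dimensions), and in practice the encoder settings (-crf, -preset, bitrate) affect perceived quality at least as much as the choice of scaling algorithm.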

Detect frames that have a given image/logo with FFmpeg

I'm trying to split a video by detecting the presence of a marker (an image) in the frames. I've gone over the documentation and I see removelogo but not detectlogo.
Does anyone know how this could be achieved? I know what the logo is and the region it will be on.
I'm thinking I can extract all frames to png's and then analyse them one by one (or n by n) but it might be a lengthy process...
Any pointers?
ffmpeg doesn't have any such ability natively. The delogo filter simply takes a rectangular region from its parameters and interpolates that region based on its surroundings; it doesn't care what the region previously contained and fills it in regardless.
If you need to detect the presence of a logo, that's a totally different task. You'll need to create it yourself; if you're serious about this, I'd recommend that you start familiarizing yourself with the ffmpeg filter API and get ready to get your hands dirty. If the logo has a distinctive color, that might be a good way to detect it.
Since what you're after is probably just outputting information on which frames contain (or don't contain) the logo, one filter to look at as a model is the blackframe filter (which searches for all-black frames).
You can write a detect-logo module: decode the video (to YUV 420p), feed the raw frames to this module, and compute the SAD (sum of absolute differences) between the region where you expect the logo and your reference logo image; if the SAD is negligible, it's a match, so record the frame number. You can then split the video at these frames.
The SAD is computed on the Y (luma) plane only. To save processing, you can scale the video down to a lower resolution first.
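As a sketch of that pipeline (the crop coordinates and the detect_logo program are placeholders for your own logo region and module):
ffmpeg -i input.mp4 -vf "crop=128:64:1160:40,scale=64:32" -f rawvideo -pix_fmt yuv420p - | ./detect_logo
This pipes the cropped, downscaled raw YUV frames to stdout, where your module can compute the SAD on the Y plane and log the matching frame numbers.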
I have successfully detected logos using a Raspberry Pi and a Coral AI accelerator in conjunction with ffmpeg to extract the JPEGs. Crop the image down to just the logo, then feed it to your trained model. Even then you will need to sample a minute or so of video to determine the actual logo's identity.

What container is easiest for combining JPEGS and MP3s as video?

So I have N (for example, 1000) JPEG frames and N/10 (for example, 100) seconds of MP3 sound. I need some container for joining them into one video file at 10 frames/second (a popular container like FLV, AVI, or MOV is preferred). So what I need is an algorithm or code example for combining my data into some popular format. The code example should be in a language like C#, Java, ActionScript, or PHP, and the algorithm should at least be theoretically implementable in ActionScript or PHP.
Can anyone please help me with that?
If you're more concerned about simplicity than anything else, Motion JPEG is probably what you want, combined with the MP3 in an AVI container.
Your best option really is to use an existing library to do the encoding, at least for the container, though; if you do it yourself, you're going to have to write a lot of code to handle things like interleaving video and audio, A/V sync, and so on.
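If you can shell out to ffmpeg from PHP/C#/Java rather than writing the muxer yourself, the Motion JPEG + MP3 in AVI combination is a one-liner (a sketch; the file names and frame-number pattern are placeholders):
ffmpeg -framerate 10 -i frame_%04d.jpg -i audio.mp3 -c:v copy -c:a copy -shortest output.avi
-c:v copy stores the JPEG data as-is (i.e. Motion JPEG) and -c:a copy keeps the MP3 bitstream, so nothing is re-encoded.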
