Libavformat/FFMPEG: Muxing into mp4 with AVFormatContext drops the final frame, depending on the number of frames

I am trying to use libavformat to create a .mp4 video
with a single h.264 video stream, but the final frame in the resulting file
often has a duration of zero and is effectively dropped from the video.
Strangely enough, whether the final frame is dropped or not depends on how many
frames I try to add to the file. Some simple testing that I outline below makes
me think that I am somehow misconfiguring either the AVFormatContext or the
h.264 encoder, resulting in two edit lists that sometimes chop off the final
frame. I will also post a simplified version of the code I am using, in case I'm
making some obvious mistake. Any help would be greatly appreciated: I've been
struggling with this issue for the past few days and have made little progress.
I can recover the dropped frame by creating a new mp4 container with the ffmpeg
binary and the copy codec if I use the -ignore_editlist option (see the command
below). Inspecting the file with a missing frame using ffprobe, mp4trackdump,
or mp4file --dump shows that the final frame is dropped if its sample time is
exactly the same as the end of the edit list. When I make a file that has no dropped frames, it
still has two edit lists: the only difference is that the end time of the edit
list is beyond all samples in files that do not have dropped frames. Though this
is hardly a fair comparison, if I make a .png for each frame and then generate
a .mp4 with ffmpeg using the image2 demuxer and similar h.264 settings, I
produce a movie with all frames present, only one edit list, and similar PTS
times as my mangled movies with two edit lists. In this case, the edit list
always ends after the last frame/sample time.
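For reference, the recovery workaround mentioned above is just a stream copy with the mov demuxer's edit lists ignored (-ignore_editlist is an input option, so it must come before -i; file names are placeholders):
$ ffmpeg -ignore_editlist 1 -i testing.mp4 -c copy recovered.mp4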
I am using this command to determine the number of frames in the resulting stream,
though I also get the same number with other utilities:
ffprobe -v error -count_frames -select_streams v:0 -show_entries stream=nb_read_frames -of default=nokey=1:noprint_wrappers=1 video_file_name.mp4
Simple inspection of the file with ffprobe shows no obviously alarming signs to
me, besides the framerate being affected by the missing frame (the target was
24):
$ ffprobe -hide_banner testing.mp4
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'testing.mp4':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
encoder : Lavf58.45.100
Duration: 00:00:04.13, start: 0.041016, bitrate: 724 kb/s
Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 100x100, 722 kb/s, 24.24 fps, 24 tbr, 12288 tbn, 48 tbc (default)
Metadata:
handler_name : VideoHandler
The files that I generate programmatically always have two edit lists, one of
which is very short. In files both with and without a missing frame, the
duration of one of the frames is 0, while all the others have the same duration
(512). You can see this in the ffmpeg output for this file that I tried to put
100 frames into, though only 99 are visible despite the file containing all 100
samples.
$ ffmpeg -hide_banner -y -v 9 -loglevel 99 -i testing.mp4
...
<edited to remove the class printing>
type:'edts' parent:'trak' sz: 48 100 948
type:'elst' parent:'edts' sz: 40 8 40
track[0].edit_count = 2
duration=41 time=-1 rate=1.000000
duration=4125 time=0 rate=1.000000
type:'mdia' parent:'trak' sz: 808 148 948
type:'mdhd' parent:'mdia' sz: 32 8 800
type:'hdlr' parent:'mdia' sz: 45 40 800
ctype=[0][0][0][0]
stype=vide
type:'minf' parent:'mdia' sz: 723 85 800
type:'vmhd' parent:'minf' sz: 20 8 715
type:'dinf' parent:'minf' sz: 36 28 715
type:'dref' parent:'dinf' sz: 28 8 28
Unknown dref type 0x206c7275 size 12
type:'stbl' parent:'minf' sz: 659 64 715
type:'stsd' parent:'stbl' sz: 151 8 651
size=135 4CC=avc1 codec_type=0
type:'avcC' parent:'stsd' sz: 49 8 49
type:'stts' parent:'stbl' sz: 32 159 651
track[0].stts.entries = 2
sample_count=99, sample_duration=512
sample_count=1, sample_duration=0
...
AVIndex stream 0, sample 99, offset 5a0ed, dts 50688, size 3707, distance 0, keyframe 1
Processing st: 0, edit list 0 - media time: -1, duration: 504
Processing st: 0, edit list 1 - media time: 0, duration: 50688
type:'udta' parent:'moov' sz: 98 1072 1162
...
The last frame has zero duration:
$ mp4trackdump -v testing.mp4
...
mp4file testing.mp4, track 1, samples 100, timescale 12288
sampleId 1, size 6943 duration 512 time 0 00:00:00.000 S
sampleId 2, size 3671 duration 512 time 512 00:00:00.041 S
...
sampleId 99, size 3687 duration 512 time 50176 00:00:04.083 S
sampleId 100, size 3707 duration 0 time 50688 00:00:04.125 S
Non-mangled videos that I generate have similar structure, as you can see in
this video that had 99 input frames, all of which are visible in the output.
Even though the sample_duration is set to zero for one of the samples in the
stts box, it is not dropped from the frame count or when reading the frames back
in with ffmpeg.
$ ffmpeg -hide_banner -y -v 9 -loglevel 99 -i testing_99.mp4
...
type:'elst' parent:'edts' sz: 40 8 40
track[0].edit_count = 2
duration=41 time=-1 rate=1.000000
duration=4084 time=0 rate=1.000000
...
track[0].stts.entries = 2
sample_count=98, sample_duration=512
sample_count=1, sample_duration=0
...
AVIndex stream 0, sample 98, offset 5d599, dts 50176, size 3833, distance 0, keyframe 1
Processing st: 0, edit list 0 - media time: -1, duration: 504
Processing st: 0, edit list 1 - media time: 0, duration: 50184
...
$ mp4trackdump -v testing_99.mp4
...
sampleId 98, size 3814 duration 512 time 49664 00:00:04.041 S
sampleId 99, size 3833 duration 0 time 50176 00:00:04.083 S
One difference that jumps out to me is that the mangled file's second edit list
ends at time 50688, which coincides with the last sample, while the non-mangled
file's edit list ends at 50184, which is after the time of the last sample
at 50176. As I mentioned before, whether the last frame is clipped depends on
the number of frames I encode and mux into the container: 100 input frames
results in 1 dropped frame, 99 results in 0, 98 in 0, 97 in 1, etc...
Here is the code that I used to generate these files, which is a MWE script
version of library functions that I am modifying. It is written in Julia,
which I do not think is important here, and calls the FFMPEG library version
4.3.1. It's more or less a direct translation of the FFMPEG muxing
demo, although the codec
context here is created before the format context. I am presenting the code that
interacts with ffmpeg first, although it relies on some helper code that I will
put below.
The helper code just makes it easier to work with nested C structs in Julia, and
allows . syntax in Julia to be used in place of C's arrow (->) operator for
field access of struct pointers. Libav structs such as AVFrame appear as a
thin wrapper type AVFramePtr, and similarly AVStream appears as
AVStreamPtr etc... These act like single or double pointers for the purposes
of function calls, depending on the function's type signature. Hopefully it will
be clear enough to understand if you are familiar with working with libav in C,
and I don't think looking at the helper code should be necessary if you don't
want to run the code.
# Function to transfer array to AVPicture/AVFrame
function transfer_img_buf_to_frame!(frame, img)
img_pointer = pointer(img)
data_pointer = frame.data[1] # Base-1 indexing, get pointer to first data buffer in frame
for h = 1:frame.height
data_line_pointer = data_pointer + (h-1) * frame.linesize[1] # base-1 indexing
img_line_pointer = img_pointer + (h-1) * frame.width
unsafe_copyto!(data_line_pointer, img_line_pointer, frame.width) # base-1 indexing
end
end
# Function to transfer AVFrame to AVCodecContext, and AVPacket to AVFormatContext
function encode_mux!(packet, format_context, frame, codec_context; flush = false)
if flush
fret = avcodec_send_frame(codec_context, C_NULL)
else
fret = avcodec_send_frame(codec_context, frame)
end
if fret < 0 && !in(fret, [-Libc.EAGAIN, VIO_AVERROR_EOF])
error("Error $fret sending a frame for encoding")
end
pret = Cint(0)
while pret >= 0
pret = avcodec_receive_packet(codec_context, packet)
if pret == -Libc.EAGAIN || pret == VIO_AVERROR_EOF
break
elseif pret < 0
error("Error $pret during encoding")
end
stream = format_context.streams[1] # Base-1 indexing
av_packet_rescale_ts(packet, codec_context.time_base, stream.time_base)
packet.stream_index = 0
ret = av_interleaved_write_frame(format_context, packet)
ret < 0 && error("Error muxing packet: $ret")
end
if !flush && fret == -Libc.EAGAIN && pret != VIO_AVERROR_EOF
fret = avcodec_send_frame(codec_context, frame)
if fret < 0 && fret != VIO_AVERROR_EOF
error("Error $fret sending a frame for encoding")
end
end
return pret
end
# Set parameters of test movie
nframe = 100
width, height = 100, 100
framerate = 24
gop = 0
codec_name = "libx264"
filename = "testing.mp4"
((width % 2 !=0) || (height % 2 !=0)) && error("Encoding error: Image dims must be a multiple of two")
# Make test images
imgstack = map(x->rand(UInt8,width,height),1:nframe);
pix_fmt = AV_PIX_FMT_GRAY8
framerate_rat = Rational(framerate)
codec = avcodec_find_encoder_by_name(codec_name)
codec == C_NULL && error("Codec '$codec_name' not found")
# Allocate AVCodecContext
codec_context_p = avcodec_alloc_context3(codec) # raw pointer
codec_context_p == C_NULL && error("Could not allocate AVCodecContext")
# Easier to work with pointer that acts like a c struct pointer, type defined below
codec_context = AVCodecContextPtr(codec_context_p)
codec_context.width = width
codec_context.height = height
codec_context.time_base = AVRational(1/framerate_rat)
codec_context.framerate = AVRational(framerate_rat)
codec_context.pix_fmt = pix_fmt
codec_context.gop_size = gop
ret = avcodec_open2(codec_context, codec, C_NULL)
ret < 0 && error("Could not open codec: Return code $(ret)")
# Allocate AVFrame and wrap it in a Julia convenience type
frame_p = av_frame_alloc()
frame_p == C_NULL && error("Could not allocate AVFrame")
frame = AVFramePtr(frame_p)
frame.format = pix_fmt
frame.width = width
frame.height = height
# Allocate picture buffers for frame
ret = av_frame_get_buffer(frame, 0)
ret < 0 && error("Could not allocate the video frame data")
# Allocate AVPacket and wrap it in a Julia convenience type
packet_p = av_packet_alloc()
packet_p == C_NULL && error("Could not allocate AVPacket")
packet = AVPacketPtr(packet_p)
# Allocate AVFormatContext and wrap it in a Julia convenience type
format_context_dp = Ref(Ptr{AVFormatContext}()) # double pointer
ret = avformat_alloc_output_context2(format_context_dp, C_NULL, C_NULL, filename)
if ret != 0 || format_context_dp[] == C_NULL
error("Could not allocate AVFormatContext")
end
format_context = AVFormatContextPtr(format_context_dp)
# Add video stream to AVFormatContext and configure it to use the encoder made above
stream_p = avformat_new_stream(format_context, C_NULL)
stream_p == C_NULL && error("Could not allocate output stream")
stream = AVStreamPtr(stream_p) # Wrap this pointer in a convenience type
stream.time_base = codec_context.time_base
stream.avg_frame_rate = 1 / convert(Rational, stream.time_base)
ret = avcodec_parameters_from_context(stream.codecpar, codec_context)
ret < 0 && error("Could not set parameters of stream")
# Open the AVIOContext
pb_ptr = field_ptr(format_context, :pb)
# The following is just a call to avio_open, with a bit of extra protection
# so the Julia garbage collector does not destroy format_context during the call
ret = GC.@preserve format_context avio_open(pb_ptr, filename, AVIO_FLAG_WRITE)
ret < 0 && error("Could not open file $filename for writing")
# Write the header
ret = avformat_write_header(format_context, C_NULL)
ret < 0 && error("Could not write header")
# Encode and mux each frame
for i in 1:nframe # iterate from 1 to nframe
img = imgstack[i] # base-1 indexing
ret = av_frame_make_writable(frame)
ret < 0 && error("Could not make frame writable")
transfer_img_buf_to_frame!(frame, img)
frame.pts = i
encode_mux!(packet, format_context, frame, codec_context)
end
# Flush the encoder
encode_mux!(packet, format_context, frame, codec_context; flush = true)
# Write the trailer
av_write_trailer(format_context)
# Close the AVIOContext
pb_ptr = field_ptr(format_context, :pb) # get pointer to format_context.pb
ret = GC.@preserve format_context avio_closep(pb_ptr) # simply a call to avio_closep
ret < 0 && error("Could not free AVIOContext")
# Deallocation
avcodec_free_context(codec_context)
av_frame_free(frame)
av_packet_free(packet)
avformat_free_context(format_context)
Below is the helper code that makes accessing pointers to nested c structs not a
total pain in Julia. If you try to run the code yourself, please enter it
before the logic of the code shown above. It requires
VideoIO.jl, a Julia wrapper to libav.
# Convenience type and methods to make the above code look more like C
using Base: RefValue, fieldindex
import Base: unsafe_convert, getproperty, setproperty!, getindex, setindex!,
unsafe_wrap, propertynames
# VideoIO is a Julia wrapper to libav
#
# Bring bindings to libav library functions into namespace
using VideoIO: AVCodecContext, AVFrame, AVPacket, AVFormatContext, AVRational,
AVStream, AV_PIX_FMT_GRAY8, AVIO_FLAG_WRITE, AVFMT_NOFILE,
avformat_alloc_output_context2, avformat_free_context, avformat_new_stream,
av_dump_format, avio_open, avformat_write_header,
avcodec_parameters_from_context, av_frame_make_writable, avcodec_send_frame,
avcodec_receive_packet, av_packet_rescale_ts, av_interleaved_write_frame,
avformat_query_codec, avcodec_find_encoder_by_name, avcodec_alloc_context3,
avcodec_open2, av_frame_alloc, av_frame_get_buffer, av_packet_alloc,
avio_closep, av_write_trailer, avcodec_free_context, av_frame_free,
av_packet_free
# Submodule of VideoIO
using VideoIO: AVCodecs
# Need to import this function from Julia's Base to add more methods
import Base: convert
const VIO_AVERROR_EOF = -541478725 # AVERROR_EOF
# Methods to convert between AVRational and Julia's Rational type, because it's
# hard to access the AV rational macros with Julia's C interface
convert(::Type{Rational{T}}, r::AVRational) where T = Rational{T}(r.num, r.den)
convert(::Type{Rational}, r::AVRational) = Rational(r.num, r.den)
convert(::Type{AVRational}, r::Rational) = AVRational(numerator(r), denominator(r))
"""
mutable struct NestedCStruct{T}
Wraps a pointer to a C struct, and acts like a double pointer to that memory.
The methods below will automatically convert it to a single pointer if needed
for a function call, and make interacting with it in Julia look (more) similar
to interacting with it in C, except '->' in C is replaced by '.' in Julia.
"""
mutable struct NestedCStruct{T}
data::RefValue{Ptr{T}}
end
NestedCStruct{T}(a::Ptr) where T = NestedCStruct{T}(Ref(a))
NestedCStruct(a::Ptr{T}) where T = NestedCStruct{T}(a)
const AVCodecContextPtr = NestedCStruct{AVCodecContext}
const AVFramePtr = NestedCStruct{AVFrame}
const AVPacketPtr = NestedCStruct{AVPacket}
const AVFormatContextPtr = NestedCStruct{AVFormatContext}
const AVStreamPtr = NestedCStruct{AVStream}
function field_ptr(::Type{S}, struct_pointer::Ptr{T}, field::Symbol,
index::Integer = 1) where {S,T}
fieldpos = fieldindex(T, field)
field_pointer = convert(Ptr{S}, struct_pointer) +
fieldoffset(T, fieldpos) + (index - 1) * sizeof(S)
return field_pointer
end
field_ptr(a::Ptr{T}, field::Symbol, args...) where T =
field_ptr(fieldtype(T, field), a, field, args...)
function check_ptr_valid(p::Ptr, err::Bool = true)
valid = p != C_NULL
err && !valid && error("Invalid pointer")
valid
end
unsafe_convert(::Type{Ptr{T}}, ap::NestedCStruct{T}) where T =
getfield(ap, :data)[]
unsafe_convert(::Type{Ptr{Ptr{T}}}, ap::NestedCStruct{T}) where T =
unsafe_convert(Ptr{Ptr{T}}, getfield(ap, :data))
function check_ptr_valid(a::NestedCStruct{T}, args...) where T
p = unsafe_convert(Ptr{T}, a)
GC.@preserve a check_ptr_valid(p, args...)
end
nested_wrap(x::Ptr{T}) where T = NestedCStruct(x)
nested_wrap(x) = x
function getproperty(ap::NestedCStruct{T}, s::Symbol) where T
check_ptr_valid(ap)
p = unsafe_convert(Ptr{T}, ap)
res = GC.@preserve ap unsafe_load(field_ptr(p, s))
nested_wrap(res)
end
function setproperty!(ap::NestedCStruct{T}, s::Symbol, x) where T
check_ptr_valid(ap)
p = unsafe_convert(Ptr{T}, ap)
fp = field_ptr(p, s)
GC.@preserve ap unsafe_store!(fp, x)
end
function getindex(ap::NestedCStruct{T}, i::Integer) where T
check_ptr_valid(ap)
p = unsafe_convert(Ptr{T}, ap)
res = GC.@preserve ap unsafe_load(p, i)
nested_wrap(res)
end
function setindex!(ap::NestedCStruct{T}, x, i::Integer) where T
check_ptr_valid(ap)
p = unsafe_convert(Ptr{T}, ap)
GC.@preserve ap unsafe_store!(p, x, i)
end
function unsafe_wrap(::Type{T}, ap::NestedCStruct{S}, i) where {S, T}
check_ptr_valid(ap)
p = unsafe_convert(Ptr{S}, ap)
GC.@preserve ap unsafe_wrap(T, p, i)
end
function field_ptr(::Type{S}, a::NestedCStruct{T}, field::Symbol,
args...) where {S, T}
check_ptr_valid(a)
p = unsafe_convert(Ptr{T}, a)
GC.@preserve a field_ptr(S, p, field, args...)
end
field_ptr(a::NestedCStruct{T}, field::Symbol, args...) where T =
field_ptr(fieldtype(T, field), a, field, args...)
propertynames(ap::T) where {S, T<:NestedCStruct{S}} = (fieldnames(S)...,
fieldnames(T)...)
Edit: Some things that I have already tried
Explicitly setting the stream duration to be the same number as the number of frames that I add, or a few more beyond that
Explicitly setting the stream start time to zero, while the first frame has a PTS of 1
Playing around with encoder parameters, as well as gop_size, using B frames, etc.
Setting the private data for the mov/mp4 muxer to set the movflag negative_cts_offsets
Changing the framerate
Tried different pixel formats, such as AV_PIX_FMT_YUV420P
Also, to be clear: while I can remux the file into a new container while ignoring the edit lists to work around this problem, I am hoping not to make damaged mp4 files in the first place.

I had a similar issue, where the final frame was missing and this caused the resulting calculated FPS to be different from what I expected.
It doesn't seem like you are setting AVPacket's duration field. I found that relying on automatic duration (leaving the field at 0) exhibited the issue you describe.
If you have a constant framerate you can calculate what the duration should be, e.g. set it to 512 for a 12800 time base (512/12800 = 1/25 of a second) for 25 FPS. Hopefully that helps.
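In C terms, against the same send/receive API the question uses, the fix is one line in the receive loop (a sketch; the names mirror the question's code, and it assumes a constant frame rate with an encoder time base of 1/framerate, so every frame lasts exactly one tick):
while (avcodec_receive_packet(codec_context, packet) >= 0) {
    packet->duration = 1; // one tick of codec_context->time_base per frame
    av_packet_rescale_ts(packet, codec_context->time_base, stream->time_base);
    packet->stream_index = 0;
    av_interleaved_write_frame(format_context, packet);
}
av_packet_rescale_ts converts the duration field along with pts/dts, so setting it before the rescale keeps it consistent with the stream time base.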

Related

Why is ffmpeg faster than this minimal example?

I'm wanting to read the audio out of a video file as fast as possible, using the libav libraries. It's all working fine, but it seems like it could be faster.
To get a performance baseline, I ran this ffmpeg command and timed it:
time ffmpeg -threads 1 -i file -map 0:a:0 -f null -
On a test file (a 2.5gb 2hr .MOV with pcm_s16be audio) this comes out to about 1.35 seconds on my M1 Macbook Pro.
On the other hand, this minimal C code (based on FFmpeg's "Demuxing and decoding" example) is consistently around 0.3 seconds slower.
#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
static int decode_packet(AVCodecContext *dec, const AVPacket *pkt, AVFrame *frame)
{
int ret = 0;
// submit the packet to the decoder
ret = avcodec_send_packet(dec, pkt);
// get all the available frames from the decoder
while (ret >= 0) {
ret = avcodec_receive_frame(dec, frame);
av_frame_unref(frame);
}
return 0;
}
int main (int argc, char **argv)
{
int ret = 0;
AVFormatContext *fmt_ctx = NULL;
AVCodecContext *dec_ctx = NULL;
AVFrame *frame = NULL;
AVPacket *pkt = NULL;
if (argc != 3) {
exit(1);
}
int stream_idx = atoi(argv[2]);
/* open input file, and allocate format context */
avformat_open_input(&fmt_ctx, argv[1], NULL, NULL);
/* get the stream */
AVStream *st = fmt_ctx->streams[stream_idx];
/* find a decoder for the stream */
AVCodec *dec = avcodec_find_decoder(st->codecpar->codec_id);
/* allocate a codec context for the decoder */
dec_ctx = avcodec_alloc_context3(dec);
/* copy codec parameters from input stream to output codec context */
avcodec_parameters_to_context(dec_ctx, st->codecpar);
/* init the decoder */
avcodec_open2(dec_ctx, dec, NULL);
/* allocate frame and packet structs */
frame = av_frame_alloc();
pkt = av_packet_alloc();
/* read frames from the specified stream */
while (av_read_frame(fmt_ctx, pkt) >= 0) {
if (pkt->stream_index == stream_idx)
ret = decode_packet(dec_ctx, pkt, frame);
av_packet_unref(pkt);
if (ret < 0)
break;
}
/* flush the decoders */
decode_packet(dec_ctx, NULL, frame);
return ret < 0;
}
I tried measuring parts of this program to see if it was spending a lot of time in the setup, but it's not – at least 1.5 seconds of the runtime is the loop where it's reading frames.
So I took some flamegraph recordings (using cargo-flamegraph) and ran each a few times to make sure the timing was consistent. There's probably some overhead since both were consistently higher than running normally, but they still have the ~0.3 second delta.
# 1.812 total
time sudo flamegraph ./minimal file 1
# 1.542 total
time sudo flamegraph ffmpeg -threads 1 -i file -map 0:a:0 -f null - 2>&1
Here are the flamegraphs stacked up, scaled so that the faster one is only 85% as wide as the slower one.
The interesting thing that stands out to me is how much time is spent on read in the minimal example vs. ffmpeg.
The time spent on lseek is also a lot longer in the minimal program – it's plainly visible in that flamegraph, but in the ffmpeg flamegraph, lseek is a single pixel wide.
What's causing this discrepancy? Is ffmpeg actually doing less work than I think it is here? Is the minimal code doing something naive? Is there some buffering or other I/O optimizations that ffmpeg has enabled?
How can I shave 0.3 seconds off of the minimal example's runtime?
The difference is that ffmpeg, when run with the -map flag, explicitly sets the AVDISCARD_ALL flag on the streams that are going to be ignored. The packets for those streams still get read from disk, but with this flag set they are discarded inside the demuxer and never returned by av_read_frame (with the mov demuxer, at least).
In the example code, by contrast, this while loop receives every packet from every stream, and only drops the packets after they've been (wastefully) passed through av_read_frame.
/* read frames from the specified stream */
while (av_read_frame(fmt_ctx, pkt) >= 0) {
if (pkt->stream_index == stream_idx)
ret = decode_packet(dec_ctx, pkt, frame);
av_packet_unref(pkt);
if (ret < 0)
break;
}
I changed the program to set the discard flag on the unused streams:
// ...
/* open input file, and allocate format context */
avformat_open_input(&fmt_ctx, argv[1], NULL, NULL);
/* get the stream */
AVStream *st = fmt_ctx->streams[stream_idx];
/* discard packets from other streams */
for(int i = 0; i < fmt_ctx->nb_streams; i++) {
fmt_ctx->streams[i]->discard = AVDISCARD_ALL;
}
st->discard = AVDISCARD_DEFAULT;
// ...
With that change in place, it gives about a ~1.8x speedup on the same test file, after the cache is warmed up.
Minimal example, without discard 1.593s
ffmpeg with -map 0:a:0 1.404s
Minimal example, with discard 0.898s

Extract raw I frame image data from MPEG-2 Transport Stream (H.264 - Annex B) byte stream

Context
I'm attempting to extract raw image data for each I-frame from an MPEG-2 Transport Stream with an H.264 Annex B codec. This video contains I-frames at every 2 second interval. I've read that an I-frame can be found after a NALu start code with a type of 5 (e.g. coded slice of an IDR picture). The byte payload of these NALu's contains all the necessary data to construct a full frame, albeit, to my understanding, in an H.264 encoded format.
I would like to build a solution to extract these I-frame from an incoming byte stream, by finding NALu's that contain I-frames, saving the payload and decoding the payload into some ubiquitous raw image format to access pixel data etc.
Note: I would like to avoid using filesystem dependency binaries like ffmpeg if possible and more importantly if feasible!
PoC
So far I have built a PoC in Rust to find the byte offset and byte size of I-frames:
use std::fs::File;
use std::io::{prelude::*, BufReader};
extern crate image;
fn main() {
let file = File::open("vodpart-0.ts").unwrap();
let reader = BufReader::new(file);
let mut idr_payload = Vec::<u8>::new();
let mut total_idr_frame_count = 0;
let mut is_idr_payload = false;
let mut is_nalu_type_code = false;
let mut start_code_vec = Vec::<u8>::new();
for (pos, byte_result) in reader.bytes().enumerate() {
let byte = byte_result.unwrap();
if is_nalu_type_code {
is_idr_payload = false;
is_nalu_type_code = false;
start_code_vec.clear();
if byte == 101 { // 0x65: forbidden_zero_bit 0, nal_ref_idc 3, nal_unit_type 5 (IDR slice)
is_idr_payload = true;
total_idr_frame_count += 1;
println!("Found IDR picture at byte offset {}", pos);
}
continue;
}
if is_idr_payload {
idr_payload.push(byte);
}
if byte == 0 {
start_code_vec.push(byte);
continue;
}
if byte == 1 && start_code_vec.len() >= 2 {
if is_idr_payload {
let payload = idr_payload.len() - start_code_vec.len() + 1;
println!("Previous NALu payload is {} bytes long\n", payload);
save_image(&idr_payload.as_slice(), total_idr_frame_count);
idr_payload.clear();
}
is_nalu_type_code = true;
continue;
}
start_code_vec.clear();
}
println!();
println!("total i frame count: {}", total_idr_frame_count);
println!();
println!("done!");
}
fn save_image(buffer: &[u8], index: u16) {
let image_name = format!("image-{}.jpg", index);
image::save_buffer(image_name, buffer, 858, 480, image::ColorType::Rgb8).unwrap()
}
The result of which looks like:
Found IDR picture at byte offset 870
Previous NALu payload is 202929 bytes long
Found IDR picture at byte offset 1699826
Previous NALu payload is 185069 bytes long
Found IDR picture at byte offset 3268686
Previous NALu payload is 145218 bytes long
Found IDR picture at byte offset 4898270
Previous NALu payload is 106114 bytes long
Found IDR picture at byte offset 6482358
Previous NALu payload is 185638 bytes long
total i frame count: 5
done!
This is correct: based on my research using H.264 bit stream viewers etc., there are definitely 5 I-frames at those byte offsets!
The issue is that I don't understand how to convert from the H.264 bytestream payload to the raw RGB image data format. The resulting images, once converted to jpg, are just a fuzzy mess that takes up roughly 10% of the image area.
Questions
Is there a decoding step that needs to be performed?
Am I approaching this correctly and is this feasible to attempt myself, or should I be relying on another lib?
Any help would be greatly appreciated!
"Is there a decoding step that needs to be performed?"
Yes. And writing a decoder from scratch is EXTREMELY complicated. The document that describes it (ISO 14496-10) is over 750 pages long. You should use a library. Libavcodec from the ffmpeg project is really your only option. (Unless you only need baseline profile, in which case you can use the open source decoder from Android.)
You can compile a custom version of libavcodec to exclude things you don’t need.
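A minimal sketch of what decoding with libavcodec looks like (untested; assumes FFmpeg 4.x and omits error handling and the final parser flush). The parser splits the Annex B byte stream into packets, so you don't need to hunt for start codes yourself:
#include <libavcodec/avcodec.h>
void decode_annexb(const uint8_t *buf, int size)
{
    const AVCodec *codec = avcodec_find_decoder(AV_CODEC_ID_H264);
    AVCodecParserContext *parser = av_parser_init(codec->id);
    AVCodecContext *ctx = avcodec_alloc_context3(codec);
    avcodec_open2(ctx, codec, NULL);
    AVPacket *pkt = av_packet_alloc();
    AVFrame *frame = av_frame_alloc();
    while (size > 0) {
        // Let the parser find NALu boundaries and emit whole packets
        int used = av_parser_parse2(parser, ctx, &pkt->data, &pkt->size,
                                    buf, size, AV_NOPTS_VALUE, AV_NOPTS_VALUE, 0);
        buf += used;
        size -= used;
        if (pkt->size == 0)
            continue;
        avcodec_send_packet(ctx, pkt);
        while (avcodec_receive_frame(ctx, frame) >= 0) {
            // frame->data[0..2] hold the Y, U and V planes (yuv420p);
            // convert with libswscale before saving as RGB/JPEG
        }
    }
    av_frame_free(&frame);
    av_packet_free(&pkt);
    avcodec_free_context(&ctx);
    av_parser_close(parser);
}
Note that the decoded frames come out as planar YUV, not RGB: the payload you extracted is still compressed H.264, and even after decoding it needs a libswscale conversion before it can be saved with something like image::save_buffer.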

FFmpeg transcoded sound (AAC) stops after half video time

I have a strange problem in my C/C++ FFmpeg transcoder, which takes an input MP4 (varying input codecs) and produces an output MP4 (x264 baseline & AAC LC at a 44100 sample rate with libfdk_aac):
The resulting mp4 video has fine images (x264) and the audio (AAC LC) works fine as well, but it only plays until exactly half of the video.
The audio is not slowed down, not stretched and doesn't stutter. It just stops right in the middle of the video.
One hint may be that the input file has a sample rate of 22050, and 22050/44100 is 0.5, but I really don't get why this would make the sound just stop after half the time. I'd expect such an error to lead to the sound playing at the wrong speed. Everything works just fine if I don't try to enforce 44100 and instead just use the incoming sample_rate.
Another guess would be that the pts calculation doesn't work. But the audio sounds just fine (until it stops) and I do exactly the same for the video part, where it works flawlessly. "Exactly", as in the same code, but "audio"-variables replaced with "video"-variables.
FFmpeg reports no errors during the whole process. I also flush the decoders/encoders/interleaved writing after all the packet reading from the input is done. It works well for the video so I doubt there is much wrong with my general approach.
Here are the functions of my code (stripped off the error handling & other class stuff):
AudioCodecContext Setup
outContext->_audioCodec = avcodec_find_encoder(outContext->_audioTargetCodecID);
outContext->_audioStream =
avformat_new_stream(outContext->_formatContext, outContext->_audioCodec);
outContext->_audioCodecContext = outContext->_audioStream->codec;
outContext->_audioCodecContext->channels = 2;
outContext->_audioCodecContext->channel_layout = av_get_default_channel_layout(2);
outContext->_audioCodecContext->sample_rate = 44100;
outContext->_audioCodecContext->sample_fmt = outContext->_audioCodec->sample_fmts[0];
outContext->_audioCodecContext->bit_rate = 128000;
outContext->_audioCodecContext->strict_std_compliance = FF_COMPLIANCE_EXPERIMENTAL;
outContext->_audioCodecContext->time_base =
(AVRational){1, outContext->_audioCodecContext->sample_rate};
outContext->_audioStream->time_base = (AVRational){1, outContext->_audioCodecContext->sample_rate};
int retVal = avcodec_open2(outContext->_audioCodecContext, outContext->_audioCodec, NULL);
Resampler Setup
outContext->_audioResamplerContext =
swr_alloc_set_opts( NULL, outContext->_audioCodecContext->channel_layout,
outContext->_audioCodecContext->sample_fmt,
outContext->_audioCodecContext->sample_rate,
_inputContext._audioCodecContext->channel_layout,
_inputContext._audioCodecContext->sample_fmt,
_inputContext._audioCodecContext->sample_rate,
0, NULL);
int retVal = swr_init(outContext->_audioResamplerContext);
Decoding
decodedBytes = avcodec_decode_audio4( _inputContext._audioCodecContext,
_inputContext._audioTempFrame,
&p_gotAudioFrame, &_inputContext._currentPacket);
Converting (only if decoding produced a frame, of course)
int retVal = swr_convert( outContext->_audioResamplerContext,
outContext->_audioConvertedFrame->data,
outContext->_audioConvertedFrame->nb_samples,
(const uint8_t**)_inputContext._audioTempFrame->data,
_inputContext._audioTempFrame->nb_samples);
Encoding (only if decoding produced a frame, of course)
outContext->_audioConvertedFrame->pts =
av_frame_get_best_effort_timestamp(_inputContext._audioTempFrame);
// Init the new packet
av_init_packet(&outContext->_audioPacket);
outContext->_audioPacket.data = NULL;
outContext->_audioPacket.size = 0;
// Encode
int retVal = avcodec_encode_audio2( outContext->_audioCodecContext,
&outContext->_audioPacket,
outContext->_audioConvertedFrame,
&p_gotPacket);
// Set pts/dts time stamps for writing interleaved
av_packet_rescale_ts( &outContext->_audioPacket,
outContext->_audioCodecContext->time_base,
outContext->_audioStream->time_base);
outContext->_audioPacket.stream_index = outContext->_audioStream->index;
Writing (only if encoding produced a packet, of course)
int retVal = av_interleaved_write_frame(outContext->_formatContext, &outContext->_audioPacket);
I am quite out of ideas about what would cause such a behaviour.
So, I finally managed to figure things out myself.
The problem was indeed in the difference of the sample_rate.
You'd assume that a call to swr_convert() would give you all the samples you need for converting the audio frame when called like I did.
Of course, that would be too easy.
Instead, you need to call swr_convert (potentially) multiple times per frame and buffer its output, if required. Then you need to grab a single frame from the buffer and that is what you will have to encode.
Here is my new convertAudioFrame function:
// Calculate number of output samples
int numOutputSamples = av_rescale_rnd(
swr_get_delay(outContext->_audioResamplerContext, _inputContext._audioCodecContext->sample_rate)
+ _inputContext._audioTempFrame->nb_samples,
outContext->_audioCodecContext->sample_rate,
_inputContext._audioCodecContext->sample_rate,
AV_ROUND_UP);
if (numOutputSamples == 0)
{
return;
}
uint8_t* tempSamples;
av_samples_alloc( &tempSamples, NULL,
outContext->_audioCodecContext->channels, numOutputSamples,
outContext->_audioCodecContext->sample_fmt, 0);
int retVal = swr_convert( outContext->_audioResamplerContext,
&tempSamples,
numOutputSamples,
(const uint8_t**)_inputContext._audioTempFrame->data,
_inputContext._audioTempFrame->nb_samples);
// Write to audio fifo
if (retVal > 0)
{
retVal = av_audio_fifo_write(outContext->_audioFifo, (void**)&tempSamples, retVal);
}
av_freep(&tempSamples);
// Get a frame from audio fifo
int samplesAvailable = av_audio_fifo_size(outContext->_audioFifo);
if (samplesAvailable > 0)
{
retVal = av_audio_fifo_read(outContext->_audioFifo,
(void**)outContext->_audioConvertedFrame->data,
outContext->_audioCodecContext->frame_size);
// We got a frame, so also set its pts
if (retVal > 0)
{
p_gotConvertedFrame = 1;
if (_inputContext._audioTempFrame->pts != AV_NOPTS_VALUE)
{
outContext->_audioConvertedFrame->pts = _inputContext._audioTempFrame->pts;
}
else if (_inputContext._audioTempFrame->pkt_pts != AV_NOPTS_VALUE)
{
outContext->_audioConvertedFrame->pts = _inputContext._audioTempFrame->pkt_pts;
}
}
}
I basically call this function until there are no more frames in the audio fifo buffer.
So, the audio was only half as long because I only encoded as many frames as I decoded, when I actually needed to encode twice as many frames due to twice the sample_rate.
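For completeness, a rough sketch of that outer drain loop (same variable names as above; pts bookkeeping elided):
// After the last input packet has been decoded and converted:
while (av_audio_fifo_size(outContext->_audioFifo) >= outContext->_audioCodecContext->frame_size)
{
    av_audio_fifo_read(outContext->_audioFifo,
                       (void**)outContext->_audioConvertedFrame->data,
                       outContext->_audioCodecContext->frame_size);
    // set the frame's pts, then encode and write exactly as in the
    // Encoding and Writing snippets from the question
}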

Is it possible to decode MPEG4 frames without delay with ffmpeg?

I use ffmpeg's MPEG4 decoder. The decoder has the CODEC_CAP_DELAY capability, among others. It means the decoder will give me decoded frames with a latency of 1 frame.
I have a set of MPEG4 (I- and P-) frames from an AVI file and feed the ffmpeg decoder with these frames. For the very first I-frame the decoder gives me nothing, but decodes the frames successfully. I can force the decoder to give me the decoded frame with a second call of avcodec_decode_video2, providing nulls (flushing it), but if I do so for each frame I get artifacts for the first group of pictures (e.g. the second decoded P-frame is of gray color).
If I do not force the ffmpeg decoder to give me the decoded frame right away, then it works flawlessly and without artifacts.
Question: is it possible to get the decoded frame without giving the decoder the next frame, and without artifacts?
Small example of how decoding is implemented for each frame:
// decode
int got_frame = 0;
int err = 0;
int tries = 5;
do
{
err = avcodec_decode_video2(m_CodecContext, m_Frame, &got_frame, &m_Packet);
/* some codecs, such as MPEG, transmit the I and P frame with a
latency of one frame. You must do the following to have a
chance to get the last frame of the video */
m_Packet.data = NULL;
m_Packet.size = 0;
--tries;
}
while (err >= 0 && got_frame == 0 && tries > 0);
But as I said, that gave me artifacts for the first GOP.
Use the "-flags +low_delay" option (or in code, set AVCodecContext.flags |= CODEC_FLAG_LOW_DELAY).
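In code, that means setting the flag on the codec context before it is opened (CODEC_FLAG_LOW_DELAY was renamed AV_CODEC_FLAG_LOW_DELAY in newer FFmpeg versions):
m_CodecContext->flags |= CODEC_FLAG_LOW_DELAY; // must be set before avcodec_open2()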
I tested several options, and "-flags low_delay" and "-probesize 32" are more important than the others. The code below worked for me.
AVDictionary* avDic = nullptr;
av_dict_set(&avDic, "flags", "low_delay", 0);
av_dict_set(&avDic, "probesize", "32", 0);
const int errorCode = avformat_open_input(&pFormatCtx, mUrl.c_str(), nullptr, &avDic);

Reading RTSP stream with FFMpeg library - how to use avcodec_open2?

While trying to read an rtsp stream I run into some problems, with the code and documentation alike. Short description: whatever I do, avcodec_open2 either fails (saying "codec type or id mismatches") or the width and height of the codec context after the call are 0 (thus making further code useless). The stream itself can be opened normally by VLC player, and av_dump_format() displays correct info. My code is based on the technique from an answer to a similar question.
Long description: my code is in C#, but here is the C++ equivalent of the FFMpeg calls (I actually reduced my code to this minimum and the problem persists):
av_register_all();
avformat_network_init(); //return code ignored
AVFormatContext* formatContext = avformat_alloc_context();
if (avformat_open_input(&formatContext, stream_path, null, null) != 0) {
return;
}
if (avformat_find_stream_info(formatContext, null) < 0) {
return;
}
int videoStreamIndex = 0;
for (int i = 0; i < formatContext->nb_streams; ++i) {
AVStream* s = formatContext->streams[i];
if (s->codec == null) continue;
AVCodecContext c = *(s->codec);
if (c.codec_type == AVMEDIA_TYPE_VIDEO) videoStreamIndex = i;
}
//start reading packets from stream and write them to file
//av_read_play(formatContext); //return code ignored
//this call would print "method PLAY failed: 455 Method Not Valid in This State"
//seems to be the case that for rtsp stream it isn't needed
AVCodec* codec = null;
codec = avcodec_find_decoder(AV_CODEC_ID_H264);
if (codec == null) {
return;
}
AVCodecContext* codecContext = avcodec_alloc_context3(null);
avcodec_get_context_defaults3(codecContext, codec);//return code ignored
avcodec_copy_context(codecContext, formatContext->streams[videoStreamIndex]->codec); //return code ignored
av_dump_format(formatContext, videoStreamIndex, stream_path, 0);
if (avcodec_open2(codecContext, codec, null) < 0) {
return;
}
The code actually uses the DLL version of the FFMpeg library; avcodec-55.dll and avformat-55.dll are used.
The documentation says something weird about which calls can be made in which succession (that copy_context should be called before get_context_defaults); the current code is kept as close as possible to the version from the answer mentioned above. As written, it results in a non-zero return from avcodec_open2 with a "codec type or id mismatches" message. Changing the order does little good: now avcodec_open2 executes successfully, but both codecContext->width and codecContext->height are 0 afterwards.
Also, the documentation doesn't mention what the default value for the third argument of avcodec_open2 should be, but the source code seems to take into account that options can be NULL.
Output of av_dump_format is as follows:
Input #0, rtsp, from 'rtsp://xx.xx.xx.xx:xx/video.pro1':
Metadata:
title : QStream
comment : QStreaming Media
Duration: N/A, start: 0.000000, bitrate: 64 kb/s
Stream #0:0: Video: h264 (Baseline), yuvj420p(pc), 1920x1080, 30 fps, 25 tbr, 90k tbn, 60 tbc
Stream #0:1: Audio: pcm_mulaw, 8000 Hz, 1 channels, s16, 64 kb/s
First, what does av_dump_format show? Are you sure your video stream's codec is h264? You try to open the codec as if it were H264.
In order to open any codec, change your avcodec_find_decoder to pass it the source codec id:
codec = avcodec_find_decoder(formatContext->streams[videoStreamIndex]->codec->codec_id);
By the way (forget this one if you do not use the C++ code but stick with C#): you do not need to make a copy of the initial AVCodecContext when you are looking for the video stream. You can do the following (note that you may want to keep a pointer to the initial codec context, see below):
AVCodecContext* c = s->codec;
if (c->codec_type == AVMEDIA_TYPE_VIDEO) {
videoStreamIndex = i;
initialVideoCodecCtx = c;
}
Next point, not really relevant in this case: instead of looping through all the streams, FFmpeg has a helper function for it:
int videoStreamIndex = av_find_best_stream(formatContext, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);
Last point: I think only the first point should do the trick to make avcodec_open2 work, but you might still not be able to decode your stream. You opened the codec for the new codec context, but no codec is opened for the initial context. Why did you make a copy of the initial codec context? It is useful if you want to record your stream to another file (i.e. transcode), but if you only want to decode your stream, it is much easier to use the initial context, and to use it instead of the new one as a parameter for avcodec_decode_video2.
To sum it up, replace your code after avformat_find_stream_info by (warning: no error check):
int videoStreamIndex = av_find_best_stream(formatContext, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);
AVCodecContext* codecCtx = formatContext->streams[videoStreamIndex]->codec;
AVCodec* codec = avcodec_find_decoder(codecCtx->codec_id);
// tune codecCtx if you want special decoding options. See FFmpeg docs for a list of members
if (avcodec_open2(codecCtx, codec, null) < 0) {
return;
}
// use av_read_frame(formatContext, ...) to read packets
// use avcodec_decode_video2(codecCtx, ...) to decode packets
If avcodec_open2 does not fail and you still see width and height being 0, this might be expected. Notice that the stream (frame) dimensions are not always known until you actually start decoding.
You should use the AVFrame values in order to initialize your decoding buffers, after your first avcodec_decode_video2 decoding call.
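For example (a sketch using the same avcodec-55-era API as the rest of this answer; error checks omitted):
AVPacket pkt;
AVFrame* frame = av_frame_alloc();
int got_frame = 0;
while (av_read_frame(formatContext, &pkt) >= 0) {
    if (pkt.stream_index == videoStreamIndex)
        avcodec_decode_video2(codecCtx, frame, &got_frame, &pkt);
    av_free_packet(&pkt);
    if (got_frame) {
        // frame->width and frame->height are reliable here, even if the
        // codec context reported 0x0 before decoding started
        break;
    }
}
Size your decoding buffers from frame->width and frame->height rather than from the pre-decode codec context.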
