I want to read the audio out of a video file as quickly as possible, using the libav libraries. It all works, but it seems like it could be faster.
To get a performance baseline, I ran this ffmpeg command and timed it:
time ffmpeg -threads 1 -i file -map 0:a:0 -f null -
On a test file (a 2.5 GB, 2-hour .MOV with pcm_s16be audio) this comes out to about 1.35 seconds on my M1 MacBook Pro.
On the other hand, this minimal C code (based on FFmpeg's "Demuxing and decoding" example) is consistently around 0.3 seconds slower.
#include <stdlib.h>

#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
static int decode_packet(AVCodecContext *dec, const AVPacket *pkt, AVFrame *frame)
{
    /* submit the packet to the decoder (a NULL packet flushes it) */
    int ret = avcodec_send_packet(dec, pkt);
    if (ret < 0)
        return ret;

    /* drain all the frames the decoder has ready */
    while (ret >= 0) {
        ret = avcodec_receive_frame(dec, frame);
        if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
            return 0;
        if (ret < 0)
            return ret;
        av_frame_unref(frame);
    }
    return 0;
}
int main(int argc, char **argv)
{
    int ret = 0;
    AVFormatContext *fmt_ctx = NULL;
    AVCodecContext *dec_ctx = NULL;
    AVFrame *frame = NULL;
    AVPacket *pkt = NULL;

    if (argc != 3) {
        exit(1);
    }
    int stream_idx = atoi(argv[2]);

    /* open input file, and allocate format context */
    avformat_open_input(&fmt_ctx, argv[1], NULL, NULL);

    /* get the stream */
    AVStream *st = fmt_ctx->streams[stream_idx];

    /* find a decoder for the stream */
    const AVCodec *dec = avcodec_find_decoder(st->codecpar->codec_id);

    /* allocate a codec context for the decoder */
    dec_ctx = avcodec_alloc_context3(dec);

    /* copy codec parameters from the input stream to the codec context */
    avcodec_parameters_to_context(dec_ctx, st->codecpar);

    /* init the decoder */
    avcodec_open2(dec_ctx, dec, NULL);

    /* allocate frame and packet structs */
    frame = av_frame_alloc();
    pkt = av_packet_alloc();

    /* read frames from the specified stream */
    while (av_read_frame(fmt_ctx, pkt) >= 0) {
        if (pkt->stream_index == stream_idx)
            ret = decode_packet(dec_ctx, pkt, frame);
        av_packet_unref(pkt);
        if (ret < 0)
            break;
    }

    /* flush the decoder */
    decode_packet(dec_ctx, NULL, frame);

    return ret < 0;
}
I tried measuring parts of this program to see if it was spending a lot of time in the setup, but it's not – at least 1.5 seconds of the runtime is the loop where it's reading frames.
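(For reference, one way to time just that loop is with av_gettime_relative() from libavutil/time.h; a sketch:)
#include <libavutil/time.h>  /* av_gettime_relative() */
#include <stdio.h>

int64_t t0 = av_gettime_relative();
while (av_read_frame(fmt_ctx, pkt) >= 0) {
    /* ... same loop body as above ... */
}
printf("read/decode loop: %.3f s\n", (av_gettime_relative() - t0) / 1e6);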
So I took some flamegraph recordings (using cargo-flamegraph) and ran each a few times to make sure the timing was consistent. There's probably some overhead since both were consistently higher than running normally, but they still have the ~0.3 second delta.
# 1.812 total
time sudo flamegraph ./minimal file 1
# 1.542 total
time sudo flamegraph ffmpeg -threads 1 -i file -map 0:a:0 -f null - 2>&1
Here are the flamegraphs stacked up, scaled so that the faster one is only 85% as wide as the slower one.
The thing that stands out to me is how much time is spent in read in the minimal example vs. ffmpeg.
The minimal program also spends far longer in lseek – it's plainly visible in its flamegraph, while in the ffmpeg flamegraph lseek is only a single pixel wide.
What's causing this discrepancy? Is ffmpeg actually doing less work than I think it is here? Is the minimal code doing something naive? Is there some buffering or other I/O optimizations that ffmpeg has enabled?
How can I shave 0.3 seconds off of the minimal example's runtime?
The difference is that ffmpeg, when run with the -map flag, explicitly sets discard = AVDISCARD_ALL on the streams that aren't selected. The packets for those streams still get read from disk, but with this flag set they are never returned by av_read_frame (with the mov demuxer, at least).
In the example code, by contrast, this while loop receives every packet from every stream and only drops the unwanted ones after they have already (wastefully) passed through av_read_frame.
/* read frames from the specified stream */
while (av_read_frame(fmt_ctx, pkt) >= 0) {
    if (pkt->stream_index == stream_idx)
        ret = decode_packet(dec_ctx, pkt, frame);
    av_packet_unref(pkt);
    if (ret < 0)
        break;
}
I changed the program to set the discard flag on the unused streams:
// ...

/* open input file, and allocate format context */
avformat_open_input(&fmt_ctx, argv[1], NULL, NULL);

/* get the stream */
AVStream *st = fmt_ctx->streams[stream_idx];

/* discard packets from all other streams */
for (unsigned int i = 0; i < fmt_ctx->nb_streams; i++) {
    fmt_ctx->streams[i]->discard = AVDISCARD_ALL;
}
st->discard = AVDISCARD_DEFAULT;

// ...
With that change in place, I get roughly a 1.8x speedup on the same test file, once the cache is warmed up.
Minimal example, without discard: 1.593s
ffmpeg with -map 0:a:0: 1.404s
Minimal example, with discard: 0.898s
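As an aside, if you don't want to pass the stream index in on the command line, av_find_best_stream() can pick the audio stream for you (similar in spirit to -map 0:a:0 – it returns the "best" audio stream, which is usually the first). A sketch using the same fmt_ctx as above:
/* let libavformat pick the audio stream, then discard everything else */
int stream_idx = av_find_best_stream(fmt_ctx, AVMEDIA_TYPE_AUDIO, -1, -1, NULL, 0);
if (stream_idx < 0)
    exit(1); /* no audio stream found */
for (unsigned int i = 0; i < fmt_ctx->nb_streams; i++)
    fmt_ctx->streams[i]->discard = AVDISCARD_ALL;
fmt_ctx->streams[stream_idx]->discard = AVDISCARD_DEFAULT;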
Related
On my existing PC/IP camera combination, the command line
ffmpeg -i rtsp://abcd:123456@1.2.3.4/rtspvideostream /home/pc/video.avi
correctly writes the video stream to file and uses approximately 30% of my CPU.
The command line
ffmpeg -i rtsp://abcd:123456@1.2.3.4/rtspvideostream -vcodec copy /home/pc/video.avi
uses approximately 3% of my CPU for seemingly the same result. I assume that dropping the codec-related work accounts for this CPU saving.
Using the following standard ffmpeg initialisation:
AVFormatContext *pFormatCtx = NULL;
AVCodecContext *pCodecCtx;
AVCodec *pCodec;
AVDictionary *opts = NULL;

av_dict_set(&opts, "rtsp_transport", "tcp", 0);
avformat_open_input(&pFormatCtx, "rtsp://abcd:123456@1.2.3.4/rtspvideostream", NULL, &opts);
avformat_find_stream_info(pFormatCtx, NULL);

int videoStream = -1;
for (int i = 0; i < (int)pFormatCtx->nb_streams; i++)
{
    if (pFormatCtx->streams[i]->codec->codec_type == AVMEDIA_TYPE_VIDEO)
    {
        videoStream = i;
        break;
    }
}

pCodecCtx = pFormatCtx->streams[videoStream]->codec;
pCodec = avcodec_find_decoder(pCodecCtx->codec_id);
avcodec_open2(pCodecCtx, pCodec, NULL);

AVFrame *pFrame = av_frame_alloc();
int numBytes = avpicture_get_size(AV_PIX_FMT_YUV420P, IMAGEWIDTH, IMAGEHEIGHT);
uint8_t *buffer12 = (uint8_t*)av_malloc(numBytes*sizeof(uint8_t));
avpicture_fill((AVPicture *)pFrame, buffer12, AV_PIX_FMT_YUV420P, IMAGEWIDTH, IMAGEHEIGHT);
with the standard reading implementation:
AVPacket packet;
int frameFinished = 0;
while (frameFinished == 0)
{
    av_read_frame(pFormatCtx, &packet);
    if (packet.stream_index == videoStream)
    {
        avcodec_decode_video2(pCodecCtx, pFrame, &frameFinished, &packet);
    }
    if (packet.duration) av_free_packet(&packet);
    packet.duration = 0;
}
correctly gets the video stream and uses approximately 30% of CPU.
In the command-line ffmpeg implementation, the addition of the parameters '-vcodec copy' dramatically decreases CPU usage. I am unable to reproduce a similar drop in CPU usage for the above coding implementation.
Assuming it is possible, how do I do it?
The -vcodec copy option means the video stream is copied as-is (it is not decoded and re-encoded), so the CPU usage is essentially pure I/O. To do the same thing, remove all the codec-opening and decoding (avcodec_decode_video2) from your code and just write each packet directly to the output stream.
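For illustration, here's a rough sketch of such a stream-copy ("remux") loop, using the newer AVCodecParameters/av_packet_alloc API rather than the deprecated stream->codec field; error handling is omitted and the URL/path are taken from the question:
/* sketch: copy the video stream from the RTSP input to a file without decoding */
AVFormatContext *in_ctx = NULL, *out_ctx = NULL;
avformat_open_input(&in_ctx, "rtsp://abcd:123456@1.2.3.4/rtspvideostream", NULL, NULL);
avformat_find_stream_info(in_ctx, NULL);

int video_idx = av_find_best_stream(in_ctx, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);
AVStream *in_st = in_ctx->streams[video_idx];

avformat_alloc_output_context2(&out_ctx, NULL, NULL, "/home/pc/video.avi");
AVStream *out_st = avformat_new_stream(out_ctx, NULL);
avcodec_parameters_copy(out_st->codecpar, in_st->codecpar);  /* no encoder or decoder involved */

avio_open(&out_ctx->pb, "/home/pc/video.avi", AVIO_FLAG_WRITE);
avformat_write_header(out_ctx, NULL);

AVPacket *pkt = av_packet_alloc();
while (av_read_frame(in_ctx, pkt) >= 0) {
    if (pkt->stream_index == video_idx) {
        pkt->stream_index = out_st->index;
        /* rescale timestamps from the input stream's time base to the output's */
        av_packet_rescale_ts(pkt, in_st->time_base, out_st->time_base);
        av_interleaved_write_frame(out_ctx, pkt);
    }
    av_packet_unref(pkt);
}

av_write_trailer(out_ctx);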
I am currently updating our FFmpeg library usage from a pretty old version (0.5) to 2.8. As part of the change, I replaced avcodec_decode_video with avcodec_decode_video2. However, I am noticing quite a difference in how avcodec_decode_video2 behaves compared to the old avcodec_decode_video: for the same packet (same data), avcodec_decode_video2 sets got_picture_ptr to zero, whereas the old avcodec_decode_video returned a non-zero value.
In the example described here, I am decoding an FLV file with video codec H264-MPEG-4 AVC (part 10) and audio codec MPEG AAC Audio. (I am attaching part of a hex dump of the FLV file as FLV_Sample.Hex; the original FLV file is too large.) For the first AVPacket (obtained from av_read_frame), got_picture_ptr from avcodec_decode_video2 is zero, but the old avcodec_decode_video gives 296. (I am attaching the full AVPacket data and the outputs of the two functions in FFMPEG_Decoding_Packet_Info.txt.) Continuing on, the new avcodec_decode_video2 keeps returning zero until the 23rd packet, where it returns 1, so it is not as if avcodec_decode_video2 always returns zero.
My main dilemma is that I am not sure whether this difference in behaviour is due to changes in avcodec_decode_video2 or to errors I have made in using the decoder. I have put a snippet of the code I use for decoding below. Any suggestions would be helpful.
AVFormatContext *pFormatCtx = NULL;
AVCodecContext *pCodecCtx;
AVCodec *pCodec;
AVFrame *pFrameRGB;

#if FFMPEG_2_8
avformat_open_input(&pFormatCtx, strFileName, NULL, NULL);
#else
av_open_input_file(&pFormatCtx, strFileName, NULL, 0, NULL);
#endif //FFMPEG_2_8

size_t videoStream = pFormatCtx->nb_streams;
bool streamFound = false;
for (size_t i = 0; i < pFormatCtx->nb_streams; i++)
{
#if FFMPEG_2_8
    if (pFormatCtx->streams[i]->codec->codec_type == AVMEDIA_TYPE_VIDEO)
#else
    if (pFormatCtx->streams[i]->codec->codec_type == CODEC_TYPE_VIDEO)
#endif //FFMPEG_2_8
    {
        videoStream = i;
        streamFound = true;
        break;
    }
}

if (streamFound)
{
    pCodecCtx = pFormatCtx->streams[videoStream]->codec;

    // Find the decoder for the video stream
    pCodec = avcodec_find_decoder(pCodecCtx->codec_id);
    if (pCodec == NULL)
        return false; // Codec not found

    // Open codec
#if FFMPEG_2_8
    if (avcodec_open2(pCodecCtx, pCodec, NULL) < 0)
#else
    if (avcodec_open(pCodecCtx, pCodec) < 0)
#endif //FFMPEG_2_8
    {
        return false; // Could not open codec
    }

#if FFMPEG_2_8
    pFrameRGB = av_frame_alloc();
#else
    pFrameRGB = avcodec_alloc_frame();
#endif //FFMPEG_2_8
    if (pFrameRGB == NULL)
        return false; // No memory

    while (true)
    {
        AVPacket packet;
        if (av_read_frame(pFormatCtx, &packet) < 0)
        {
            break;
        }

        int frameFinished = 0;
        if (packet.stream_index == videoStream)
        {
#if FFMPEG_2_8
            avcodec_decode_video2(pCodecCtx, pFrameRGB, &frameFinished, &packet);
#else
            avcodec_decode_video(pCodecCtx, pFrameRGB, &frameFinished, packet.data, packet.size);
#endif //FFMPEG_2_8
        }

        if (frameFinished != 0)
        {
            break;
        }
    }
}
I have almost the same implementation, working on the latest 2.8.1 version. I have no idea about the old version (0.5), but your implementation for the new version seems fine.
One guess about got_picture_ptr, though I'm not sure: decode order and display order are different in H.264. Maybe the earlier version of FFmpeg handed out pictures in decode order rather than display order; in that case you would see a non-zero value for every packet decoded, starting from the first packet.
At some point ffmpeg would have corrected this to hand out pictures in display order. Hence you may not observe a non-zero value from the very first packet decode.
I guess your application is working fine, irrespective of this difference.
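If you do need those delayed pictures at the end of the stream, the usual approach with this API is to keep feeding the decoder empty packets until got_picture_ptr stays zero. A rough sketch, reusing the names from your snippet:
// sketch: after av_read_frame() returns EOF, flush out the decoder's delayed pictures
AVPacket flushPacket;
av_init_packet(&flushPacket);
flushPacket.data = NULL;
flushPacket.size = 0;

int gotPicture = 1;
while (gotPicture)
{
    avcodec_decode_video2(pCodecCtx, pFrameRGB, &gotPicture, &flushPacket);
    if (gotPicture)
    {
        // handle the delayed frame in pFrameRGB here
    }
}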
I have a strange problem in my C/C++ FFmpeg transcoder, which takes an input MP4 (with varying input codecs) and produces an output MP4 (x264 baseline video and AAC-LC audio at a 44100 sample rate via libfdk_aac):
The resulting mp4 video has fine images (x264) and the audio (AAC LC) works fine as well, but is only played until exactly the half of the video.
The audio is not slowed down, not stretched and doesn't stutter. It just stops right in the middle of the video.
One hint may be that the input file has a sample rate of 22050, and 22050/44100 is 0.5, but I really don't get why this would make the sound just stop halfway through. I'd expect such an error to lead to the sound playing at the wrong speed. Everything works just fine if I don't try to enforce 44100 and instead just use the incoming sample_rate.
Another guess would be that the pts calculation doesn't work. But the audio sounds just fine (until it stops) and I do exactly the same for the video part, where it works flawlessly. "Exactly", as in the same code, but "audio"-variables replaced with "video"-variables.
FFmpeg reports no errors during the whole process. I also flush the decoders/encoders/interleaved writing after all the packet reading from the input is done. It works well for the video, so I doubt there is much wrong with my general approach.
Here are the functions of my code (stripped off the error handling & other class stuff):
AudioCodecContext Setup
outContext->_audioCodec = avcodec_find_encoder(outContext->_audioTargetCodecID);
outContext->_audioStream =
avformat_new_stream(outContext->_formatContext, outContext->_audioCodec);
outContext->_audioCodecContext = outContext->_audioStream->codec;
outContext->_audioCodecContext->channels = 2;
outContext->_audioCodecContext->channel_layout = av_get_default_channel_layout(2);
outContext->_audioCodecContext->sample_rate = 44100;
outContext->_audioCodecContext->sample_fmt = outContext->_audioCodec->sample_fmts[0];
outContext->_audioCodecContext->bit_rate = 128000;
outContext->_audioCodecContext->strict_std_compliance = FF_COMPLIANCE_EXPERIMENTAL;
outContext->_audioCodecContext->time_base =
(AVRational){1, outContext->_audioCodecContext->sample_rate};
outContext->_audioStream->time_base = (AVRational){1, outContext->_audioCodecContext->sample_rate};
int retVal = avcodec_open2(outContext->_audioCodecContext, outContext->_audioCodec, NULL);
Resampler Setup
outContext->_audioResamplerContext =
swr_alloc_set_opts( NULL, outContext->_audioCodecContext->channel_layout,
outContext->_audioCodecContext->sample_fmt,
outContext->_audioCodecContext->sample_rate,
_inputContext._audioCodecContext->channel_layout,
_inputContext._audioCodecContext->sample_fmt,
_inputContext._audioCodecContext->sample_rate,
0, NULL);
int retVal = swr_init(outContext->_audioResamplerContext);
Decoding
decodedBytes = avcodec_decode_audio4( _inputContext._audioCodecContext,
_inputContext._audioTempFrame,
&p_gotAudioFrame, &_inputContext._currentPacket);
Converting (only if decoding produced a frame, of course)
int retVal = swr_convert( outContext->_audioResamplerContext,
outContext->_audioConvertedFrame->data,
outContext->_audioConvertedFrame->nb_samples,
(const uint8_t**)_inputContext._audioTempFrame->data,
_inputContext._audioTempFrame->nb_samples);
Encoding (only if decoding produced a frame, of course)
outContext->_audioConvertedFrame->pts =
av_frame_get_best_effort_timestamp(_inputContext._audioTempFrame);
// Init the new packet
av_init_packet(&outContext->_audioPacket);
outContext->_audioPacket.data = NULL;
outContext->_audioPacket.size = 0;
// Encode
int retVal = avcodec_encode_audio2( outContext->_audioCodecContext,
&outContext->_audioPacket,
outContext->_audioConvertedFrame,
&p_gotPacket);
// Set pts/dts time stamps for writing interleaved
av_packet_rescale_ts( &outContext->_audioPacket,
outContext->_audioCodecContext->time_base,
outContext->_audioStream->time_base);
outContext->_audioPacket.stream_index = outContext->_audioStream->index;
Writing (only if encoding produced a packet, of course)
int retVal = av_interleaved_write_frame(outContext->_formatContext, &outContext->_audioPacket);
I am quite out of ideas about what would cause such a behaviour.
So, I finally managed to figure things out myself.
The problem was indeed in the difference of the sample_rate.
You'd assume that a call to swr_convert() would give you all the samples you need for converting the audio frame when called like I did.
Of course, that would be too easy.
Instead, you may need to call swr_convert multiple times per frame and buffer its output. Then you grab a single frame's worth of samples from that buffer, and that is what you encode.
Here is my new convertAudioFrame function:
// Calculate the number of output samples the resampler will produce
int numOutputSamples = av_rescale_rnd(
    swr_get_delay(outContext->_audioResamplerContext, _inputContext._audioCodecContext->sample_rate)
        + _inputContext._audioTempFrame->nb_samples,
    outContext->_audioCodecContext->sample_rate,
    _inputContext._audioCodecContext->sample_rate,
    AV_ROUND_UP);
if (numOutputSamples == 0)
{
    return;
}

uint8_t* tempSamples;
av_samples_alloc(&tempSamples, NULL,
                 outContext->_audioCodecContext->channels, numOutputSamples,
                 outContext->_audioCodecContext->sample_fmt, 0);

int retVal = swr_convert(outContext->_audioResamplerContext,
                         &tempSamples,
                         numOutputSamples,
                         (const uint8_t**)_inputContext._audioTempFrame->data,
                         _inputContext._audioTempFrame->nb_samples);

// Write the converted samples to the audio fifo
if (retVal > 0)
{
    retVal = av_audio_fifo_write(outContext->_audioFifo, (void**)&tempSamples, retVal);
}
av_freep(&tempSamples);

// Get a frame's worth of samples from the audio fifo
int samplesAvailable = av_audio_fifo_size(outContext->_audioFifo);
if (samplesAvailable > 0)
{
    retVal = av_audio_fifo_read(outContext->_audioFifo,
                                (void**)outContext->_audioConvertedFrame->data,
                                outContext->_audioCodecContext->frame_size);

    // We got a frame, so also set its pts
    if (retVal > 0)
    {
        p_gotConvertedFrame = 1;
        if (_inputContext._audioTempFrame->pts != AV_NOPTS_VALUE)
        {
            outContext->_audioConvertedFrame->pts = _inputContext._audioTempFrame->pts;
        }
        else if (_inputContext._audioTempFrame->pkt_pts != AV_NOPTS_VALUE)
        {
            outContext->_audioConvertedFrame->pts = _inputContext._audioTempFrame->pkt_pts;
        }
    }
}
I basically call this function until there are no more frames left in the audio fifo buffer.
So the audio was only half as long because I was only encoding as many frames as I decoded, when I actually needed to encode twice as many frames, because the output sample rate is twice the input's.
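For completeness: the fifo used above has to be allocated somewhere during setup, and at end of stream the resampler itself should be drained, since it can still be holding samples. A sketch with the same member names as above; the initial fifo size is arbitrary because av_audio_fifo_write() grows it as needed:
/* setup (once): allocate the fifo for the encoder's sample format and channel count */
outContext->_audioFifo = av_audio_fifo_alloc(
    outContext->_audioCodecContext->sample_fmt,
    outContext->_audioCodecContext->channels,
    outContext->_audioCodecContext->frame_size);

/* end of stream: flush whatever is still buffered inside the resampler
   (a NULL input to swr_convert() drains it), then keep encoding frames
   read from the fifo as before */
int flushCount = (int)swr_get_delay(outContext->_audioResamplerContext,
                                    outContext->_audioCodecContext->sample_rate);
if (flushCount > 0)
{
    uint8_t *flushSamples = NULL;
    av_samples_alloc(&flushSamples, NULL,
                     outContext->_audioCodecContext->channels, flushCount,
                     outContext->_audioCodecContext->sample_fmt, 0);
    int converted = swr_convert(outContext->_audioResamplerContext,
                                &flushSamples, flushCount, NULL, 0);
    if (converted > 0)
        av_audio_fifo_write(outContext->_audioFifo, (void**)&flushSamples, converted);
    av_freep(&flushSamples);
}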
I use ffmpeg's MPEG-4 decoder. The decoder has the CODEC_CAP_DELAY capability, among others, which means it returns decoded frames with a latency of one frame.
I have a set of MPEG-4 (I- and P-) frames from an AVI file and feed the ffmpeg decoder with these frames. For the very first I-frame the decoder gives me nothing back, although it decodes the frame successfully. I can force the decoder to hand over the decoded frame with a second call to avcodec_decode_video2, providing NULL data (flushing it), but if I do that for every frame I get artifacts in the first group of pictures (e.g. the second decoded P-frame comes out gray).
If I do not force the ffmpeg decoder to give me the decoded frame right away, it works flawlessly and without artifacts.
Question: Is it possible to get the decoded frame without feeding the decoder the next frame, and without artifacts?
Small example of how decoding is implemented for each frame:
// decode
int got_frame = 0;
int err = 0;
int tries = 5;
do
{
    err = avcodec_decode_video2(m_CodecContext, m_Frame, &got_frame, &m_Packet);

    /* some codecs, such as MPEG, transmit the I and P frames with a
       latency of one frame. You must do the following to have a
       chance to get the last frame of the video */
    m_Packet.data = NULL;
    m_Packet.size = 0;
    --tries;
}
while (err >= 0 && got_frame == 0 && tries > 0);
But as I said, that gives me artifacts for the first GOP.
Use the "-flags +low_delay" option (or in code, set AVCodecContext.flags |= CODEC_FLAG_LOW_DELAY).
I tested several options, and "-flags low_delay" and "-probesize 32" were more important than the others. The code below worked for me:
AVDictionary* avDic = nullptr;
av_dict_set(&avDic, "flags", "low_delay", 0);
av_dict_set(&avDic, "probesize", "32", 0);
const int errorCode = avformat_open_input(&pFormatCtx, mUrl.c_str(), nullptr, &avDic);
I stream a video using libavformat as follows:
static AVStream *add_stream(AVFormatContext *oc, AVCodec **codec,
                            enum AVCodecID codec_id)
{
    AVCodecContext *c;
    AVStream *st;

    /* find the encoder */
    *codec = avcodec_find_encoder(codec_id);
    if (!(*codec)) {
        fprintf(stderr, "Could not find encoder for '%s'\n",
                avcodec_get_name(codec_id));
        exit(1);
    }

    st = avformat_new_stream(oc, *codec);
    if (!st) {
        fprintf(stderr, "Could not allocate stream\n");
        exit(1);
    }
    st->id = oc->nb_streams-1;
    c = st->codec;

    switch ((*codec)->type) {
    case AVMEDIA_TYPE_AUDIO:
        c->sample_fmt  = (*codec)->sample_fmts ?
            (*codec)->sample_fmts[0] : AV_SAMPLE_FMT_FLTP;
        c->bit_rate    = 64000;
        c->sample_rate = 44100;
        c->channels    = 2;
        break;

    case AVMEDIA_TYPE_VIDEO:
        c->codec_id = codec_id;
        c->bit_rate = 400000;
        /* Resolution must be a multiple of two. */
        c->width    = outframe_width;
        c->height   = outframe_height;
        /* timebase: This is the fundamental unit of time (in seconds) in terms
         * of which frame timestamps are represented. For fixed-fps content,
         * timebase should be 1/framerate and timestamp increments should be
         * identical to 1. */
        c->time_base.den = STREAM_FRAME_RATE;
        c->time_base.num = 1;
        c->gop_size      = 12; /* emit one intra frame every twelve frames at most */
        c->pix_fmt       = STREAM_PIX_FMT;
        if (c->codec_id == AV_CODEC_ID_MPEG2VIDEO) {
            /* just for testing, we also add B frames */
            c->max_b_frames = 2;
        }
        if (c->codec_id == AV_CODEC_ID_MPEG1VIDEO) {
            /* Needed to avoid using macroblocks in which some coeffs overflow.
             * This does not happen with normal video, it just happens here as
             * the motion of the chroma plane does not match the luma plane. */
            c->mb_decision = 2;
        }
        break;

    default:
        break;
    }

    /* Some formats want stream headers to be separate. */
    if (oc->oformat->flags & AVFMT_GLOBALHEADER)
        c->flags |= CODEC_FLAG_GLOBAL_HEADER;

    return st;
}
But when I run this code, I get the following error/warning:
[mpeg @ 01f3f040] VBV buffer size not set, muxing may fail
Do you know how I can set the VBV buffer size in code? In fact, when I use ffplay to display the streamed video, ffplay doesn't show anything for short videos, but for long videos it starts displaying the video immediately. So it looks like ffplay needs its buffer to fill up by some amount before it can start displaying the stream. Am I right?
You can set the VBV buffer size of a stream with:
AVCPBProperties *props;
props = (AVCPBProperties*) av_stream_new_side_data(
st, AV_PKT_DATA_CPB_PROPERTIES, sizeof(*props));
props->buffer_size = 230 * 1024;
props->max_bitrate = 0;
props->min_bitrate = 0;
props->avg_bitrate = 0;
props->vbv_delay = UINT64_MAX;
Here st is a pointer to the AVStream struct. In this example, the min, max, and average bit rates are set to 0 and the VBV delay to UINT64_MAX because those values mean "unknown/unspecified" for these fields (see the AVCPBProperties documentation). Alternatively, you can set them to whatever is reasonable for your specific use case. Just don't forget to assign these fields, because they are not initialized automatically.
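As a side note, since the stream here is produced by your own encoder (see add_stream() above), you can also try setting the encoder's rate-control fields on the codec context before avcodec_open2(); for the MPEG video encoders these feed the VBV parameters written into the stream. A rough sketch with example values only:
/* sketch: give the encoder an explicit VBV/CPB buffer and peak rate */
c->rc_buffer_size = 2 * 1024 * 1024;  /* VBV buffer size, in bits */
c->rc_max_rate    = c->bit_rate;      /* peak bitrate at which the buffer is filled */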