avcodec/avformat Number of Frames in File - ffmpeg

I need to know how to find the total number of frames in a video file using avcodec/avformat.
I have a project that I'm picking up from someone else who was using the ffmpeg libraries to decode video streams. I need to retrofit some functionality to seek around frame by frame, so my first task is simply to figure out the total number of frames in the file. AVStream.nb_frames seems a reasonable place to look, but it is always 0 with all of the video files I've tried. Is deducing the total number of frames from AVFormatContext.duration the best way to go?
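For concreteness, the duration-based fallback looks roughly like the sketch below. It assumes a roughly constant frame rate and a container that actually reports a duration; the estimate_frames helper name is only illustrative.

#include <libavformat/avformat.h>

/* Rough estimate only: trusts nb_frames if the container provides it,
 * otherwise multiplies the container duration by the average frame rate.
 * Not exact for variable-frame-rate or poorly muxed files. */
int64_t estimate_frames(const char *path)
{
    AVFormatContext *fmt = NULL;
    int64_t estimate = -1;

    if (avformat_open_input(&fmt, path, NULL, NULL) < 0)
        return -1;
    if (avformat_find_stream_info(fmt, NULL) < 0)
        goto done;

    int vs = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);
    if (vs < 0)
        goto done;

    AVStream *st = fmt->streams[vs];
    if (st->nb_frames > 0) {
        estimate = st->nb_frames;              /* the container told us outright */
    } else if (fmt->duration > 0 && st->avg_frame_rate.den != 0) {
        double seconds = fmt->duration / (double)AV_TIME_BASE;
        estimate = (int64_t)(seconds * av_q2d(st->avg_frame_rate) + 0.5);
    }

done:
    avformat_close_input(&fmt);
    return estimate;
}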

The only way to find the exact number of frames is to go through them all and count. I have worried about this many times, tried many different tools (including ffmpeg), and read a lot. Sorry, but in the general case there's no other way. Some formats just don't store this information, so you have to count.
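If you do need the exact count, the brute-force version of "go through them all and count" with libavformat looks roughly like this. It counts demuxed packets on the video stream, which for most containers is one per frame; the count_frames name is just illustrative, and if you need decoded, displayable frames you would run the decoder as well.

#include <libavformat/avformat.h>

/* Exact count by demuxing: read every packet and count those belonging
 * to the video stream. Slow for big files, but works when nb_frames is 0. */
int64_t count_frames(const char *path)
{
    AVFormatContext *fmt = NULL;
    int64_t frames = 0;

    if (avformat_open_input(&fmt, path, NULL, NULL) < 0)
        return -1;
    if (avformat_find_stream_info(fmt, NULL) < 0) {
        avformat_close_input(&fmt);
        return -1;
    }

    int vs = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);
    if (vs < 0) {
        avformat_close_input(&fmt);
        return -1;
    }

    AVPacket *pkt = av_packet_alloc();
    while (av_read_frame(fmt, pkt) >= 0) {
        if (pkt->stream_index == vs)
            frames++;
        av_packet_unref(pkt);
    }

    av_packet_free(&pkt);
    avformat_close_input(&fmt);
    return frames;
}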

Related

FFmpeg (HLS) - Split segments by size rather than time (or limit segment size)

So I have an ffmpeg command with the following parameters (in NodeJS, but it doesn't matter):
'-start_number 0',
'-hls_key_info_file ' + HLSKEY_PATH,
'-hls_time 5',
'-hls_playlist_type vod',
'-hls_segment_filename seg-%d.ts'
Everything's working fine, but there's a problem with the size of each segment.
Why can't ffmpeg create segments by size rather than time (in this case -hls_time 5)? I was hoping to do something like -hls_size 4096, but there's no option for that, at least as far as I know. Or can I at least set a limit? Like: "Don't create segments bigger than 4 MB; split them again if necessary."
I know I could re-encode the file, and then the segments wouldn't vary in size as much, but that is not an option for me.
Thanks in advance!
Not ALL frames have the same amount of information in them. Video compression works in a way that some frames have LESS information and others have MORE. Technically, then, if the segment duration in milliseconds is ALWAYS the same, there is no way the fragment sizes will be the same.
I think there is no option for what you're asking.
There is a "-fs" (file size limit) option, but I think it limits the size of the whole output, not individual segments.
In your place, if the segment sizes were bothering me, I would just lower the segment duration by one second, then two, and see what happens. Good luck.
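If you try the shorter-duration route, only the -hls_time value changes in the parameter list above; a sketch with an illustrative value of 4 seconds (note that the hls muxer can only cut at keyframes, so -hls_time is a target, not a hard limit):

'-start_number 0',
'-hls_key_info_file ' + HLSKEY_PATH,
'-hls_time 4',
'-hls_playlist_type vod',
'-hls_segment_filename seg-%d.ts'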

Comparing bytes from two wav files

If I were to save an array of bytes from a recorded wav file of someone saying their name, for example, and then compare it to the array of bytes from a new wav file of the same person saying their name, would I be able to tell it was the same person saying their name?
No, there are so many factors that simply comparing the file will not suffice.
There's background noise and static, the sample rate could be off from the first recording, they could have their mouth closer to the microphone than last time... and that's before getting into the technical questions, like your definition of 'compare'. Plus, the number of bytes you would have to 'compare' would be ridiculous: each byte is so small in relation to the whole thing that individually they mean basically nothing.

Does the order of data in a text file affect its compression ratio?

I have 2 large text files (csv, to be precise). Both have the exact same content except that the rows in one file are in one order and the rows in the other file are in a different order.
When I compress these 2 files (programmatically, using DotNetZip) I notice that one of the files is always considerably bigger; for example, one file is ~7 MB bigger than the other.
My questions are:
How does the order of data in a text file affect compression, and what measures can one take to guarantee the best compression ratio? I presume that having similar rows grouped together (at least in the case of ZIP files, which is what I am using) would help compression, but I am not familiar with the internals of the different compression algorithms and would appreciate a quick explanation on this subject.
Which algorithm handles this sort of scenario better, in the sense that it would achieve the best average compression regardless of the order of the data?
"How" has already been answered. To answer your "which" question:
The larger the window for matching, the less sensitive the algorithm will be to the order. However all compression algorithms will be sensitive to some degree.
gzip has a 32K window, bzip2 a 900K window, and xz an 8MB window. xz can go up to a 64MB window. So xz would be the least sensitive to the order. Matches that are further away will take more bits to code, so you will always get better compression with, for example, sorted records, regardless of the window size. Short windows simply preclude distant matches.
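A way to see the window effect directly with zlib (the block and filler sizes here are just illustrative): the same 4 KB block appears twice in both buffers below, but in one buffer the copies are adjacent and in the other they are separated by 64 KB of incompressible filler, which pushes the first copy outside deflate's 32 KB window.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>   /* link with -lz */

#define BLOCK  4096     /* the repeated 4 KB block */
#define FILLER 65536    /* 64 KB of incompressible filler */

static uLong deflated_size(const unsigned char *src, uLong len)
{
    uLong bound = compressBound(len), out = bound;
    unsigned char *dst = malloc(bound);
    if (compress(dst, &out, src, len) != Z_OK)
        out = 0;
    free(dst);
    return out;
}

int main(void)
{
    static unsigned char block[BLOCK], filler[FILLER];
    static unsigned char near_copies[BLOCK * 2 + FILLER];
    static unsigned char far_copies[BLOCK * 2 + FILLER];

    srand(1);
    for (int i = 0; i < BLOCK;  i++) block[i]  = rand() & 0xFF;
    for (int i = 0; i < FILLER; i++) filler[i] = rand() & 0xFF;

    /* Same content, different order: filler | block | block ... */
    memcpy(near_copies, filler, FILLER);
    memcpy(near_copies + FILLER, block, BLOCK);
    memcpy(near_copies + FILLER + BLOCK, block, BLOCK);

    /* ... versus block | filler | block, with the copies 64 KB apart. */
    memcpy(far_copies, block, BLOCK);
    memcpy(far_copies + BLOCK, filler, FILLER);
    memcpy(far_copies + BLOCK + FILLER, block, BLOCK);

    printf("copies adjacent:    %lu bytes\n", deflated_size(near_copies, sizeof near_copies));
    printf("copies 64 KB apart: %lu bytes\n", deflated_size(far_copies, sizeof far_copies));
    return 0;
}

The second figure should come out roughly 4 KB larger, because the distant copy cannot be encoded as a back-reference.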
In some sense, it is the entropy of the file that defines how well it will compress. So, yes, the order definitely matters. As a simple example, consider a file filled with the values abcdefgh...zabcd...z repeating over and over. It would compress very well with most algorithms because it is very ordered. However, if you completely randomize the order (but leave the same count of each letter), then it contains the exact same data (although with a different "meaning"). It is the same data in a different order, and it will not compress as well.
In fact, because I was curious, I just tried that. I filled an array with 100,000 characters a-z repeating, wrote that to a file, then shuffled that array "randomly" and wrote it again. The first file compressed down to 394 bytes (less than 1% of the original size). The second file compressed to 63,582 bytes (over 63% of the original size).
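That experiment is easy to reproduce with zlib; a minimal sketch (the 100,000-character figure matches the description above, but the shuffle, the seed, and the exact byte counts are illustrative and will differ from the numbers quoted, which came from a different compressor):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>   /* link with -lz */

#define N 100000

static uLong deflated_size(const unsigned char *src, uLong len)
{
    uLong bound = compressBound(len), out = bound;
    unsigned char *dst = malloc(bound);
    if (compress(dst, &out, src, len) != Z_OK)
        out = 0;
    free(dst);
    return out;
}

int main(void)
{
    static unsigned char buf[N];

    /* Highly ordered data: a-z repeating. */
    for (int i = 0; i < N; i++)
        buf[i] = 'a' + (i % 26);
    printf("ordered:  %lu bytes\n", deflated_size(buf, N));

    /* Same bytes, same count of each letter, random order (Fisher-Yates shuffle). */
    srand(42);
    for (int i = N - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        unsigned char t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }
    printf("shuffled: %lu bytes\n", deflated_size(buf, N));
    return 0;
}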
A typical compression algorithm works as follows. Look at a chunk of data. If it's identical to some other recently seen chunk, don't output the current chunk literally, output a reference to that earlier chunk instead.
It surely helps when similar chunks are close together. The algorithm only keeps a limited amount of look-back data, to keep compression speed reasonable, so even if a chunk of data is identical to some other chunk, that other chunk may already have been flushed out of the window.
Sure it does. If the input pattern is fixed, there is a 100% chance to predict the character at each position. Given that two parties know this about their data stream (which essentially amounts to saying that they know the fixed pattern), virtually nothing needs to be communicated: total compression is possible (to communicate finite-length strings, rather than unlimited streams, you'd still need to encode the length, but that's sort of beside the point). If the other party doesn't know the pattern, all you'd need to do is to encode it. Total compression is possible because you can encode an unlimited stream with a finite amount of data.
At the other extreme, if you have totally random data - so the stream can be anything, and the next character can always be any valid character - no compression is possible. The stream must be transmitted completely intact for the other party to be able to reconstruct the correct stream.
Finite strings are a little trickier. Since finite strings necessarily contain a fixed number of instances of each character, the probabilities must change once you begin reading off initial tokens. One can read some sort of order into any finite string.
Not sure if this answers your question, but it addresses things a bit more theoretically.

How do I create timestamps in my decoder filter?

Thanks to Roman R's answer to my previous question I now have an asynchronous filter wrapping a 3rd party decoder library.
The encoded input samples are coming from a network source. At present I am not adding timestamps to the decoded frames so the framerate is rather jerky as it is dependent on the time the data packets are received.
When the library decodes a full frame it also provides a UTC timestamp of the time the frame was captured according to the clock on the source encoder.
The question is: how can I relate this to the stream time and create a sensible value for the SetTime function? I've played around with it, but whatever values I put in just seem to lock up the filter graph at the CBaseOutputPin::Deliver function.
The easiest time stamping is as follows. You time stamp your first media sample with zero (see the adjustment note in the next paragraph) and the following samples are stamped with their offset from it. That is, you start streaming, you obtain the first sample from your network source, you remember its time UTC0 and attach zero to the DirectShow media sample. Following frames 1, 2, ..., N with UTC times UTC1 ... UTCN are then given DirectShow times UTCN - UTC0. You might need an additional conversion to proper units, as DirectShow needs 100 ns units and your network source might be giving you something like 1/90000 s.
Since your source is presumably a live source, and your first frame might not be obtained exactly at graph run time, you may also need to adjust the resulting media sample time stamp by the difference between the filter graph's current IReferenceClock::GetTime value and the time received as the argument to the IBaseFilter::Run call.
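A sketch of just that arithmetic, not a complete filter (the function and parameter names are illustrative, and it assumes the decoder library reports capture time in milliseconds):

#include <stdint.h>

/* DirectShow media sample times are REFERENCE_TIME values in 100 ns units. */
typedef int64_t REFERENCE_TIME;

/* utc_now_ms / utc_first_ms: capture times reported by the decoder library.
 * stream_offset: stream time when the first frame arrived, i.e. the filter
 * graph's IReferenceClock::GetTime minus the time passed to IBaseFilter::Run. */
REFERENCE_TIME sample_start_time(int64_t utc_now_ms,
                                 int64_t utc_first_ms,
                                 REFERENCE_TIME stream_offset)
{
    /* 1 ms = 10,000 units of 100 ns */
    REFERENCE_TIME elapsed = (utc_now_ms - utc_first_ms) * 10000;
    return elapsed + stream_offset;
}

The resulting start value is what you would pass to IMediaSample::SetTime, with the stop time typically the start plus one frame duration.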

Fuzzy matching/chunking algorithm

Background: I have video clips and audio tracks that I want to sync with said videos.
From the video clips, I'll extract a reference audio track.
I also have another track that I want to synchronize with the reference track. The desync comes from editing, which altered the intervals for each cutscene.
I need to manipulate the target track to look like (sound like, in this case) the ref track. This amounts to adding or removing silence at the correct locations. This could be done manually, but it'd be extremely tedious, so I want to be able to determine these locations programmatically.
Example:
0 1 2
012345678901234567890123
ref: --part1------part2------
syn: -----part1----part2-----
# (let `-` denote silence)
Output:
[(2,6), (5,9) # part1
(13, 17), (14, 18)] # part2
My idea is, starting from the beginning:
Fingerprint 2 large chunks* of audio and see if they match:
If yes: move on to the next chunk
If not:
Go down both tracks looking for the first non-silent portion of each
Offset the target to match the original
Go back to the beginning of the loop
# * chunk size determined by heuristics and modifiable
The main problem here is sound matching and fingerprinting are fuzzy and relatively expensive operations.
Ideally I want to run them as few times as possible. Ideas?
Sounds like you're not looking to spend a lot of time delving into audio processing/engineering, and hence you want something you can quickly understand and that just works. If you're willing to go with something more complex, see here for a very good reference.
That being the case, I'd expect simple loudness and zero crossing measures would be sufficient to identify portions of sound. This is great because you can use techniques similar to rsync.
Choose some number of samples (let's call it the 'chunk size') and march through your reference audio data at a regular interval. Calculate the zero-crossing measure (you likely want a logarithm, or a fast approximation of one, of a simple zero-crossing count). Store the chunks in a 2D spatial structure based on time and the zero-crossing measure.
Then march through your actual audio data at a much finer step. (It probably doesn't need to be as small as one sample.) Note that you don't have to recompute the measure for the entire chunk -- just subtract out the zero-crossings that are no longer in the chunk and add in the new ones that are. (You'll still need to compute the logarithm or approximation thereof.)
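A sketch of that incremental update over 16-bit PCM samples (the function names and the one-sample step are illustrative; a coarser step works the same way, you just add and subtract a few crossings at each end):

#include <stddef.h>
#include <stdint.h>

/* 1 if two consecutive samples straddle zero, else 0. */
static int crossing(int16_t a, int16_t b)
{
    return (a >= 0) != (b >= 0);
}

/* Count zero crossings in samples[start .. start+chunk-1]. */
size_t zc_full(const int16_t *samples, size_t start, size_t chunk)
{
    size_t count = 0;
    for (size_t i = start + 1; i < start + chunk; i++)
        count += crossing(samples[i - 1], samples[i]);
    return count;
}

/* Slide the window one sample to the right without recounting:
 * drop the crossing between the pair leaving at the front and
 * add the crossing between the pair entering at the back. */
size_t zc_slide(const int16_t *samples, size_t start, size_t chunk, size_t count)
{
    count -= crossing(samples[start], samples[start + 1]);
    count += crossing(samples[start + chunk - 1], samples[start + chunk]);
    return count;
}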
Look for the 'next' chunk with a close enough frequency. Note that since what you're looking for is in order from start to finish, there's no reason to look at -all- chunks. In fact, we don't want to since we're far more likely to get false positives.
If the chunk matches well enough, see if it matches all the way out to silence.
The only concerning point is the 2D spatial structure, but honestly this can be made much easier if you're willing to forgive a strict window of approximation. Then you can just have overlapping bins. That way all you need to do is check two bins for all the values after a certain time -- essentially two binary searches through a search structure.
The disadvantage to all of this is it may require some tweaking to get right and isn't a proven method.
If you can reliably distinguish silence from non-silence as you suggest and if the only differences are insertions of silence, then it seems the only non-trivial case is where silence is inserted where there was none before:
ref: --part1part2--
syn: ---part1---part2----
If you can make your chunk size adaptive to the silence, your algorithm should be fine. That is, if your chunk size is equivalent to two characters in the above example, your algorithm would recognize "pa" matches "pa" and "rt" matches "rt" but for the third chunk it must recognize the silence in syn and adapt the chunk size to compare "1" to "1" instead of "1p" to "1-".
For more complicated edits, you might be able to adapt a weighted shortest-edit-distance algorithm in which removing silence has zero cost.
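A rough sketch of that idea over tokenized chunks, where '-' stands for a silent chunk as in the example above (the tokenization, the 0/1 costs, and the 256-token limit are assumptions):

#include <stddef.h>
#include <string.h>

#define SILENCE '-'   /* token representing a silent chunk */

static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Edit distance between ref and syn where inserting or deleting the silence
 * token is free and everything else costs 1. O(n*m) time; assumes both
 * strings are shorter than 256 tokens. */
size_t silence_edit_distance(const char *ref, const char *syn)
{
    size_t n = strlen(ref), m = strlen(syn);
    static size_t d[256][256];

    d[0][0] = 0;
    for (size_t i = 1; i <= n; i++)
        d[i][0] = d[i - 1][0] + (ref[i - 1] == SILENCE ? 0 : 1);
    for (size_t j = 1; j <= m; j++)
        d[0][j] = d[0][j - 1] + (syn[j - 1] == SILENCE ? 0 : 1);

    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            size_t sub = d[i - 1][j - 1] + (ref[i - 1] == syn[j - 1] ? 0 : 1);
            size_t del = d[i - 1][j]     + (ref[i - 1] == SILENCE ? 0 : 1);
            size_t ins = d[i][j - 1]     + (syn[j - 1] == SILENCE ? 0 : 1);
            d[i][j] = min3(sub, del, ins);
        }
    }
    return d[n][m];
}

With the ref and syn strings from the example above it returns 0, since the two tracks differ only in where the silence sits; tracing back through the table tells you where silence has to be inserted or removed.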
