Pydub AudioSegment output shorter than the sum of the concatenated constituent AudioSegments' durations

I use Pydub to concatenate very short WAV audio files (200 ms) containing sounds and silences for an experiment. The sounds were previously created and manipulated in Audacity to have specific durations and characteristics. Concatenated, they should form an AudioSegment of 2400 ms (12 sound elements of 200 ms each). Nevertheless, once I build the new segment I get an output duration of 1375 ms.
I have repeatedly checked the duration of each small AudioSegment chunk, and Pydub always reports the correct 200 ms.
It's very important that the duration precisely matches 2400 ms for the purpose of the experiment.
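For reference, a minimal sketch of the kind of concatenation and duration check described above (the file names are hypothetical placeholders; pydub's len() reports duration in milliseconds):
```python
from pydub import AudioSegment

# Hypothetical file names; each clip was prepared in Audacity to be exactly 200 ms.
paths = [f"element_{i:02d}.wav" for i in range(12)]
chunks = [AudioSegment.from_wav(p) for p in paths]

for p, c in zip(paths, chunks):
    print(p, len(c), "ms")             # expect 200 for each chunk

combined = sum(chunks[1:], chunks[0])  # AudioSegments concatenate with +
print("in memory:", len(combined), "ms")   # expected 2400 ms

combined.export("combined.wav", format="wav")
reloaded = AudioSegment.from_wav("combined.wav")
print("on disk:", len(reloaded), "ms")     # re-check the exported duration
```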

Related

How to increase sample rate of wav file and keep speech normal?

I'd like to increase the sample rate of a wav file using the Ruby wavefile gem, but keep the voices in the wav file sounding normal. Basically this example, but modified so the voices sound normal:
https://github.com/jstrait/wavefile/wiki/WaveFile-Tutorial#copying-a-wave-file-to-different-format
```
require 'wavefile'
include WaveFile

SAMPLES_PER_BUFFER = 4096

Writer.new("copy.wav", Format.new(:stereo, :pcm_16, 44100)) do |writer|
  Reader.new("original.wav").each_buffer(SAMPLES_PER_BUFFER) do |buffer|
    writer.write(buffer)
  end
end
```
You can think of a wav file as a collection of data points that are sampled at the given rate (the sample rate). If you have a given collection at a known sample rate and you wish to increase the resolution of samples per unit of time, then you will either need to take new samples (i.e. make a new original recording at the desired rate) or extrapolate/approximate the necessary samples at the new data points that would appear between the ones you already have.
For example, if you have a recording at 22 kHz and you want to increase it to 44 kHz, then one option is to simply take each data point and repeat it once, so you wind up with something like the following:
```
Original waveform samples over time:
1.2.3.4.5.6
if you only change sample rate:
123456 (same sound in half the time)
new waveform samples over time:
112233445566
```
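The same sample-repetition idea, sketched outside the wavefile gem using Python's standard wave module purely for illustration (file names are placeholders; this is the naive zero-order hold described above, not a proper resampler):
```python
import wave

with wave.open("original.wav", "rb") as reader:
    params = reader.getparams()
    frames = reader.readframes(reader.getnframes())

frame_size = params.sampwidth * params.nchannels

# Repeat each frame once so the audio covers the same wall-clock time
# even though the sample rate doubles.
doubled = b"".join(
    frames[i:i + frame_size] * 2
    for i in range(0, len(frames), frame_size)
)

with wave.open("upsampled.wav", "wb") as writer:
    writer.setnchannels(params.nchannels)
    writer.setsampwidth(params.sampwidth)
    writer.setframerate(params.framerate * 2)  # e.g. 22050 -> 44100
    writer.writeframes(doubled)
```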

Comparing bytes from two wav files

Suppose I save an array of bytes from a recorded wav file of, for example, someone saying their name. If I then compare that array against the bytes of a new wav file of the same person saying their name, would I be able to tell it was the same person?
No, there are so many factors involved that simply comparing the files will not suffice.
There's background noise and static, the sample rate could differ from the first recording, they could have their mouth closer to the microphone than last time... and that's before getting into the technical questions, like your definition of 'compare'. Besides, the number of bytes you would have to 'compare' would be ridiculous: each byte is so small in relation to the whole recording that individually they mean essentially nothing.

How do I create timestamps in my decoder filter?

Thanks to Roman R's answer to my previous question I now have an asynchronous filter wrapping a 3rd party decoder library.
The encoded input samples are coming from a network source. At present I am not adding timestamps to the decoded frames so the framerate is rather jerky as it is dependent on the time the data packets are received.
When the library decodes a full frame it also provides a UTC timestamp of the time the frame was captured according to the clock on the source encoder.
The question is: how can I relate this to the stream time and create a sensible value for the SetTime function? I've played around with it, but whatever values I put in just seem to lock up the filter graph at the CBaseOutputPin::Deliver function.
The easiest time stamping is as follows. You time stamp your first media sample with zero (see the adjustment note in the next paragraph) and the following samples are stamped with their difference from it. That is, you start streaming and obtain the first sample from your network source; you remember its time UTC0 and attach zero to the DirectShow media sample. Following frames 1, 2 ... N with UTC times UTC1 ... UTCN will be converted to DirectShow times UTCN - UTC0. You might need an additional conversion to proper units, as DirectShow needs 100 ns units and your network source might be giving you something like 1/90000 s.
Since your source is presumably a live source, and your first frame might be obtained not exactly at graph run time, you may need to adjust the resulting media sample time stamps by the difference between the current filter graph's IReferenceClock::GetTime and the time received as the argument to the IBaseFilter::Run call.
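The arithmetic, sketched in Python purely for illustration (the 90 kHz source clock is only the example unit mentioned above, and the class name is made up; in the filter itself this would be a few lines of C++ feeding IMediaSample::SetTime):
```python
UNITS_PER_SECOND = 10_000_000   # DirectShow time stamps are in 100 ns units
SOURCE_CLOCK_HZ = 90_000        # assumed resolution of the network source's clock

class TimestampMapper:
    def __init__(self, stream_offset=0):
        # stream_offset = IReferenceClock::GetTime() minus the time passed to
        # IBaseFilter::Run(), i.e. how far into the stream the first frame arrived.
        self.stream_offset = stream_offset
        self.first_ts = None

    def to_reference_time(self, source_ts):
        """Map a source timestamp to a media sample start time.

        The first frame maps to stream_offset; later frames are offset by
        their delta from the first timestamp, converted to 100 ns units.
        """
        if self.first_ts is None:
            self.first_ts = source_ts
        delta = source_ts - self.first_ts
        return self.stream_offset + delta * UNITS_PER_SECOND // SOURCE_CLOCK_HZ
```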

Fuzzy matching/chunking algorithm

Background: I have video clips and audio tracks that I want to sync with said videos.
From the video clips, I'll extract a reference audio track.
I also have another track that I want to synchronize with the reference track. The desync comes from editing, which altered the intervals for each cutscene.
I need to manipulate the target track to look like (sound like, in this case) the ref track. This amounts to adding or removing silence at the correct locations. This could be done manually, but it would be extremely tedious, so I want to be able to determine these locations programmatically.
Example:
```
     0         1         2
     012345678901234567890123
ref: --part1------part2------
syn: -----part1----part2-----
# (let `-` denote silence)
```
Output:
```
[(2, 6), (5, 9),      # part1
 (13, 17), (14, 18)]  # part2
```
My idea is, starting from the beginning:
```
Fingerprint 2 large chunks* of audio and see if they match:
    If yes: move on to the next chunk
    If not:
        Go down both tracks looking for the first non-silent portion of each
        Offset the target to match the original
        Go back to the beginning of the loop

# * chunk size determined by heuristics and modifiable
```
The main problem here is that sound matching and fingerprinting are fuzzy and relatively expensive operations.
Ideally I want to do them as few times as possible. Ideas?
It sounds like you're not looking to spend a lot of time delving into audio processing/engineering, and hence you want something you can quickly understand and that just works. If you're willing to go with something more complex, see here for a very good reference.
That being the case, I'd expect simple loudness and zero crossing measures would be sufficient to identify portions of sound. This is great because you can use techniques similar to rsync.
Choose some number of samples as a chunk size and march through your reference audio data at a regular interval. For each chunk, calculate the zero-crossing measure (you likely want a logarithm, or a fast approximation of one, of a simple zero-crossing count). Store the chunks in a 2D spatial structure based on time and the zero-crossing measure.
Then march through your actual audio data a much finer step at a time. (Probably doesn't need to be as small as one sample.) Note that you don't have to recompute the measures for the entire chunk size -- just subtract out the zero-crossings no longer in the chunk and add in the new ones that are. (You'll still need to compute the logarithm or approximation thereof.)
Look for the 'next' chunk with a close enough frequency. Note that since what you're looking for is in order from start to finish, there's no reason to look at *all* chunks. In fact, we don't want to, since we'd be far more likely to get false positives.
If the chunk matches well enough, see if it matches all the way out to silence.
The only concerning point is the 2D spatial structure, but honestly this can be made much easier if you're willing to forgive a strict window of approximation. Then you can just have overlapping bins. That way all you need to do is check two bins for all the values after a certain time -- essentially two binary searches through a search structure.
The disadvantage to all of this is it may require some tweaking to get right and isn't a proven method.
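A rough sketch of that rolling zero-crossing measure (assuming mono NumPy sample arrays; chunk size and step are left to the heuristics mentioned above):
```python
import numpy as np

def zero_crossing_measures(samples, chunk_size, step):
    """Log zero-crossing measure for chunks starting every `step` samples.

    Illustrative only: sign changes are computed once, and a prefix sum
    gives each chunk's count without rescanning it (the rsync-like trick).
    """
    signs = np.signbit(samples).astype(np.int8)
    crossings = np.abs(np.diff(signs))                  # 1 wherever the signal changes sign
    csum = np.concatenate(([0], np.cumsum(crossings)))  # prefix sums of crossing counts
    starts = np.arange(0, len(crossings) - chunk_size + 1, step)
    counts = csum[starts + chunk_size] - csum[starts]   # crossings inside each chunk
    return starts, np.log1p(counts)                     # log measure; +1 handles silent chunks
```
For the reference track you would use step equal to the chunk size; for the target track a much finer step, as described above.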
If you can reliably distinguish silence from non-silence as you suggest and if the only differences are insertions of silence, then it seems the only non-trivial case is where silence is inserted where there was none before:
```
ref: --part1part2--
syn: ---part1---part2----
```
If you can make your chunk size adaptive to the silence, your algorithm should be fine. That is, if your chunk size is equivalent to two characters in the above example, your algorithm would recognize "pa" matches "pa" and "rt" matches "rt" but for the third chunk it must recognize the silence in syn and adapt the chunk size to compare "1" to "1" instead of "1p" to "1-".
For more complicated edits, you might be able to adapt a weighted shortest-edit-distance algorithm in which removing silence has zero cost.
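A toy sketch of that idea, with characters standing in for per-chunk fingerprints and a cost scheme that is purely an assumption (deleting silence from the target is free, every other edit costs 1):
```python
def silence_tolerant_distance(ref, syn, silence="-"):
    """Weighted edit distance where removing silence from `syn` costs nothing."""
    m, n = len(ref), len(syn)
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if dp[i][j] == INF:
                continue
            here = dp[i][j]
            if i < m and j < n:   # match or substitute
                cost = 0 if ref[i] == syn[j] else 1
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], here + cost)
            if j < n:             # drop a symbol from syn; free if it is silence
                cost = 0 if syn[j] == silence else 1
                dp[i][j + 1] = min(dp[i][j + 1], here + cost)
            if i < m:             # drop a symbol from ref
                dp[i + 1][j] = min(dp[i + 1][j], here + 1)
    return dp[m][n]

# The example above aligns with no real edits, only silence removal:
print(silence_tolerant_distance("--part1part2--", "---part1---part2----"))  # -> 0
```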

avcodec/avformat Number of Frames in File

I need to know how to find the total number of frames in a video file using avcodec/avformat.
I have a project that I'm picking up from someone else who was using the ffmpeg libraries to decode video streams. I need to retrofit some functionality to seek around frame by frame, so my first task is simply to figure out the total number of frames in the file. AVStream.nb_frames seems like a reasonable place to look, but it is always 0 for all of the video files I've tried. Is deducing the total number of frames from AVFormatContext.duration the best way to go?
The only way to find the exact number of frames is to go through them all and count. I have worried about this many times, tried many different tools (including ffmpeg), and read a lot. Sorry, but in the general case there's no other way. Some formats just don't store this information, so you have to count.
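If a sketch of the brute-force count helps, the same libavformat/libavcodec path via the PyAV bindings looks roughly like this (exactness costs a full decode; in the C API you would loop over av_read_frame and the decoder instead):
```python
import av  # PyAV: Python bindings over libavformat/libavcodec

def count_frames(path):
    """Count video frames by decoding them all, since the metadata can't be trusted."""
    container = av.open(path)
    try:
        video = container.streams.video[0]
        # video.frames mirrors AVStream.nb_frames and is often 0, as in the question,
        # so decode every packet and count the frames that actually come out.
        return sum(1 for _ in container.decode(video))
    finally:
        container.close()
```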
