I've been looking all over the Internet for information on calculating the frame length, and it's been hard... I was able to successfully calculate the frame length in ms of MPEG-4 AAC using:
frameLengthMs = mSamplingRate/1000
This works since there is one sample per frame in AAC. For MPEG-1 or MPEG-2 I'm confused: there are 1152 samples per frame, OK, so what do I do with that? :P
Frame sample:
MPEGDecoder(23069): mSamplesPerFrame: 1152
MPEGDecoder(23069): mBitrateIndex: 7
MPEGDecoder(23069): mFrameLength: 314
MPEGDecoder(23069): mSamplingRate: 44100
MPEGDecoder(23069): mMpegAudioVersion 3
MPEGDecoder(23069): mLayerDesc 1
MPEGDecoder(23069): mProtectionBit 1
MPEGDecoder(23069): mBitrateIndex 7
MPEGDecoder(23069): mSamplingRateFreqIndex 0
MPEGDecoder(23069): mPaddingBit 1
MPEGDecoder(23069): mPrivateBit 0
MPEGDecoder(23069): mChannelMode 1
MPEGDecoder(23069): mModeExtension 2
MPEGDecoder(23069): mCopyright 0
MPEGDecoder(23069): mOriginal 1
MPEGDecoder(23069): mEmphasis 0
MPEGDecoder(23069): mBitrate: 96kbps
The duration of an MPEG audio frame is a function of the sampling rate and the number of samples per frame. The formula is:
frameTimeMs = (1000/SamplingRate) * SamplesPerFrame
In your case this would be
frameTimeMs = (1000/44100) * 1152
Which yields ~26 ms per frame. For a different sampling rate you would get a different duration. The key is that MPEG audio always represents a fixed number of samples per frame, but the time duration of each sample depends on the sampling rate.
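As a quick sanity check, here is the same formula in Python (a minimal sketch, not from the original answer), using the values from the frame dump above:

# Frame duration = samples per frame / sampling rate, expressed in milliseconds
def frame_duration_ms(samples_per_frame, sampling_rate_hz):
    return 1000.0 * samples_per_frame / sampling_rate_hz

print(frame_duration_ms(1152, 44100))  # ~26.12 ms, the frame from the log above
print(frame_duration_ms(1152, 48000))  # 24.0 ms, same layer at a different sampling rate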
I want to extract MFCC features of an audio file sampled at 8000 Hz with a frame size of 20 ms and an overlap of 10 ms. What should the parameters of the librosa.feature.mfcc() function be? Does the code written below specify 20 ms chunks with 10 ms overlap?
import librosa as l
x, sr = l.load('/home/user/Data/Audio/Tracks/Dev/FS_P01_dev_001.wav', sr = 8000)
mfccs = l.feature.mfcc(x, sr=sr, n_mfcc = 24, hop_length = 160)
The audio file is 1800 seconds long. Does that mean I would get 24 MFCCs for each of the (1800/0.01)-1 chunks of the audio?
1800 seconds at 8000 Hz are obviously 1800 * 8000 = 14400000 samples.
If your hop length is 160, you get roughly 14400000 / 160 = 90000 MFCC values with 24 dimensions each. So this is clearly not (1800 / 0.01) - 1 = 179999 (off by a factor of roughly 2).
Note that I used roughly in my calculation, because I only used the hop length and ignored the window length. Hop length is the number of samples the window is moved with each step. How many hops you can fit depends on whether you pad somehow or not. And if you decide not to pad, the number of frames also depends on your window size.
To get back to your question: You have to ask yourself how many samples are 10 ms?
If 1 s contains 8000 samples (that's what 8000 Hz means), how many samples are in 0.01 s? That's 8000 * 0.01 = 80 samples.
This means you have a hop length of 80 samples and a window length of 160 samples (0.02 s—twice as long).
Now you should tell librosa to use this info, like this:
import librosa as l
x, sr = l.load('/home/user/Data/Audio/Tracks/Dev/FS_P01_dev_001.wav', sr=8000)
n_fft = int(sr * 0.02)     # window length: 0.02 s = 160 samples
hop_length = n_fft // 2    # usually one specifies the hop length as a fraction of the window length; here 0.01 s = 80 samples
mfccs = l.feature.mfcc(y=x, sr=sr, n_mfcc=24, hop_length=hop_length, n_fft=n_fft)
# check the dimensions
print(mfccs.shape)
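With a hop length of 80 samples over 14400000 samples of audio you should now see on the order of 14400000 / 80 = 180000 frames, i.e. mfccs.shape should come out to roughly (24, 180000); the exact frame count depends on how librosa pads the signal edges.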
Hope this helps.
I am trying to use CoreAudio/AudioToolbox to play multiple MIDI files using different MIDISynth nodes. I have the samplers wired into a MultiChannelMixer, which is in turn wired into the IO unit. I want to be able to change the different input volumes independently of one another. I'm attempting this with this line:
AudioUnitSetParameter(mixerUnit, kMultiChannelMixerParam_Volume, kAudioUnitScope_Input, UInt32(trackIndex), volume, 0)
The problem is that adjusting trackIndex 0 adjusts every input coming into the mixer, not just the one bus like I'm expecting it to.
Here is the output from CAShow of the master graph
AudioUnitGraph 0xC590003:
Member Nodes:
node 1: 'auou' 'rioc' 'appl', instance 0x60000002d580 O I
node 2: 'aumx' 'mcmx' 'appl', instance 0x60000002d680 O I
node 3: 'aumu' 'msyn' 'appl', instance 0x60000002db60 O I
node 4: 'aumu' 'msyn' 'appl', instance 0x60000002ef20 O I
node 5: 'aumu' 'msyn' 'appl', instance 0x60000002df00 O I
node 6: 'aumu' 'msyn' 'appl', instance 0x60800022d820 O I
Connections:
node 2 bus 0 => node 1 bus 0 [ 2 ch, 44100 Hz, 'lpcm' (0x00000029) 32-bit little-endian float, deinterleaved]
node 3 bus 0 => node 2 bus 0 [ 2 ch, 44100 Hz, 'lpcm' (0x00000029) 32-bit little-endian float, deinterleaved]
node 4 bus 0 => node 2 bus 1 [ 2 ch, 44100 Hz, 'lpcm' (0x00000029) 32-bit little-endian float, deinterleaved]
node 5 bus 0 => node 2 bus 2 [ 2 ch, 44100 Hz, 'lpcm' (0x00000029) 32-bit little-endian float, deinterleaved]
node 6 bus 0 => node 2 bus 3 [ 2 ch, 44100 Hz, 'lpcm' (0x00000029) 32-bit little-endian float, deinterleaved]
CurrentState:
mLastUpdateError=0, eventsToProcess=F, isInitialized=T, isRunning=T (2)
Here is the class I wrote to control all of this: https://gist.github.com/jadar/26d9625c875ce91dd2ad0ad63dfd8f80
Mixers are difficult because in a way the channels break the audio-stream paradigm that Core Audio sets up; plus it's different for the crosspoints than for the input/output masters, and different again for the global master.
Based on the code you provided, and assuming you're passing 0 for trackIndex, I would look at the volumes by using the property kAudioUnitProperty_MatrixLevels (which can be tricky to use, so let me know if you need help with that). It's possible that the crosspoint levels are not set correctly, and that lowering trackIndex 0 (which is actually bus 0, channel 0) is affecting everything you hear.
In case it's not clear, the way the bus/channel paradigm works is that channels are numbered contiguously within the mixer itself. So if you have 4 busses with stereo channels and wanted to affect the right channel of bus 3, that would be mixer channel 7.
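To make the indexing concrete, here is a tiny sketch (assuming stereo busses; the helper name is made up and is not part of any Core Audio API):

# Map (bus, channel within bus) to the contiguous mixer channel index,
# assuming every bus carries the same number of channels (2 for stereo)
def mixer_channel(bus, channel_in_bus, channels_per_bus=2):
    return bus * channels_per_bus + channel_in_bus

print(mixer_channel(0, 0))  # bus 0, left  -> mixer channel 0
print(mixer_channel(3, 1))  # bus 3, right -> mixer channel 7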
As far as I know, PCR is stored in 42 bits and PTS in 33 bits in the MPEG-TS container.
So,
Max value for PCR is 2^42 = 4398046511104
Max value for PTS is 2^33 = 8589934592
PCR (sec) = 4398046511104 / 27,000,000 Hz = 162,890.6 seconds (about 45 hours)
PTS (sec) = 8589934592 / 90,000 Hz = 95,443.7 seconds (about 26.5 hours)
So,
what must I do if the PTS or PCR reaches one of these max values?
This can happen in IPTV with a continuous stream.
Max value for PCR is 2^42 = 4398046511104
This is not true: the PCR is not a flat 42-bit counter, but a 33-bit base ticking at 90 kHz plus a 9-bit extension counting 0-299 at 27 MHz, so it wraps after roughly the same 26.5 hours as the PTS. Please refer to: https://stackoverflow.com/a/36810049/6244249
Just let it overflow and continue to write the low 33 bits. The demuxer will know how to handle it.
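For example, the receiving side typically computes timestamp differences modulo 2^33, which makes the wrap point invisible. A minimal sketch of that idea (not taken from any particular demuxer):

PTS_MODULO = 1 << 33  # PTS/DTS are 33-bit counters at 90 kHz

def pts_delta(later, earlier):
    # Difference between two PTS values, tolerant of a single wraparound
    return (later - earlier) % PTS_MODULO

# Just before and just after the 33-bit counter wraps:
before_wrap = PTS_MODULO - 100
after_wrap = 50
print(pts_delta(after_wrap, before_wrap))  # 150 ticks = 150 / 90000 seconds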
This was a midterm question and I do not know how to calculate this.
A CD quality stereo song has been saved on your computer, occupying 35.28
MBytes of storage. The CD quality mandates that we have 16-bit quantization as
well as a uniform sampling of 44.1 KHz (samples/second). Find the duration of
this song (Hint: 1 Byte = 8 bits).
Hint: follow the units.
Edit: forgot about stereo.
(total size = bits) / (channels) / (quantization = bits/sample) / (sampling rate = samples/second)
= 35.28 MB / (2 channels for stereo) / (16 bits/sample) / (44.1 k samples/second) = (35.28 * 8000000) / (2 * 16 * 44100)
= x seconds (assuming MB = 8000000 bits, not MiB)
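Following the units in Python (a quick check, assuming 1 MB = 10^6 bytes as above):

size_bits = 35.28e6 * 8      # 35.28 MB of storage, 8 bits per byte
channels = 2                 # stereo
bits_per_sample = 16         # CD quantization
sample_rate = 44100          # samples per second per channel

duration_s = size_bits / (channels * bits_per_sample * sample_rate)
print(duration_s)  # 200.0 seconds, i.e. 3 minutes 20 seconds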
I saw the following explanation of motion estimation / compensation for MPEG-1 and was just wondering whether it is correct:
Why don't we just code the raw difference between the current block and the reference block?
Because the numbers for the residual are usually going to be a lot smaller. For example, say an object accelerates across the image. The x position over 11 frames was the following:
12 16 20 25 31 38 48 59 72 84 96
The raw differences would be
x 4 4 5 6 7 10 11 13 12 12
So the predicted values would be
x x 20 24 30 37 45 58 70 85 96
So the residuals are
x x 0 1 1 1 3 1 2 -1 0
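The numbers in that quoted example can be reproduced with a few lines of Python (a quick sketch, not part of the original explanation):

positions = [12, 16, 20, 25, 31, 38, 48, 59, 72, 84, 96]

# Raw frame-to-frame differences
diffs = [b - a for a, b in zip(positions, positions[1:])]
print(diffs)  # [4, 4, 5, 6, 7, 10, 11, 13, 12, 12]

# Linear prediction: previous value plus the previous difference
predicted = [positions[i - 1] + (positions[i - 1] - positions[i - 2])
             for i in range(2, len(positions))]
print(predicted)  # [20, 24, 30, 37, 45, 58, 70, 85, 96]

# Residuals: actual value minus predicted value
residuals = [positions[i] - p for i, p in zip(range(2, len(positions)), predicted)]
print(residuals)  # [0, 1, 1, 1, 3, 1, 2, -1, 0]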
Is the prediction for frame[i+1] = (frame[i] - frame[i-1]) + frame[i], i.e. add the motion derived from the previous two reference frames to the most recent reference frame? Then we encode the prediction residual, which is the actual captured frame[i+1] minus the predicted frame[i+1], and send this to the decoder?
MPEG-1 decoding (motion compensation) works like this:
The predictions and motion vectors turn a reference frame into the next (current) frame. Here's how you would calculate each pixel of the new frame:
For each macroblock, you have a set of predicted values (differences from reference frame). The motion vector is a value relative to the reference frame.
// Each luma and chroma block is 8x8 pixels
for (int y = 0; y < 8; y++)
{
    for (int x = 0; x < 8; x++)
    {
        // decoded residual + the motion-compensated pixel fetched from the reference frame
        NewPixel[x][y] = Prediction[x][y] + RefPixel[x + motion_vector_x][y + motion_vector_y];
    }
}
With MPEG-1 you have I, P and B frames. I frames are completely intra-coded (similar to JPEG, for example), with no references to other frames. P frames are coded with predictions from the previous frame (either I or P). B frames are coded with predictions from both directions (the previous and the next frame). The B-frame processing makes the video player a little more complicated because a B frame may reference the next frame, so each frame carries a sequence number and B frames cause the transmission order to be non-linear. In other words, your video decoder needs to hold on to potentially 3 frames while decoding a stream (previous, current and next).
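To make the reordering concrete, here is a small sketch (an assumed toy stream with two B frames between reference frames; not actual decoder logic):

# Frames as (display_number, type). In the bitstream a B frame is sent *after*
# both reference frames it predicts from, so decode order differs from display order.
decode_order = [(0, 'I'), (3, 'P'), (1, 'B'), (2, 'B'), (6, 'P'), (4, 'B'), (5, 'B')]

# The decoder keeps the two most recent reference frames; the sequence numbers
# let it put the frames back into presentation order.
display_order = sorted(decode_order, key=lambda frame: frame[0])
print([n for n, _ in decode_order])   # [0, 3, 1, 2, 6, 4, 5]
print([n for n, _ in display_order])  # [0, 1, 2, 3, 4, 5, 6]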