What are the components of the MFCC output - librosa

In looking at the output of this line of code:
mfccs = librosa.feature.mfcc(y=librosa_audio, sr=librosa_sample_rate, n_mfcc=40)
print("MFCC Shape = ", mfccs.shape)
I get a response of MFCC Shape = (40, 1876). What do these two numbers represent? I looked at the librosa website but still could not decipher what these two values are.
Any insights will be greatly appreciated!

The first dimension (40) is the number of MFCC coefficients, and the second dimension (1876) is the number of time frames. The number of coefficients is specified by n_mfcc, and the number of time frames is determined by the length of the audio (in samples) divided by the hop_length (512 by default).
To understand the meaning of the MFCCs themselves, you should understand the steps it takes to compute them:
Spectrograms, using the Short-Time-Fourier-Transform (STFT)
The Mel spectrogram, from applying Mel scale filterbanks to the STFT
Mel Frequency Cepstral Coefficients, from applying the Discrete Cosine Transform (DCT) to the mel spectrogram.
A good written explainer is Haytham Fayek: Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between
and a good video explainer is The Sound of AI: Mel-Frequency Cepstral Coefficients Explained Easily.
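The three steps above can be sketched with plain NumPy/SciPy. This is a simplified stand-in for what librosa does internally; the windowing, filterbank construction, and parameter values here are illustrative, not librosa's exact defaults:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def simple_mfcc(y, sr, n_fft=2048, hop_length=512, n_mels=128, n_mfcc=40):
    # 1) STFT: slice into overlapping frames, apply a Hann window, take the FFT
    n_frames = 1 + (len(y) - n_fft) // hop_length
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop_length : i * hop_length + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (n_frames, n_fft//2 + 1)

    # 2) Mel spectrogram: triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[m - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    mel_spec = power @ fbank.T                              # (n_frames, n_mels)

    # 3) MFCCs: log compression followed by a DCT; keep the first n_mfcc coefficients
    return dct(np.log(mel_spec + 1e-10), type=2, axis=1,
               norm="ortho")[:, :n_mfcc].T                  # (n_mfcc, n_frames)

y = np.random.default_rng(0).standard_normal(22050)  # 1 s of noise at 22.05 kHz
mfccs = simple_mfcc(y, sr=22050)
print(mfccs.shape)  # (40, 40): 40 coefficients x 40 frames
```

Note that librosa pads the signal by default (center=True), so its frame count comes out to 1 + len(y) // hop_length rather than this sketch's un-padded count.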

Related

How can we divide our long-time domain signal into equal segments and then apply Wavelet transform?

I have a time-domain signal with a sample size of 80000. I want to divide these samples into segments of equal size and apply a wavelet transform to each of them.
How can I do this? Please guide me.
Thank you
One way to segment your original data is simply to use numpy's reshape function.
Assuming that you want to reshape your data into 2000 samples long segments:
import numpy as np
original_time_series = np.random.random(80000)
window_size = 2000
# reshape to (n_segments, window_size) so each row is one 2000-sample segment
reshaped_time_series = original_time_series.reshape((-1, window_size))
Of course, you will need to ensure that the total number of samples in your time series is a multiple of the window_size. Otherwise, you can trim your input time series to match this requirement.
You can then apply your wavelet transform to each and every segment in your reshaped array.
The previous answer assumes that you want non-overlapping segments. Depending on what you are trying to achieve, you may prefer using a striding - or sliding - window (e.g. with a 50% overlap). This question is already covered in detail here.
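For the overlapping case, numpy's sliding_window_view avoids manual index arithmetic. A minimal sketch with a 50% overlap (the 80000-sample signal and 2000-sample window match the question; the hop size is an assumption):

```python
import numpy as np

signal = np.random.random(80000)
window_size = 2000
hop = window_size // 2        # 50% overlap between consecutive windows

# every possible window, then keep one per hop; this is a view, so no data is copied
windows = np.lib.stride_tricks.sliding_window_view(signal, window_size)[::hop]
print(windows.shape)  # (79, 2000)
```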

Reducing one frequency in song

How would I take a song input and output the same song without certain frequencies?
Based on my research so far, the song should be broken down into chunks, FFT'd, the target frequencies reduced, inverse-FFT'd, and the chunks stitched back together. However, I am unsure if this is the right approach to take, and if so, how I would convert from the audio to FFT input (what seems to be a vector matrix), how to reduce the target frequencies, how to convert back from the FFT output to audio, and how to restitch.
For background, my grandpa loves music. However, recently he cannot listen to it, as he has become hypersensitive to certain frequencies within songs. I'm a high school student who has some background coding, and am just getting into algorithmic work and thus have very little experience using these algorithms. Please excuse me if these are basic questions; any pointers would be helpful.
EDIT: So far, I've understood the basic premise of the FFT (through basic 3blue1brown YouTube videos and the like) and that it is available through scipy/numpy, and figured out how to convert from YouTube to 0.25-second chunks in WAV format.
Your approach is right.
Concerning subquestions:
from the audio to FFT input - assign the audio samples to the real part of a complex signal; the imaginary part is zero
how to reduce the target frequencies - multiply the FFT results near the target frequency by a smooth function (to diminish artifacts) that becomes zero at that frequency and reaches 1.0 some samples away
how to convert back - just take the inverse FFT (don't forget the scale multiplier, e.g. 1/N) and copy the real part into the audio channel
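A minimal sketch of those steps with NumPy (the synthetic two-tone signal, the 3 kHz target, and the Gaussian notch width are illustrative assumptions; np.fft.irfft handles the 1/N scaling for you):

```python
import numpy as np

fs = 44100                                  # sample rate, Hz
t = np.arange(fs) / fs                      # one second of audio
audio = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

spectrum = np.fft.rfft(audio)               # audio is real, so rfft suffices
freqs = np.fft.rfftfreq(len(audio), d=1 / fs)

# smooth Gaussian notch: 0 at the target frequency, approaching 1 away from it
target, width = 3000.0, 50.0
mask = 1.0 - np.exp(-(((freqs - target) / width) ** 2))
cleaned = np.fft.irfft(spectrum * mask, n=len(audio))
```

The same mask-and-invert step would be applied to each chunk of the real song in turn.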
Also consider using simpler digital filtering - a band-stop or notch filter.
Two arbitrarily found examples: example1, example2
(calculating the parameters perhaps requires some understanding of DSP)
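A band-stop/notch sketch using SciPy (the synthetic two-tone signal, the 3 kHz target, and Q=30 are illustrative assumptions):

```python
import numpy as np
from scipy import signal

fs = 44100
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

# second-order IIR notch centred on the frequency to remove; Q controls the width
b, a = signal.iirnotch(w0=3000, Q=30, fs=fs)
# filtfilt runs the filter forwards and backwards, so the result has no phase shift
filtered = signal.filtfilt(b, a, audio)
```

Unlike the chunked FFT approach, an IIR filter processes the whole song in one pass, so there is nothing to restitch.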

How to quantitatively measure the diversity of a set of images

I'm trying to measure the diversity of a set of images. I'm defining diversity as a quantitative measure of the overall amount of difference in a set of images, so a set of identical images has a diversity of 0.
So far, the approach I thought of is to take the average intensity of every pixel across the set, which gives an "average" image for the set. Then use the "average" image to calculate the standard deviation of the intensity of every pixel, creating a matrix of standard-deviation values. Then I can take the matrix norm of that standard-deviation matrix - larger norms imply more diversity.
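That approach is straightforward to express in NumPy; a minimal sketch (the random 10-image stack is a hypothetical placeholder for the actual set):

```python
import numpy as np

images = np.random.rand(10, 64, 64)        # hypothetical stack of 10 greyscale 64x64 images
per_pixel_std = images.std(axis=0)         # standard deviation of each pixel across the set
diversity = np.linalg.norm(per_pixel_std)  # Frobenius norm; identical images give 0
```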
Another post (linked below) suggests that, to measure how close an image is to a set of images, one can create a classifier and see with what tolerance value the new image can be accepted. This measures how closely one image matches a set of images, not the diversity of the set (unless it's performed many times, but I'm not sure how that would work).
Is there a better way of measuring the diversity of a set of images than just by taking the matrix norm of the standard deviation of every pixel? Any info is appreciated. Thank you!
Posts referenced:
Measuring how a new sample contributes to the diversity of a dataset
Clustering of images to evaluate diversity (Weka?)

Maximum frequency present in an image in MATLAB

I guess there are various forms of this question present here on Stack Overflow, but I was unable to understand how to solve my problem.
I have an image and I want to find the frequency content of the image.
img = imread('test.tif');
img = rgb2gray(img);
[N M] = size(img);
%% Compute power spectrum
imgf = fftshift(fft2(img));
imgfp = (abs(imgf)/(N*M)).^2;
I know I have to use the FFT for this purpose. But I was wondering if I can express the maximum frequency in the image as a particular value, say 'x cycles/mm' or 'x cycles/inch'.
What would be the best way to do this?
Thank you.
The FFT returns data in an array, where each array element is somewhat related to cycles per total data width (or height, etc.). So you could divide each FFT bin number by the image size in some dimensional unit (say "inches") to get cycles per unit dimension (say cycles per inch).
Note that except for some very specific, narrowly specified types of images (say, constant-amplitude sinusoidal gradients that are exactly periodic within the aperture), typical image content will be spattered across the entire frequency range of the FFT result. So you will likely have to set some non-zero threshold for frequency content before you can find a meaningful "maximum" frequency.
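Building on that answer, here is a sketch in Python/NumPy rather than MATLAB (the 300 dpi scan resolution and the 1e-3 threshold are assumptions; the same arithmetic works with fft2 in MATLAB):

```python
import numpy as np

N = 512
dpi = 300                       # hypothetical scan resolution, dots per inch
img = np.random.rand(N, N)
img = img - img.mean()          # remove the DC term so it doesn't dominate the threshold

spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2 / (N * N)

# fftfreq with d = 1/dpi gives each bin's frequency directly in cycles per inch
freqs = np.fft.fftshift(np.fft.fftfreq(N, d=1 / dpi))
fy, fx = np.meshgrid(freqs, freqs, indexing="ij")
radial = np.hypot(fx, fy)       # radial frequency of each bin, cycles per inch

threshold = spectrum.max() * 1e-3
max_freq = radial[spectrum > threshold].max()
```

For a noise image essentially everything clears the threshold, so max_freq lands near the Nyquist corner; for a real photograph the choice of threshold is what determines the answer.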

Anti-aliasing: Preferred ways of determing maximum frequency?

I've been reading up a bit on anti-aliasing and it seems to make sense, but there is one thing I'm not too sure of. How exactly do you find the maximum frequency of a signal (in the context of graphics).
I realize there's more than one case so I assume there is more than one answer. But first let me state a simple algorithm that I think would represent maximum frequency so someone can tell me if I'm conceptualizing it the wrong way.
Let's say it's for a 1-dimensional, finite, greyscale image (in pixels). Am I correct in assuming you could simply scan the entire pixel line (in the spatial domain) looking for the minimum oscillation, and the inverse of that smallest oscillation would be the maximum frequency?
Example values: {23,26,28,22,48,49,51,49}
Frequency : set it pertains to
(1/2) = .5 : {28,22}
(1/4) = .25 : {22,48,49,51}
So would .5 be the maximum frequency?
And what would be the ideal way to calculate this for a similar pixel line as the one above?
And on a more theoretical note, what if your sampling input was infinite (more like the real world)? Would a valid process be sort of like:
Predetermine a decent interval for point sampling
Determine max frequency from point sampling
while(2*maxFrequency > pointSamplingInterval)
{
pointSamplingInterval*=2
Redetermine maxFrequency from point sampling (with new interval)
}
I know these algorithms are fraught with inefficiencies, so what are some of the preferred ways? (Not looking for something crazy-optimized, just fundamentally better concepts)
The proper way to approach this is using a Fourier Transform (in practice, an FFT, or Fast Fourier Transform).
The theory works as follows: if you have a set of pixels with color/grayscale values, then we can say that the image is represented by pixels in the "spatial domain"; that is, each individual number specifies the image at a particular spatial location.
However, what we really want is a representation of the image in the "frequency domain". Instead of each individual number specifying each pixel, each number represents the amplitude of a particular frequency in the image as a whole.
The tool which converts from the "spatial domain" to the "frequency domain" is the Fourier Transform. The output of the FT will be a sequence of numbers specifying the relative contribution of different frequencies.
In order to find the maximum frequency, you perform the FT, and look at the amplitudes that you get for the high frequencies - then it is just a matter of searching from the highest frequency down until you hit your "minimum significant amplitude" threshold.
You can code your own FFT, but it is much easier in practice to use a pre-packaged library such as FFTW.
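A sketch of that thresholded search, applied to the 8-pixel example line from the question (the 5% "minimum significant amplitude" threshold is an arbitrary choice):

```python
import numpy as np

pixels = np.array([23, 26, 28, 22, 48, 49, 51, 49], dtype=float)

# FFT of the mean-removed line; frequencies are in cycles per pixel, up to Nyquist (0.5)
spectrum = np.abs(np.fft.rfft(pixels - pixels.mean()))
freqs = np.fft.rfftfreq(len(pixels))

# keep only components above the threshold, then take the highest surviving frequency
threshold = 0.05 * spectrum.max()
max_freq = freqs[spectrum > threshold].max()
```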
You don't scan a signal for the highest frequency and then choose your sampling frequency: You choose a sampling frequency that's high enough to capture the things you want to capture, and then you filter the signal to remove high frequencies. You throw away everything higher than half the sampling rate before you sample it.
Am I correct in assuming you could simply scan the entire pixel line (in the spatial domain) looking for the minimum oscillation and the inverse of that smallest oscillation would be the maximum frequency?
If you have a line of pixels, then the sampling is already done. It's too late to apply an antialiasing filter. The highest frequency that could be present is half the sampling frequency ("1/2px", I guess).
And on a more theoretical note, what if your sampling input was infinite (more like the real world)?
Yes, that's when you use the filter. First, you have a continuous function, like a real-life image (infinite sampling rate). Then you filter it to remove everything above fs/2, then you sample it at fs (digitize the image into pixels). Cameras without a good optical low-pass filter don't do enough of this filtering, which is why you get moiré patterns when you photograph bricks, etc.
If you're anti-aliasing computer graphics, you have to think of the ideal continuous mathematical function first, and think through how you would filter it and digitize it to produce the output on the screen.
For instance, if you want to generate a square wave with a computer, you can't just naively alternate between maximum and minimum values. That would be just like sampling a real-life signal without filtering first. The higher harmonics wrap back into the baseband and cause lots of spurious spikes in the spectrum. You need to generate points as if they were sampled from a filtered continuous mathematical function.
I think this article from the O'Reilly site might also be useful to you: http://www.onlamp.com/pub/a/python/2001/01/31/numerically.html . It refers to frequency analysis of sound files, but it gives you the idea.
I think what you need is an application of Fourier analysis (http://en.wikipedia.org/wiki/Fourier_analysis). I've studied this but never used it, so take it with a pinch of salt, but I believe that if you apply it correctly to your set of numbers, you will get a set of frequencies which are components of the series, and you can then pick off the highest one.
I can't point you at a piece of code that does this, but I'm sure it is out there somewhere.
