Difference between a mel-spectrogram and MFCCs

I'm using the librosa library to convert music segments into mel-spectrograms to use as inputs for my neural network, as shown in the docs here.
How is this different from MFCCs, if at all? Are there any advantages or disadvantages to using either?

To get MFCCs, compute the DCT on the mel-spectrogram; the mel-spectrogram is usually log-scaled first.
MFCCs are a very compact representation, often using just 13-20 coefficients instead of the 32-64 bands of a mel-spectrogram. MFCCs are also somewhat decorrelated, which can be beneficial with linear models like Gaussian Mixture Models. With lots of data and strong classifiers like Convolutional Neural Networks, the mel-spectrogram can often perform better.

I suppose jonnor's answer is not exactly correct. There are two steps:
1. Take logs of Mel spectrogram.
2. Compute DCT on logs.
Moreover, taking logs seems to be "the main part" for training NNs: https://qr.ae/TWtPLD

A key difference is that the mel-spectrogram has the semantics of a spectrum, whereas an MFCC is, in a sense, a 'spectrum of a spectrum'. The real question is thus: what is the purpose of applying the DCT to the mel-spectrogram? That question has good answers here and there.
Note that in the meantime librosa also has an mfcc function. And looking at its implementation basically confirms that it is (as sketched below):
1. calling melspectrogram,
2. converting its output to logs (via power_to_db),
3. taking the DCT of the frequencies, as if they were a signal,
4. truncating the new 'spectrum of a spectrum' after the first n_mfcc coefficients.
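To make that concrete, here is a minimal sketch of the pipeline (the test tone and the choices n_mels=64 and n_mfcc=13 are illustrative, not anything the answers above prescribe):

```python
import numpy as np
import librosa
import scipy.fftpack

sr = 22050
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s test tone

S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_S = librosa.power_to_db(S)                      # step 1: take logs
mfcc = scipy.fftpack.dct(log_S, axis=0, norm='ortho')[:13]  # step 2: DCT, then truncate

# Should match librosa's built-in mfcc when fed the same log mel-spectrogram
mfcc_ref = librosa.feature.mfcc(S=log_S, n_mfcc=13)
print(np.allclose(mfcc, mfcc_ref))  # True
```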

Multi-channel Lattice Recursive Least Squares

I'm trying to implement multi-channel lattice RLS, i.e. the recursive least squares algorithm which performs noise cancellation with multiple inputs, but a single 'desired output'.
I have the basic RLS algorithm working with multiple components, but it's too inefficient and memory intensive for my purpose.
Wikipedia has an excellent example of lattice RLS, which works great.
https://en.wikipedia.org/wiki/Recursive_least_squares_filter
However, the sources it cites do not go into much detail on how to extend this to the multi-channel case, and re-doing the full derivation is a bit beyond me.
Does anyone know a good source which describes or implements this algorithm in the multi-channel case? Many thanks.
Use separate parallel adaptive filters, one for each noise reference, and combine these outputs to subtract from your noisy signal. LMS usually works best, but RLS is fine. Problems arise if any of the noise references are heavily correlated with the desired signal.
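Not lattice RLS, but as a rough sketch of the parallel-filter idea, here is a minimal multi-channel NLMS (the function name, tap count, and step size are made up for illustration):

```python
import numpy as np

def parallel_nlms(noise_refs, desired, n_taps=32, mu=0.5, eps=1e-8):
    """One NLMS filter per noise reference; their outputs are summed
    and subtracted from the noisy 'desired' signal."""
    n_ch, n_samples = noise_refs.shape
    w = np.zeros((n_ch, n_taps))        # one weight vector per channel
    e = np.zeros(n_samples)             # error = cleaned signal
    for n in range(n_taps, n_samples):
        x = noise_refs[:, n - n_taps:n][:, ::-1]  # tap delay lines, (n_ch, n_taps)
        y = np.einsum('ct,ct->', w, x)            # combined noise estimate
        e[n] = desired[n] - y
        for c in range(n_ch):                     # per-channel NLMS update
            w[c] += mu * e[n] * x[c] / (eps + x[c] @ x[c])
    return e, w
```

Swapping the per-channel update for RLS is straightforward but costs O(n_taps^2) per channel per sample, which is exactly the inefficiency the lattice structure is meant to avoid.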

Why should we compute the image mean when we train CNNs?

When I use caffe for image classification, it often computes the image mean. Why is that the case?
Someone said that it can improve the accuracy, but I don't understand why this should be the case.
Refer to the image whitening technique in deep learning. It has been shown to improve accuracy, but it is not widely used.
To understand why it helps, refer to the idea of normalizing data before applying a machine learning method, which helps to keep the data in the same range. There is also another method now used in CNNs: batch normalization.
Neural networks (including CNNs) are models with thousands of parameters which we try to optimize with gradient descent. Those models are able to fit a lot of different functions by having a non-linearity φ at their nodes. Without a non-linear activation function, the network collapses to a linear function in total. This means we need the non-linearity for most interesting problems.
Common choices for φ are the logistic function, tanh, or ReLU. All of them have their most interesting region around 0. This is where the gradient is either big enough to learn quickly, or where there is any non-linearity at all in the case of ReLU. Weight initialization schemes like Glorot initialization try to make the network start at a good point for the optimization. Other techniques like batch normalization also keep the mean of a node's input around 0.
So you compute (and subtract) the mean of the image so that the first computing nodes get data which "behaves well". It has a mean of 0 and thus the intuition is that this helps the optimization process.
In theory, a network should be able to learn to "subtract" the mean by itself, so if you train long enough this should not matter too much. However, depending on the activation function, "long enough" can be a long time.
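A minimal sketch of the preprocessing itself (the shapes and the random stand-in data are illustrative; caffe's compute_image_mean tool does essentially this for you):

```python
import numpy as np

# Dummy stand-in for a training set of images, shape (N, H, W, C)
train_images = np.random.rand(1000, 32, 32, 3).astype(np.float32)

mean_image = train_images.mean(axis=0)       # per-pixel mean, (H, W, C)
train_centered = train_images - mean_image   # now roughly zero-mean

# At test time, subtract the SAME training mean:
# test_centered = test_images - mean_image
```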

Artificial neural network image transformation

I have pairs of images (input-output), but I don't know the transformation going from A (input) to B (output). I want to record image A and get image B. Physically I can change the setup to get A or B, but I want to do it in software.
If I understand correctly, a trained artificial neural network is able to do that: given an input, it can produce the corresponding output. Is that right?
Is there any software/ANN that, after being trained on a number of input-output pairs, will be able to provide the correct output when the input is a new (but similar to the others) image?
Thanks
If you have a relevant amount of image pairs (input/output pairs) and you don't know the transformation between input and output, you could train an ANN on that training set to imitate the unknown transformation. You will only be able to train your ANN well if you have a sufficient number of training image pairs, and that can be nearly impossible when the unknown transformation is complicated.
For example, if the transformation simply increases the intensity values of pixels in the input image by a given value, an ANN will very quickly learn to imitate that behavior. But if the unknown transformation is some complicated convolution, a few serial convolutions, or something more complicated still, it will be very hard, nearly impossible, to train an ANN to imitate it. So a more complex transformation will need a bigger training set and a more complex ANN design.
There are plenty of free opensource ANN libraries implemented in many languages. You could start for example with that tutorial: http://www.codeproject.com/Articles/13091/Artificial-Neural-Networks-made-easy-with-the-FANN
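To illustrate the easy case from this answer, here is a toy sketch with synthetic data (using scikit-learn's MLPRegressor rather than FANN; the +0.2 intensity shift stands in for the unknown transformation):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
A = rng.random((500, 8 * 8))             # 500 flattened 8x8 input images
B = np.clip(A + 0.2, 0.0, 1.0)           # outputs: intensity shifted up

net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
net.fit(A, B)                            # learn the "unknown" transformation
print(net.predict(A[:1]) - A[:1])        # roughly 0.2 everywhere (except near 1.0)
```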
What you are asking is possible in principle -- in theory, an ANN with sufficiently many hidden units can learn an arbitrary function to map inputs to outputs. However, as the comments and other answers have mentioned, there may be many technical issues with your particular problem that could make it impractical. I would classify these problems as (a) mapping complexity, (b) model complexity, (c) scaling complexity, and (d) implementation complexity. They are all somewhat related, but hopefully this is a useful way to break things down.
Mapping complexity
As mentioned by Springfield762, there are many possible functions that map from one image to another image. If the relationship between your input images and your output images is relatively simple -- like increasing the intensity of each pixel by a constant amount -- then an ANN would be able to learn this mapping without much difficulty. There are probably many more transformations that would be similarly easy to learn, such as skewing, flipping, rotating, or translating an image -- basically any affine transformation would be easy to learn. Other, nonlinear transformations could also be feasible, such as squaring the intensity of each pixel.
As a general rule, the more complicated the relationship between your input and output images, the more difficult it will be to get a model to learn this mapping for you.
Model complexity
The more complex the mapping from inputs to outputs, the more complex an ANN model you will need to capture it. Models with many hidden layers have been shown in the past 10 years to perform quite well on tasks that people had previously thought impossible, but often these state-of-the-art models have millions or even billions of parameters and take weeks to train on GPU hardware. A simple model can capture many simple mappings, but if you have a complex input-output map to learn, you'll need a large, complex model.
Scaling complexity
Yves mentioned in the comments that it can be difficult to scale models up to typical image sizes. If your images are relatively small (currently the state of the art is to model images on the order of 100x100 pixels), then you can probably just throw a bunch of raw pixel data at an ANN model and see what happens. But if you're using 6000x4000 images from your shiny Nikon DSLR, it's going to be quite difficult to process those in a reasonable amount of time. You'd be better off compressing your image data somehow (PCA is a common technique) and then trying to learn the mapping in the compressed space.
In addition, larger images will have a larger space of possible mappings between them, so you'll need more of your larger images as training data than you would if you had small images.
Springfield762 also mentioned this: If the mapping between your input and output images is simple, then you'll only need a few examples to learn the mapping successfully. But if you have a complicated mapping, then you'll need much more training data to have a chance at learning the mapping properly.
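A toy sketch of the compress-then-learn idea mentioned above under scaling complexity (PCA plus the simplest possible regressor, a linear least-squares map; the squaring transformation and all sizes are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
inputs = rng.random((200, 64, 64))       # 200 input images
outputs = inputs ** 2                    # toy "unknown" transformation

X = inputs.reshape(len(inputs), -1)      # flatten to (N, 4096)
Y = outputs.reshape(len(outputs), -1)

pca_x, pca_y = PCA(n_components=50), PCA(n_components=50)
Zx = pca_x.fit_transform(X)              # learn the mapping in 50 dims,
Zy = pca_y.fit_transform(Y)              # not 4096

W, *_ = np.linalg.lstsq(Zx, Zy, rcond=None)
pred = pca_y.inverse_transform(Zx @ W).reshape(outputs.shape)
```

A real ANN would replace the linear map W, but the compression step stays the same.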
Implementation complexity
It's unlikely that a tool already exists that would let you just throw image data into an ANN model and have a mapping appear. Most likely you'll need, at a minimum, to implement some code that will pre-process your image data. In addition, if you have lots of large images you'll probably need to write code to handle loading data from disk, etc. (There are a lot of "big data" tools for things like this, but they all require some amount of work to get set up.)
There are many, many open source ANN toolkits out there nowadays. FANN (already mentioned) is a popular one in C++ with bindings in other languages. Caffe is quite popular, and is also implemented in C++ with bindings. There seem to be many toolkits that use Python and Theano or some other GPU acceleration library -- Keras, Lasagne, Hebel, Pylearn2, neon, and Theanets (I wrote this one). Many people use Torch, written in Lua. Matlab has at least one neural network toolbox. I'm less familiar with other ecosystems, but Java seems to have Deeplearning4j, C# has Accord, and even R has darch.
But with any of these neural network toolkits, you're going to have to write some code to load the data, process it into the appropriate input format, construct (or load) a network model, train the model, etc.
The problem you're trying to solve is a canonical classification problem that neural networks can help you solve. You treat the B images as a set of labels that you match to A, and once trained, the neural network will be able to match the B images to new input based on where the network locates new input in a high-dimensional vector space. I assume you'd use some combination of convolutional networks to create your features, and softmax for multinomial classification on the output layer. More here: http://deeplearning4j.org/convolutionalnets.html
Since this was written, there has been a lot of work in the realm of cGANs (conditional generative adversarial networks); please refer to:
https://arxiv.org/pdf/1611.07004.pdf

Bimodal distribution characterization algorithm?

What algorithms can be used to characterize an expected clearly bimodal distribution, say a mixture of 2 normal distributions with well separated peaks, in an array of samples? Something that spits out 2 means, 2 standard deviations, and some sort of robustness estimate, would be the desired result.
I am interested in an algorithm that can be implemented in any programming language (for an embedded controller), not an existing C or Python library or stat package.
Would it be easier if I knew that the two modal means differ by a ratio of approximately 3:1 ±50%, the standard deviations are "small" relative to the peak separation, but the pair of peaks could be anywhere in a 100:1 range?
There are two separate possibilities here. One is that you have a single distribution that is bimodal. The other is that you are observing data from two different distributions. The usual way to estimate the latter is with something called, unsurprisingly, a mixture model.
Your approaches for estimation are a maximum likelihood approach, or Markov chain Monte Carlo methods if you want to take a Bayesian view of the problem. If you state your assumptions in a bit more detail, I'd be willing to help try and figure out what objective function you'd want to maximize.
These types of models can be computationally intensive, so I am not sure you'd want to try and do the whole statistical approach in an embedded controller. A hack might be a better fit. If the peaks are in fact well separated, I think it would be easier to identify the two peaks, split your data between them, and estimate the mean and standard deviation for each distribution independently.
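A sketch of that hack (a 1-D two-means split; just loops and sums, so it ports easily to an embedded controller, and it assumes the peaks really are well separated):

```python
import numpy as np

def split_and_estimate(samples, iters=50):
    """Split well-separated bimodal samples with 1-D 2-means,
    then estimate each mode's mean and standard deviation."""
    c1, c2 = samples.min(), samples.max()   # crude initial centers
    for _ in range(iters):
        mid = 0.5 * (c1 + c2)
        left = samples[samples < mid]
        right = samples[samples >= mid]
        c1, c2 = left.mean(), right.mean()
    return (left.mean(), left.std()), (right.mean(), right.std())
```

The ratio of each cluster's spread to the separation of the two centers could serve as the robustness estimate the question asks for.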

Perceptual similarity between two audio sequences

I would like to get some sort of distance measure between two pieces of audio. For example, I want to compare the sound of an animal to the sound of a human mimicking that animal, and then return a score of how similar the sounds were.
It seems like a difficult problem. What would be the best way to approach it? I was thinking of extracting a couple of features from the audio signals and then doing a Euclidean distance or cosine similarity (or something like that) on those features. What kind of features would be easy to extract and useful to determine the perceptual difference between sounds?
(I saw somewhere that Shazam uses hashing, but that's a different problem because there the two pieces of audio being compared are fundamentally the same, but one has more noise. Here, the two pieces of audio are not the same, they are just perceptually similar.)
The process for comparing a set of sounds for similarities is called Content Based Audio Indexing, Retrieval, and Fingerprinting in computer science research.
One method of doing this is to (a rough sketch follows below):
1. Run several bits of signal processing on each audio file to extract features, such as pitch over time, frequency spectrum, autocorrelation, dynamic range, transients, etc.
2. Put all the features for each audio file into a multi-dimensional array and dump each multi-dimensional array into a database.
3. Use optimization techniques (such as gradient descent) to find the best match for a given audio file in your database of multi-dimensional data.
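A minimal sketch of step 1 plus a pairwise distance using librosa (the particular feature choices and the plain cosine similarity are illustrative, not a recommendation):

```python
import numpy as np
import librosa

def feature_vector(path):
    """A few easy-to-extract features, stacked into one vector."""
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    return np.hstack([mfcc, centroid, zcr])

def similarity(path_a, path_b):
    """Cosine similarity between two audio files' feature vectors."""
    a, b = feature_vector(path_a), feature_vector(path_b)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```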
The trick to making this work well is which features to pick. Doing this automatically and getting good results can be tricky. The guys at Pandora do this really well, and in my opinion they have the best similarity matching around. They encode their vectors by hand though, by having people listen to music and rate them in many different ways. See their Music Genome Project and List of Music Genome Project attributes for more info.
For automatic distance measurements, there are several projects that do stuff like this, including Marsyas, MusicBrainz, and EchoNest.
EchoNest has one of the simplest APIs I've seen in this space. Very easy to get started.
I'd suggest looking into spectrum analysis. Whilst this isn't as straightforward as you might want, I'd expect that decomposing the audio into its underlying frequencies would provide some very useful data to analyse. Check out this link.
Your first step will definitely be taking a Fourier transform (FT) of the sound waves. If you perform an FT on the data with respect to frequency over time[1], you'll be able to compare how often certain key frequencies are hit over the course of the noise.
Perhaps you could also subtract one wave from the other, to get a sort of stepwise difference function. Assuming the mock-noise follows the same frequency and pitch trends[2] as the original noise, you could calculate the line of best fit to the points of the difference function. Comparing that best-fit line against a line of best fit taken from the original sound wave, you could average out a trend line to use as the basis of comparison. Granted, this would be a very loose comparison method.
- 1. Hz/ms, perhaps? I'm not familiar with the unit magnitude being worked with here; I generally work in the femto- to nano- range.
- 2. So long as ∀ΔT, ΔPitch/ΔT and ΔFrequency/ΔT are within some tolerance x.
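A rough sketch of the FT-over-time comparison (STFT magnitudes, differenced; the two test tones are stand-ins for the animal sound and the human mimic):

```python
import numpy as np
from scipy.signal import stft

fs = 22050
t = np.arange(fs) / fs
a = np.sin(2 * np.pi * 440 * t)        # stand-in for the animal sound
b = np.sin(2 * np.pi * 450 * t)        # stand-in for the human mimic

_, _, Za = stft(a, fs=fs, nperseg=1024)
_, _, Zb = stft(b, fs=fs, nperseg=1024)
diff = np.abs(Za) - np.abs(Zb)         # stepwise difference over time
score = np.mean(np.abs(diff))          # crude overall distance
```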
