Perceptual similarity between two audio sequences - algorithm

I would like to get some sort of distance measure between two pieces of audio. For example, I want to compare the sound of an animal to the sound of a human mimicking that animal, and then return a score of how similar the sounds were.
It seems like a difficult problem. What would be the best way to approach it? I was thinking to extract a couple of features from the audio signals and then do a Euclidian distance or cosine similarity (or something like that) on those features. What kind of features would be easy to extract and useful to determine the perceptual difference between sounds?
(I saw somewhere that Shazam uses hashing, but that's a different problem because there the two pieces of audio being compared are fundamentally the same, but one has more noise. Here, the two pieces of audio are not the same, they are just perceptually similar.)

The process for comparing a set of sounds for similarities is called Content Based Audio Indexing, Retrieval, and Fingerprinting in computer science research.
One method of doing this is to:
Run several bits of signal processing on each audio file to extract features, such as pitch over time, frequency spectrum, autocorrelation, dynamic range, transients, etc.
Put all the features for each audio file into a multi-dimensional array and dump each multi-dimensional array into a database
Use optimization techniques (such as gradient descent) to find the best match for a given audio file in your database of multi-dimensional data.
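The extract-features-then-match recipe above can be sketched in a few lines. This is only an illustration: the feature values, the clip names, and the helper names (`cosine_similarity`, `euclidean_distance`, `best_match`) are made up, and a plain linear scan stands in for the optimization step mentioned above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as "no similarity"
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(query, database):
    """Return (name, similarity) of the stored vector most similar to `query`."""
    return max(
        ((name, cosine_similarity(query, vec)) for name, vec in database.items()),
        key=lambda item: item[1],
    )

# Hypothetical, already-scaled features: [mean pitch, spectral centroid, dynamic range]
database = {
    "dog_bark": [0.30, 0.55, 0.80],
    "cat_meow": [0.70, 0.75, 0.40],
}
human_imitation = [0.35, 0.50, 0.70]

name, score = best_match(human_imitation, database)
print(name, round(score, 3))
```

In practice the features would need to be scaled to comparable ranges first, otherwise one large-valued feature dominates either metric.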
The trick to making this work well is which features to pick. Doing this automatically and getting good results can be tricky. The guys at Pandora do this really well, and in my opinion they have the best similarity matching around. They encode their vectors by hand though, by having people listen to music and rate them in many different ways. See their Music Genome Project and List of Music Genome Project attributes for more info.
For automatic distance measurements, there are several projects that do this kind of thing, including Marsyas, MusicBrainz, and The Echo Nest.
The Echo Nest has one of the simplest APIs I've seen in this space. Very easy to get started.

I'd suggest looking into spectrum analysis. Whilst this isn't as straightforward as you're most likely wanting, I'd expect that decomposing the audio into its underlying frequencies would provide some very useful data to analyse. Check out this link

Your first step will definitely be taking a Fourier Transform (FT) of the sound waves. If you perform an FT on the data with respect to frequency over time [1], you'll be able to compare how often certain key frequencies are hit over the course of the noise.
Perhaps you could also subtract one wave from the other, to get a sort of stepwise difference function. Assuming the mock-noise follows the same frequency and pitch trends [2] as the original noise, you could calculate the line of best fit to the points of the difference function. Comparing the best fit line against a line of best fit taken of the original sound wave, you could average out a trend line to use as the basis of comparison. Granted, this would be a very loose comparison method.
- [1] Hz/ms, perhaps? I'm not familiar with the unit magnitude being worked with here, I generally work in the femto- to nano- range.
- [2] So long as ∀ΔT, ΔPitch/ΔT and ΔFrequency/ΔT are within some tolerance x.


Algorithm suggestion: comparing sound clips

(Not sure if this is the right place for this question)
We are analyzing thousands of sound clips of people talking in an attempt to find patterns in the pitch, syllable rate, etc. in order to come up with a signature database to match new sound bites to emotions.
While I am familiar with some AI algorithms (Bayes, for instance) I'm curious if anyone has any ideas on the types of algorithms we could employ.
Overall concept (figure short 2-5 second .wav clips):
soundClip1 -> 'anger'
soundClip2 -> 'happy'
soundClip3 -> 'sad'
emotion = predict(newSoundClip)
Given a new sound clip, we would like to do something similar to Shazam, except returning a probability that the clip represents a particular emotion.
Any suggestions would be appreciated!
Try to normalize the clips in terms of their amplitude and frequency to make them comparable.
Then measure amplitude and spectral properties like variance, autocorrelation, number of minima/maxima, etc.
These measurements allow you to view each clip as a vector in an n-dimensional space. You can use cluster analysis methods to find neighboring clips. Principal component analysis (PCA) might help to find more or less meaningful property dimensions.
It takes a lot of reading pattern recognition, signal processing and cluster analysis texts to get to know what is possible.
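As a rough sketch of the normalize, measure, and compare idea above: each clip is reduced to a tiny feature vector and a new clip gets the label of its nearest neighbour. The three features, the helper names, and the synthetic "clips" (plain sine tones standing in for slow and fast speech) are all assumptions for illustration, not a tested emotion model.

```python
import math

def normalize(samples):
    """Remove the DC offset and scale so the peak absolute amplitude is 1."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    peak = max(abs(s) for s in centered)
    return centered if peak == 0 else [s / peak for s in centered]

def features(samples):
    """Map a clip to [variance, lag-1 autocorrelation, local extrema per sample]."""
    x = normalize(samples)
    n = len(x)
    energy = sum(s * s for s in x)
    variance = energy / n  # the mean is zero after normalize()
    autocorr = sum(x[i] * x[i + 1] for i in range(n - 1)) / energy if energy else 0.0
    extrema = sum(
        1 for i in range(1, n - 1) if (x[i] - x[i - 1]) * (x[i + 1] - x[i]) < 0
    )
    return [variance, autocorr, extrema / n]

def predict(clip, labelled):
    """1-nearest-neighbour: label of the training clip with the closest features."""
    f = features(clip)
    return min(labelled, key=lambda item: math.dist(f, features(item[0])))[1]

def tone(cycles, n=200):
    """Synthetic stand-in for a clip: a sine with `cycles` periods in n samples."""
    return [math.sin(2 * math.pi * cycles * i / n) for i in range(n)]

# Hypothetical training set: slow oscillation labelled 'sad', fast one 'anger'.
labelled = [(tone(3), "sad"), (tone(40), "anger")]
print(predict(tone(5), labelled))
```

With real recordings you would compute such features per short frame and use many labelled clips per emotion; a single nearest neighbour is only the simplest possible classifier.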

How to detect how similar a speech recording is to another speech recording?

I would like to build a program to detect how close a user's audio recording is to another recording in order to correct the user's pronunciation. For example:
I record myself saying "Good morning"
I let a foreign student record "Good morning"
Compare his recording to mine to see if his pronunciation was good enough.
I've seen this in some language learning tools (I believe Rosetta Stone does this), but how is it done? Note we're only dealing with speech (and not, say, music). What are some algorithms or libraries I should look into?
A lot of people seem to be suggesting some sort of edit distance, which IMO is a totally wrong approach for determining the similarity of two speech patterns, especially for patterns as short as OP is implying. The specific algorithms used by speech-recognition in fact are nearly the opposite of what you would like to use here. The problem in speech recognition is resolving many similar pronunciations to the same representation. The problem here is to take a number of slightly different pronunciations and get some kind of meaningful distance between them.
I've done quite a bit of this stuff for large scale data science, and while I can't comment on exactly how proprietary programs do it, I can comment on how it's done in academia and provide a solution that is straightforward and will give you the power and flexibility that you want for this approach.
Firstly: I'm assuming that what you have is some chunk of audio without any filtering done on it, just as it would be acquired from a microphone. The first step is to eliminate background noise. There are a number of different methods for this, but I'm going to assume that what you want is something that will work well without being incredibly difficult to implement.
Filter the audio using scipy's filtering module (scipy.signal). There are a lot of frequencies that microphones pick up that are simply not useful for categorizing speech. I would suggest either a Bessel or a Butterworth filter to ensure that your waveform is preserved through filtering. The frequencies that matter most for everyday speech are generally between 800 and 2000 Hz (reference), so a reasonable passband would be something like 300 to 4000 Hz, just to make sure you don't lose anything.
Look for the least active portion of speech and assume that is a reasonable representation of background noise. At this point you're going to want to run a series of Fourier transforms along your data (or generate a spectrogram) and find the part of your speech recording that has the lowest average frequency response. Once you have that snapshot, you should subtract it from all other points in your audio sample.
At this point you should have an audio file that is mostly just your user's speech and should be ready to be compared to another file that has gone through this process. Now, we want to actually clip the sound and compare this clip to some master clip.
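The noise-estimate step above (find the quietest stretch, subtract its spectrum everywhere) might look roughly like this. It is a minimal sketch under stated assumptions: a naive DFT replaces a real FFT library, frames do not overlap, only magnitudes are handled, and the function names and the toy recording are invented for the example.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the non-negative frequency bins of one frame (naive DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(samples, frame_size):
    """Split the signal into non-overlapping frames and transform each one."""
    return [
        dft_magnitudes(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]

def subtract_noise_floor(spec):
    """Treat the quietest frame as the noise estimate and subtract it everywhere."""
    noise = min(spec, key=sum)
    return [[max(m - nm, 0.0) for m, nm in zip(frame, noise)] for frame in spec]

# Toy recording: constant background in both frames, a "voice" tone only in frame 2.
frame_size = 16
background = [0.05 * (-1) ** t for t in range(2 * frame_size)]
voice = [0.0] * frame_size + [
    math.sin(2 * math.pi * 2 * t / frame_size) for t in range(frame_size)
]
recording = [b + v for b, v in zip(background, voice)]

cleaned = subtract_noise_floor(spectrogram(recording, frame_size))
print([round(m, 2) for m in cleaned[1]])
```

After subtraction the background-only frame is all zeros and the second frame keeps only the tone's bin, which is the effect the step above is after.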
Secondly: You're going to want to come up with a distance metric between two speech patterns. There are a number of ways to do this, but I'm going to assume we have the output of part one and some master file that has been through similar processing.
Generate a spectrogram of the audio file in question (example). The output from this is ultimately going to be an image that can be represented as a 2-d array of frequency response values. A spectrogram is essentially a Fourier transform over time where the colour corresponds to intensity.
Use OpenCV (has Python bindings, example) to run blob detection on your spectrogram. This is going to look for the big colorful blob in the middle of your spectrogram and give you some limits on it. What this should do is return a significantly sparser version of the original 2-d array that solely represents the speech in question (with the assumption that your audio file will have some trailing stuff on the front and back ends of the recording).
Normalize the two blobs to account for differences in speech speed. Everyone talks at a different speed, and as such your blobs will probably have different sizes along the x-axis (time). Left alone, this makes your algorithm implicitly check the speed of speech, which you probably don't want. This step isn't needed if you also want to make sure that they speak at the same speed as the master copy, but I would suggest it. Basically you want to stretch out the shorter version by multiplying its time axis by some constant that's just the ratio of the lengths of your two blobs.
You should also normalize the two blobs based on maximum and minimum intensity to account for people who talk at different volumes. Again, this is up to your discretion, but to fix this you should find similar ratios for the total span of intensities that you have as well as the two recordings' max intensities, and make sure that these two values match up between your 2-d arrays.
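The two normalization steps above could be sketched as follows, assuming each blob is stored as a list of time frames, each frame being a list of intensities per frequency bin. Linear interpolation along time and min-max scaling are one simple choice among several, and the function names are hypothetical.

```python
def stretch_time(spec, target_len):
    """Linearly interpolate a spectrogram (list of frames) to `target_len` frames."""
    if target_len == 1 or len(spec) == 1:
        return [list(spec[0]) for _ in range(target_len)]
    scale = (len(spec) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale                  # fractional position in the source
        lo = min(int(pos), len(spec) - 1)
        hi = min(lo + 1, len(spec) - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b for a, b in zip(spec[lo], spec[hi])])
    return out

def normalize_intensity(spec):
    """Min-max scale all intensities into [0, 1] so loudness differences cancel."""
    flat = [v for frame in spec for v in frame]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    return [[(v - lo) / span if span else 0.0 for v in frame] for frame in spec]

# Stretch the shorter blob to the length of the longer one, then rescale both.
master = [[2.0, 4.0], [6.0, 10.0], [2.0, 4.0]]
attempt = [[1.0, 2.0], [1.0, 2.0]]
attempt_stretched = stretch_time(attempt, len(master))
print(normalize_intensity(master))
print(normalize_intensity(attempt_stretched))
```

After both steps the two arrays have the same shape and value range, which is what the comparison in the next part needs.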
Third: Now that you have 2-d arrays representing your two speech events, which should in theory contain all of their useful information, it's time to directly compare them. Luckily, comparing two matrices is a well-solved problem and there are a number of ways to move forward.
Personally I would suggest using a metric like Cosine Similarity to determine the difference between your two blobs, but that's not the only solution and while it'll give you a quick validation, you can do better.
You could try subtracting one matrix from the other and get an evaluation of how much difference there is between them, which would probably be more accurate than simple cosine distance.
It might be overkill, but you could assume that there are certain regions of speech that matter more or less for evaluating difference between blobs (it might not matter if someone uses a long i instead of a short i, but a g instead of a k could be a different word entirely). For something like that you'd want to develop a mask for the difference array in the previous step and multiply all your values by that.
Whichever method you choose, you can now simply set some difference threshold and make sure that the difference between the two blobs is below your desired threshold. If it is, the captured speech is similar enough to be correct. Otherwise have them try again.
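To make the comparison options above concrete, here is a small sketch of cosine similarity, a (optionally masked) mean absolute difference, and the final threshold check. The matrices, the mask weights, the 0.15 threshold, and the function names are illustrative assumptions; both arrays are assumed to have the same shape and to be scaled to [0, 1] already.

```python
import math

def flatten(matrix):
    return [v for row in matrix for v in row]

def cosine_similarity(a, b):
    """Cosine similarity of two same-shaped 2-d arrays, treated as long vectors."""
    fa, fb = flatten(a), flatten(b)
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = math.sqrt(sum(x * x for x in fa)) * math.sqrt(sum(y * y for y in fb))
    return dot / norm if norm else 0.0

def masked_difference(a, b, mask=None):
    """Mean absolute difference, optionally weighted cell by cell with `mask`."""
    fa, fb = flatten(a), flatten(b)
    weights = flatten(mask) if mask is not None else [1.0] * len(fa)
    return sum(w * abs(x - y) for w, x, y in zip(weights, fa, fb)) / sum(weights)

def is_close_enough(a, b, threshold=0.15, mask=None):
    """Accept the attempt when the (weighted) difference is under the threshold."""
    return masked_difference(a, b, mask) <= threshold

master = [[0.0, 1.0], [1.0, 0.0]]
attempt = [[0.0, 0.8], [1.0, 0.2]]
mask = [[1.0, 0.0], [1.0, 1.0]]  # hypothetical: ignore the top-right region

print(round(cosine_similarity(master, attempt), 3))
print(round(masked_difference(master, attempt), 3))
print(is_close_enough(master, attempt, mask=mask))
```

The threshold has to be tuned on real recordings; there is no universally right value.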
I hope that's helpful, and again, I can't assure you that this is the exact algorithm that a company uses since that information is hugely proprietary and not open for the public, but I can assure you that methods similar to these are used in the very best papers in academia and that these methods will get you a great balance of accuracy and ease of implementation. Let me know if you have any questions, and good luck with your future data science exploits!
The musicg API has an audio fingerprint generator and scorer, along with source code to show how it's done.
I think it looks for the most similar point in each track, then scores based on how far it can match.
It might look something like
import com.musicg.wave.Wave;
// Fingerprint both files and compare them; getSimilarity() returns a float
float score =
    new Wave("voice1.wav")
        .getFingerprintSimilarity(new Wave("voice2.wav"))
        .getSimilarity();
The way bioinformaticians align two biological sequences is as follows: each sequence is represented as a string over an alphabet (for DNA, A/C/G/T, the four nucleotide bases; the details are irrelevant for us), where each letter (here, an entry) represents a particular residue. The quality of an alignment (its score) is calculated from the similarity of each pair of corresponding entries, and the number and length of the blank entries that need to be inserted to produce that alignment.
The same algorithm can be used for pronunciation, with the substitution scores derived from substitution frequencies in a set of alternate pronunciations. Then you can calculate alignment scores to measure the similarity between the two pronunciations in a way that is sensitive to the differences between phonemes. Measures of similarity that can be used here are Levenshtein distance, phoneme error rate, and word error rate.
The minimum number of insertions, deletions and substitutions required for transformation of one sequence into another is the Levenshtein distance.
Phoneme error rate (PER) is the Levenshtein distance between a predicted pronunciation and the reference pronunciation, divided by the number of phonemes in the reference pronunciation.
Word error rate (WER) is the proportion of predicted pronunciations with at least one phoneme error to the total number of pronunciations.
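The three measures defined above are short enough to write out. This sketch uses unit costs for every edit (no phoneme-sensitive substitution scores), and the ARPAbet-style transcriptions of "good morning" are my own illustrative assumption, not the output of any recognizer.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free when equal
            ))
        prev = curr
    return prev[-1]

def phoneme_error_rate(predicted, reference):
    """Levenshtein distance divided by the number of reference phonemes."""
    return levenshtein(predicted, reference) / len(reference)

def word_error_rate(pairs):
    """Share of (predicted, reference) pairs with at least one phoneme error."""
    return sum(1 for p, r in pairs if levenshtein(p, r) > 0) / len(pairs)

# Hypothetical phoneme sequences for "good morning".
reference = ["G", "UH", "D", "M", "AO", "R", "N", "IH", "NG"]
student = ["G", "UH", "D", "M", "AO", "N", "IH", "N"]  # drops R, says N for NG

print(levenshtein(student, reference))
print(round(phoneme_error_rate(student, reference), 3))
print(word_error_rate([(reference, reference), (student, reference)]))
```

Replacing the `(x != y)` term with a lookup into a phoneme substitution table would give the similarity-sensitive scoring described above.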
Source: Did an Internship on this at UW-Madison
A carefully configured Levenshtein distance should do the trick.
I know this question is out of date but...
To solve a similar problem I used the Google Speech Recognition API to check WHAT was said, and visually compared scaled waveforms of volume changes to detect differences in rhythm.
Code & video of the result.
You can use musicg as roy zhang suggested. On Android, just include the musicg jar file in your project and use it. A tested example:
import com.musicg.wave.Wave;
import com.musicg.fingerprint.FingerprintSimilarity;
// somewhere in your code add
// (both paths below point to the same test.wav, so this compares the file with itself;
// change file2 to compare two different recordings)
String file1 = Environment.getExternalStorageDirectory().getAbsolutePath();
file1 += "/test.wav";
String file2 = Environment.getExternalStorageDirectory().getAbsolutePath();
file2 += "/test.wav";
Wave w1 = new Wave(file1);
Wave w2 = new Wave(file2);
FingerprintSimilarity fps = w1.getFingerprintSimilarity(w2);
float score = fps.getScore();
float sim = fps.getSimilarity();
Log.d("score", score+"");
Log.d("similarities", sim+"");
Good luck
If this is only to check the pronunciation (allowing, of course, for different accents), you can do this:
Step 1: Using some speech-to-text tool (say, Dragon Dictation), get the text of what was said.
Step 2: Compare the recognized string or word with the string that was actually meant to be pronounced.
Step 3: If you find any discrepancy between the strings, the word was not pronounced correctly, and you can suggest the correct pronunciation.
You have to look into speech recognition algorithms. I understand that you don't need to translate speech to text (that is done by speech recognition algorithms), however, in your case many algorithms would be the same.
Hidden Markov models (HMMs) would probably be helpful here.
Also look into here:

Detecting wind noise [closed]

I want to develop an app for detecting wind from an audio stream.
I need some expert thoughts here, just some guidelines or links. I know this is not an easy task, but I am planning to put a lot of effort into it.
My plan is to detect some common patterns in the stream. If the values are close to the known patterns of wind noise, I will report that a match is found and can be fairly sure that wind is detected; if the values don't match the patterns, then I guess there is not much wind.
That is my plan at first, but I need to learn how these things are done. Is there some open project already doing this? Or is there someone doing research on this topic?
The reason I am writing on this forum is that I do not know how to google it; the things I found were not what I was looking for. I really do not know how to start developing this kind of algorithm.
EDIT 1 :
I tried to record wind, and when I opened the saved audio file it was just a bunch of numbers :). I do not even know what format I should save this in. Is WAV good enough? Should I use something else? What if I convert the wind noise audio file to MP3: would that help with parsing?
Well, I have many questions, because I do not know where to read more about this kind of topic. I tagged my question with guidelines, so I hope someone will help me.
There must be something detectable, because wind noise is so common. There must be some way to detect it; I only need tips from someone who is familiar with this topic.
I just came across this post. I have recently made a library which can detect wind noise in recordings.
I made a model of wind noise, created a database of examples, and then trained a machine learning algorithm to detect and meter the wind level in a perceptually weighted way.
The C++/C code is here if it is of use to anyone!
The science for your problem is called "pattern classification", especially the subfield of "audio pattern classification". The task is abstracted as classifying a sound recording into two classes (wind and not wind). You seem to have no strong background in signal processing yet, so let me insert one central warning:
Pattern classification is not as easy as it looks at first. Humans excel at pattern classification. Computers don't.
A good first approach is often to compute the correlation of the Fourier transform of your signal and a sample. Don't know how much that will depend on wind speed, however.
You might want to have a look at the bag-of-frames approach, it was used successfully to classify ambient noise.
As @thiton mentioned, this is an example of audio pattern classification.
Main characteristics for wind: it's a shaped (band/hp filtered) white noise with small semi-random fluctuations in amplitude and pitch. At least that's how most synthesizers reproduce it and it sounds quite convincing.
You have to check the spectral content and change in the wavefile, so you'll need FFT. Input format doesn't really matter, but obviously raw material (wav) is better.
Once you have that, you should detect that it's close to some kind of colored noise, then perhaps extract a series of pitch and amplitude values and try a classic pattern classification algorithm on that data set. I think supervised learning could work here.
This is actually a hard problem to solve.
Assuming you have data from only a single microphone: the raw data you get when you open an audio file (the time-domain signal) has some, but not a lot of, information for this kind of processing. You need to go into the frequency domain using FFTs, look at the statistics of the frequency bins, and use those to build a classifier using an SVM or random forests.
With all due respect to @Karoly-Horvath, I would also not use any recordings that have undergone compression, such as MP3. Lossy audio compression tends to distort the higher frequencies, which, as it turns out, are an important feature in detecting wind noise. If possible, get the raw PCM data from a mic.
You also need to make sure your recording is sampled at at least 24 kHz so you have information about the signal up to 12 kHz.
Finally, the wind shape in the frequency domain is not a simple filtered white noise. Its characteristic is that it usually has high energy in the low frequencies (a rumbling type of sound) with shearing and flapping sounds in the high frequencies. The high frequency energy is quite transient, so if your FFT size is too big, you will miss this important feature.
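One of the frequency-bin statistics hinted at above is the share of a frame's energy that sits in the low frequencies. The sketch below computes that single feature for one short frame; the 1 kHz cutoff, the 48-sample frame, and the function names are assumptions for illustration, and a naive DFT stands in for a real FFT. A real detector would feed several such statistics, computed over many short frames, into a trained classifier.

```python
import cmath
import math

def power_spectrum(frame):
    """Power in each non-negative frequency bin of one frame (naive DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]

def low_band_ratio(frame, sample_rate, cutoff_hz):
    """Fraction of the frame's energy at or below `cutoff_hz`."""
    spec = power_spectrum(frame)
    bin_hz = sample_rate / len(frame)  # width of one frequency bin
    total = sum(spec)
    if total == 0:
        return 0.0
    return sum(p for k, p in enumerate(spec) if k * bin_hz <= cutoff_hz) / total

sample_rate = 24000  # the minimum rate suggested above
n = 48               # short frame, so transient high-frequency energy is not smeared
rumble = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(n)]
hiss = [math.sin(2 * math.pi * 6000 * t / sample_rate) for t in range(n)]

print(round(low_band_ratio(rumble, sample_rate, 1000), 3))
print(round(low_band_ratio(hiss, sample_rate, 1000), 3))
```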
If you have data from 2 microphones, this gets a little bit easier. Wind, when recorded, is a local phenomenon. Sure, in recordings, you can hear the rustling of leaves or the sound of chimes caused by the wind. But that is not wind noise and should not be filtered out.
The actual annoying wind noise you hear in a recording is the air hitting the membrane of your microphone. That effect is a local event and can be exploited if you have 2 microphones, because the event is local to each individual mic and is not correlated with the other mic. Of course, where the 2 mics are placed in relation to each other is also important. They have to be reasonably close to each other (say, within 8 inches).
A time-domain correlation can then be used to determine the presence of wind noise. (All the other recorded sounds are correlated with each other because the mics are fairly close to each other, so a high correlation means no wind and a low correlation means wind.) If you are going with this approach, your input audio file need not be uncompressed. A reasonable compression algorithm won't affect this.
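The two-microphone test above boils down to a correlation coefficient between the channels. Here is a minimal sketch with synthetic data: a shared tone plays the role of real sound reaching both mics, independent Gaussian noise plays the role of wind on each membrane, and the 0.5 threshold and the function names are assumptions that would need tuning on real recordings.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length signals."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def looks_like_wind(mic1, mic2, threshold=0.5):
    """Low correlation between closely spaced mics suggests local wind noise."""
    return pearson(mic1, mic2) < threshold

rng = random.Random(0)  # fixed seed so the example is repeatable
shared = [math.sin(2 * math.pi * 5 * t / 200) for t in range(2000)]

calm_1 = [s + 0.05 * rng.gauss(0, 1) for s in shared]
calm_2 = [s + 0.05 * rng.gauss(0, 1) for s in shared]
windy_1 = [s + 3.0 * rng.gauss(0, 1) for s in shared]
windy_2 = [s + 3.0 * rng.gauss(0, 1) for s in shared]

print(looks_like_wind(calm_1, calm_2))
print(looks_like_wind(windy_1, windy_2))
```

In a real stream you would compute this over short sliding windows, since wind gusts come and go.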
I hope this overview helps.

Comparing 2 one dimensional signals

I have the following problem: I have 2 signals over time. They are from the same source so they should be the same. I want to check if they really are.
They may be measured with different sample rates.
The start / end times do not correlate: the measurements do not start and end at the same time.
There may be a time offset between the two signals.
My thoughts go along Fourier transformation, convolution and statistical methods for comparison. Can someone post me some links where I can find more information on how to handle this?
You can easily correct for the phase by just shifting them so their centers of mass line up. (Or alternatively, in the Fourier domain just multiplying by the inverse of the phase of the first coefficient.)
Similarly, if you want to line up the signals given only partial data, you can just cross-correlate them and take the lag with the maximal value (which is again easy to do in the Fourier domain).
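A brute-force version of that cross-correlation step, done directly in the time domain rather than in the Fourier domain, could look like the sketch below. The function name `best_lag` and the toy signals are made up for the example, and both signals are assumed to share one sample rate already.

```python
def best_lag(a, b, max_lag):
    """Lag (in samples) at which b lines up best with a, i.e. b[i] ~ a[i + lag]."""
    def score(lag):
        # Unnormalized cross-correlation over the samples that overlap at this lag.
        return sum(
            a[i + lag] * b[i] for i in range(len(b)) if 0 <= i + lag < len(a)
        )
    return max(range(-max_lag, max_lag + 1), key=score)

# b is a fragment of a that starts 3 samples into a.
a = [0, 0, 0, 1, 2, 3, 2, 1, 0, 0, 0, 0]
b = [1, 2, 3, 2, 1]

print(best_lag(a, b, max_lag=5))
```

This costs O(N * max_lag) operations; the Fourier-domain route computes all lags at once and is the better choice for long signals. Because the score is not normalized, lags with little overlap are penalized, which is usually acceptable when one signal is a fragment of the other.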
That leaves dealing with the sampling rates as the only tricky part of this process. Now if you know a priori what the sample rates are (and if they are related by a rational number), you can just use sinc interpolation/downsampling to rescale them to a common sampling rate.
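For a feel of what bringing both signals to a common rate means, here is a deliberately simplified sketch that uses linear interpolation instead of the sinc interpolation mentioned above. It assumes integer sample rates and applies no anti-aliasing filter before downsampling, so it is only suitable for slowly varying signals; `resample_linear` is a hypothetical helper name.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (a crude stand-in for sinc interpolation)."""
    last = len(samples) - 1
    n_out = last * dst_rate // src_rate + 1  # cover the same time span
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # position in source samples
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out

slow = [0.0, 1.0, 2.0, 3.0, 4.0]     # measured at 4 samples per second
fast = resample_linear(slow, 4, 8)   # brought up to 8 samples per second
print(fast)
print(resample_linear(fast, 8, 4))
```

Once both signals are on the same rate, the alignment and comparison steps can treat them sample by sample.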
If you don't know the sampling rate, you may be a bit screwed. Technically, you can try just brute forcing over all the different rescalings of your signal, but doing this tends to be either slow or else give mediocre results.
As a last suggestion, if you just want to match sounds exactly, you can try using the cepstrum and verifying that the peaks of the two signals agree to within some tolerance. This type of analysis is used a lot in sound and speech recognition, with some refinements to make it operate a bit more locally. It tends to work best with frequency modulated data like speech and music.
Fourier transformation does sound like the right way.
There is too much mathematical information for me to just start explaining here so if you really wanna know what's going on with that (cause I don't think you can just use FT without understanding it) you should use this reference from MIT OpenCourseWare:
Hope it helped.
If you are working with a Linux box and the waveforms that need to be processed have already been recorded, you can try to use the file command to display details about the recording. It gives you the sampling rate when it is invoked on a wav file, though I am not sure what format you are recording in.
If the signals are time-shifted with respect to each other, you may try convolving one with a delta function at increasing delays and then comparing. In MATLAB, conv and the like should be good enough.
These are just 'crude' attempts (almost like hacking at the problem). There may be algorithms that are shift-invariant that may do a better job.
Hope that helps.

FFT Algorithm: What goes IN/OUT? (re: real-time pitch detection)

I am attempting to extract pitch data from an audio stream. From what I can see, it looks as though FFT is the best algorithm to use.
Rather than digging straight into the math, could someone help me understand what this FFT algorithm does?
Please don't say something obvious like 'FFT extracts frequency data from a raw signal.' I need the next level of detail.
What do I pass in, and what do I get out?
Once I understand the interface clearly, this will help me to understand the implementation.
I take it I need to pass in an audio buffer and tell it how many bytes to use for each computation (say the most recent 1024 bytes from this buffer), and maybe I need to specify the range of pitches I want it to detect. Now what is it going to pass back? An array of frequency bins? What are these?
(Edit:) I have found a C++ algorithm to use (if I can only understand it)
Performous extracts pitch from the microphone. Also the code is open source. Here is a description of what the algorithm does, from the guy that coded it.
PCM input (with buffering)
FFT (1024 samples at a time, remove 200 samples from front of the buffer afterwards)
Reassignment method (against the previous FFT that was 200 samples earlier)
Filtering of peaks (this part could be done much better or even left out)
Combining peaks into sets of harmonics (we call the combination a tone)
Temporal filtering of tones (update the set of tones detected earlier instead of simply using the newly detected ones)
Pick the best vocal tone (frequency limits, weighting, could use the harmonic array also but I don't think we do)
But could someone help me understand how this works? What is it that is getting sent from the FFT to the Reassignment method?
The FFT is just one building block in the process, and it may not be the best approach for pitch detection. Read up on pitch detection and decide which algorithm you want to use first (this will depend on what exactly you are trying to measure the pitch of: speech, a single musical instrument, other types of sound, etc.). Get this right before getting into low-level details such as the FFT (some, but not all, pitch detection algorithms use the FFT internally).
There are numerous similar questions on SO already, e.g. Real-time pitch detection using FFT and Pitch detection using FFT for trumpet, and there is good overview material on Wikipedia etc - read these and then decide whether you still want to roll your own FFT-based solution or perhaps use an existing library which is suitable for your particular application.
There is an element of choice here. The most straightforward version to implement takes 2^n complex numbers in (your samples) and gives 2^n complex numbers out, so maybe you should start with that.
In the special case of a DCT (discrete cosine transform), typically what goes in is 2^n samples (often floats), and out go 2^n values, often floats too. A DCT is closely related to an FFT, but it works only on real values and analyses the function in terms of cosines.
It is smart (but commonly skipped) to define a struct to handle the complex values. Traditionally FFTs are done in-place, but it works fine if you don't.
It can be useful to instantiate a class that contains a work buffer for the FFT (if you don't want to do the FFT in-place), and reuse that for several FFTs.
In goes N samples of PCM (purely real complex numbers). Out comes N bins of frequency domain (each bin corresponding to a 1/N slice of sample rate). Each bin is a complex number. Rather than real and imaginary parts, these values should generally be handled in polar format (absolute value and argument). The absolute value tells the amount of sound near the bin center frequency while the argument tells the phase (at which position the sine wave is travelling).
Most often coders only use the magnitude (absolute value) and throw away the phase angle (argument).
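To make the in/out description above concrete, here is a naive discrete Fourier transform written out in full. It computes the same N complex bins an FFT would, just in O(N^2) time; the 8 kHz sample rate, the 64-sample buffer, and the 1 kHz test tone are arbitrary choices for the example.

```python
import cmath
import math

def dft(samples):
    """N real (or complex) samples in, N complex frequency bins out."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

sample_rate = 8000
n = 64
samples = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]

bins = dft(samples)

# For real input only the first N/2 bins are independent; the rest mirror them.
magnitudes = [abs(c) for c in bins[: n // 2]]       # how much of each frequency
phases = [cmath.phase(c) for c in bins[: n // 2]]   # where in its cycle it is

peak_bin = max(range(n // 2), key=lambda k: magnitudes[k])
peak_hz = peak_bin * sample_rate / n                # bin k is centred on k * rate / N
print(peak_bin, peak_hz)
```

Here each bin is 8000 / 64 = 125 Hz wide, so the 1 kHz tone lands exactly in bin 8. A tone that falls between two bin centres spreads its energy over neighbouring bins, which is one reason picking the largest bin is only a coarse pitch estimate.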
