Drum sound recognition algorithms - algorithm

I am thinking of trying to write a program that automatically generates drum tabs from an audio file containing only the drumming.
I have thought of using an FFT to get the average spectral peaks during a xxxx ms interval and then comparing them to a table containing all the drum parts (snare, toms, bass drum and so on) of that specific drum kit and sound gear.
But I have a feeling that it won't be that easy. Do you have any suggestions on which methods I could use to solve this?
It isn't easy for anything except a trivial signal. Almost all western 'classical' and commercial music features coincident drum sounds.
1: Superposition: The original sources add together in the frequency domain much as they do in the time domain. Each FFT bin contains contributions from all instruments currently being played (and those which are undamped and still decaying, or resonating sympathetically). Unpicking the various sources is hard, and certainly not a matter of simple comparison with a library of spectra.
2: The FFT by its definition windows audio in the time domain and yields the magnitude and phase of the basis function in each bin over that window period. The best you could say is that content appeared in the bin corresponding to a drum sound at some point within the window period. If you were to compute a 1024 point FFT, the window duration would be 23ms at 44.1kHz. To put this into a musical perspective, 16th notes at 120bpm are 125ms apart and 64th notes only 31.3ms apart. You might get away with smaller FFTs.
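The window arithmetic is easy to check. A small sketch (the function names are illustrative, and the beat is assumed to be a quarter note):

```python
def fft_window_ms(n_points: int, sample_rate_hz: float) -> float:
    """Duration of an n-point FFT window in milliseconds."""
    return 1000.0 * n_points / sample_rate_hz


def note_spacing_ms(bpm: float, notes_per_beat: int) -> float:
    """Time between successive notes, assuming the beat is a quarter note."""
    return 60_000.0 / bpm / notes_per_beat


if __name__ == "__main__":
    for n in (256, 512, 1024, 2048):
        print(f"{n:5d}-point FFT at 44.1 kHz: {fft_window_ms(n, 44100):.1f} ms")
    print(f"16th notes at 120 bpm: {note_spacing_ms(120, 4):.2f} ms apart")
    print(f"64th notes at 120 bpm: {note_spacing_ms(120, 16):.2f} ms apart")
```

At 120bpm this gives 125ms between 16th notes and 31.25ms between 64th notes, against roughly 23ms for a 1024-point window.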
3: Percussion instrument signals tend to look a lot like noise - at least at the point where the instrument is hit. That is to say, there will be energy spread across the spectrum and no obviously dominant frequencies. After impact, tuned percussion starts to look more 'tonal'.
You probably need to look at a time-domain approach to accurately detect the onset point (onset detection). From there you could look at time or frequency domain characteristics of the signal to try and deduce the instrument in question. There's probably also a lot you could do with a priori knowledge of the genre of music being played, allowing you to predict the patterns that are likely to be present.
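As a rough illustration of time-domain onset detection, the sketch below flags frames whose short-term energy jumps sharply relative to the previous frame. It is only one simple possibility; the function name and all the thresholds are assumptions chosen for illustration, not tuned values.

```python
import numpy as np


def detect_onsets(signal, sample_rate, frame_ms=5.0, rise_ratio=4.0, min_gap_ms=30.0):
    """Return onset times (seconds) where short-term energy jumps sharply.

    All parameter defaults are illustrative guesses, not tuned values.
    """
    frame = max(1, int(sample_rate * frame_ms / 1000.0))
    n_frames = len(signal) // frame
    if n_frames < 2:
        return []
    frames = np.asarray(signal[: n_frames * frame], dtype=float).reshape(n_frames, frame)
    energy = np.mean(frames ** 2, axis=1)
    floor = 1e-6 * (energy.max() + 1e-12)  # keeps the ratio test sane in silence

    onsets = []
    last = -np.inf
    for i in range(1, n_frames):
        t = i * frame / sample_rate
        jumped = energy[i] > rise_ratio * (energy[i - 1] + floor)
        if jumped and (t - last) * 1000.0 >= min_gap_ms:
            onsets.append(t)
            last = t
    return onsets
```

Once an onset is located, a short window starting at that time can be passed to whatever time- or frequency-domain classifier you try next.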
This is a particular case of the more generalised audio source separation problem. There's been lots of academic activity in this area, and consequently a lot of published papers describing approaches. Look for source separation, music information retrieval, and audio feature detection.


Is GPS inaccuracy consistent over short time spans?

I'm interested in developing a semi-autonomous RC lawnmower.
That is, the operator would decide when to stop, turn, etc., but could request "slightly overlap previous cut" and the mower would automatically do so. (Having operated high-end RC mowers at trade shows, this is the tedious part. Overcoming that, plus the high cost -- which I believe is possible -- would make a commercial success.)
This feature would require accurate horizontal positioning. I have investigated ultrasonic, laser, optical, and GPS. Each has its problems in this application. (I'll resist the temptation to go off on these tangents here.)
So... my question...
I know GPS horizontal accuracy is only 3-4m. Not good enough, but:
I don't need to know where I am on the planet. I only need to know where I am relative to where I was a minute ago.
So, my question is: is the inaccuracy consistent in the short term? If so, I think it would work for me. If it varies wildly by +- 1.5m from one second to the next, then it will not work.
I have tried to find this information but have had no success (possibly because of the ubiquity of other GPS-accuracy discussion), so I appreciate any guidance.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Edit ~~~~~~~~~~~~~~~~~~~~~~
It's looking to me like GPS is not just skewed but granular. I'd be interested in hearing from anyone who can give better insight into this, but for now I'm going to explore other options.
I realized that even though my intended application is "outdoor", this question is technically in the field of "indoor positioning systems" so I am adding that tag.
My latest thinking is to have 3 "intelligent" high-dB ultrasonic (US) speaker units. The mower emits RF requests for a tone from each speaker in rapid sequence, measuring the time it takes to "hear" each unit's response, thereby calculating the distance to each of these fixed points and using trilateration to get its position. If the fixed-point speakers are 300' away from the mower, the mower may have moved several feet between the 1st and 3rd response, so this would have to be allowed for in the software. If it is possible to differentiate 3 different US frequencies, they could be requested/received "simultaneously", though you still run into issues when you're close to one fixed unit and far from another, so some software correction may still be necessary. If we can assume the mower is moving in a straight line, this isn't too complicated.
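For reference, the trilateration step itself is small. The sketch below assumes a flat 2D yard, three non-collinear speakers at known positions, a negligible RF delay, and a fixed speed of sound. The names are made up for illustration, and the correction for mower movement between responses is not handled.

```python
import numpy as np

SPEED_OF_SOUND_M_S = 343.0  # dry air at roughly 20 °C; varies with temperature


def distance_from_delay(delay_s):
    """Range in metres from the one-way acoustic travel time in seconds."""
    return SPEED_OF_SOUND_M_S * delay_s


def trilaterate(beacons, distances):
    """beacons: three (x, y) fixed positions in metres; distances: range to each."""
    (x1, y1), (x2, y2), (x3, y3) = beacons
    d1, d2, d3 = distances
    # Subtracting circle 1 from circles 2 and 3 leaves two linear equations.
    a = np.array([[2 * (x2 - x1), 2 * (y2 - y1)],
                  [2 * (x3 - x1), 2 * (y3 - y1)]], dtype=float)
    b = np.array([d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2,
                  d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2], dtype=float)
    x, y = np.linalg.solve(a, b)
    return float(x), float(y)
```

At 300 feet (about 91 m) a tone takes roughly 0.27 s to arrive, which is where the "mower has moved in the meantime" problem comes from.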
Another variation is the mower does not request the tones. The fixed units send RF "here comes tone from unit A" etc., and the mower unit just monitors both RF info and US tones. This may simplify things somewhat, but it seems it really requires the ability to determine which speaker a tone is coming from.
This seems like the kind of thing you could (and should) measure empirically. Just set a GPS of your liking down in the middle of a field on a clear day and wait an hour. Then come back and see what you find.
Because I'm in a city, I can't run out and do this for you. However, I found a paper entitled iGeoTrans – A novel iOS application for GPS positioning in geosciences.
That includes this figure which duplicates the test I propose. You'll note that both the iPhone4 and Garmin eTrex10 perform pretty poorly versus the accuracy you say you need.
But the authors do some Math Magic™ to reduce the uncertainty in the position, presumably by using some kind of averaging. That gets them to a 3.53m RMSE measure.
If you have real-time differential GPS, you can do better. But this requires relatively expensive hardware and software.
Even aside from the above, you have the potential issue of GPS reflection and multipath error. What if your mower has to go under a deck, or thick trees, or near the wall of a house? These common yard features will likely break the assumptions needed to make a good averaging algorithm work and even frustrate attempts at DGPS by blocking critical signals.
To my mind, this seems like a computer vision problem. And not just because that'll give you more accurate row overlaps... you definitely don't want to run over a dog!
In my opinion a standard GPS is nowhere near accurate enough for this application. A typical consumer-grade receiver that I have used has a position accuracy defined as a CEP of 2.5 metres. This means that for a stationary receiver in a "perfect" sky view environment, over time 50% of the position fixes will lie within a circle with a radius of 2.5 metres. If you look at the position that the receiver reports, it appears to wander at random around the true position, sometimes moving a number of metres away from its true location. When I have monitored the position data from a number of stationary units that I have used, they could appear to be moving at speeds of up to 0.5 metres per second. In your application this would mean that the lawnmower could be out of position by some not insignificant distance (with disastrous consequences for your prized flowerbeds).
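To put numbers on how a stationary receiver wanders, logged fixes can be reduced to a CEP and a per-fix apparent speed. This is a rough sketch, assuming the fixes have already been converted to local east/north metres. The function names are made up for illustration, and the mean fix stands in for the true position, which is usually unknown.

```python
import numpy as np


def cep50(east_m, north_m):
    """Radius (metres) containing 50% of fixes, measured from their mean position."""
    east = np.asarray(east_m, dtype=float)
    north = np.asarray(north_m, dtype=float)
    radial = np.hypot(east - east.mean(), north - north.mean())
    return float(np.median(radial))


def apparent_speeds(east_m, north_m, interval_s=1.0):
    """Apparent speed (m/s) between consecutive fixes of a stationary receiver."""
    steps = np.hypot(np.diff(east_m), np.diff(north_m))
    return steps / interval_s
```

If the apparent speeds stay small compared with the overlap tolerance over a minute or so, the short-term error is behaving like a slow drift; if not, it is not.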
There is a way that this can be done, as has been proved by the tractor manufacturers, who can position seed drills and agricultural sprayers to centimetre-level accuracy. These systems use Differential GPS, where there is a fixed reference station positioned in the neighbourhood of the tractor being controlled. This reference station transmits error corrections to the mobile unit, allowing it to correct its reported position to a high degree of accuracy. Unfortunately this sort of positioning system is very expensive.

Detecting wind noise [closed]

I want to develop an app for detecting wind from an audio stream.
I need some expert thoughts here, just some guidelines or links. I know this is not an easy task, but I am planning to put a lot of effort into it.
My plan is to detect some common patterns in the stream. If the values are close to the known patterns of wind noise, I will report that a match is found; the closer they are, the more sure I can be that wind is detected. If the values don't match the patterns, then I guess there is not much wind.
That is my plan at first, but I need to learn how these things are done. Is there an open project already doing this? Or is there someone doing research on this topic?
The reason I am writing on this forum is that I do not know how to google it; the things I found were not what I was looking for. I really do not know how to start developing this kind of algorithm.
EDIT 1 :
I tried to record some wind, and when I opened the saved audio file it was just a bunch of numbers to me :). I do not even know what format I should save this in. Is WAV good enough? Should I use something else? Would converting the wind noise audio file to mp3 help with parsing?
Well, I have many questions, because I do not know where to read more about this kind of topic. I tagged my question with guidelines, so I hope someone will help me.
There must be something that is detectable, because wind noise is so common. There must be some way to detect it; I only need tips from someone who is familiar with this topic.
I just came across this post. I have recently made a library which can detect wind noise in recordings.
I made a model of wind noise and created a database of examples and then trained a Machine Learning algorithm to detect and meter the wind level in a perceptually weighted way.
The C++/C code is here if it is of use to anyone!
The science for your problem is called "pattern classification", especially the subfield of "audio pattern classification". The task is abstracted as classifying a sound recording into two classes (wind and not wind). You seem to have no strong background in signal processing yet, so let me insert one central warning:
Pattern classification is not as easy as it looks at first. Humans excel at pattern classification. Computers don't.
A good first approach is often to compute the correlation of the Fourier transform of your signal and a sample. Don't know how much that will depend on wind speed, however.
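One way to read that first approach is sketched below: take the magnitude spectrum of a chunk of the recording and of a reference sample, and compute their correlation coefficient. The function names and the FFT size are assumptions for illustration, and a single frame is only a starting point.

```python
import numpy as np


def magnitude_spectrum(signal, n_fft=2048):
    """Magnitude spectrum of the first n_fft samples (zero-padded if shorter)."""
    x = np.zeros(n_fft)
    chunk = np.asarray(signal, dtype=float)[:n_fft]
    x[: len(chunk)] = chunk
    return np.abs(np.fft.rfft(x * np.hanning(n_fft)))


def spectral_correlation(signal, reference, n_fft=2048):
    """Pearson correlation between two magnitude spectra, in [-1, 1]."""
    a = magnitude_spectrum(signal, n_fft)
    b = magnitude_spectrum(reference, n_fft)
    return float(np.corrcoef(a, b)[0, 1])
```

Using magnitudes only makes the comparison insensitive to phase, so the two recordings do not need to be time-aligned.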
You might want to have a look at the bag-of-frames approach, it was used successfully to classify ambient noise.
As @thiton mentioned, this is an example of audio pattern classification.
Main characteristics for wind: it's a shaped (band/hp filtered) white noise with small semi-random fluctuations in amplitude and pitch. At least that's how most synthesizers reproduce it and it sounds quite convincing.
You have to check the spectral content and change in the wavefile, so you'll need FFT. Input format doesn't really matter, but obviously raw material (wav) is better.
Once you have that, you should detect that it's close to some kind of colored noise, and then perhaps extract series of pitch and amplitude values and try a classic pattern classification algorithm on that data set. I think supervised learning could work here.
This is actually a hard problem to solve.
Assume you have data from only a single microphone. The raw data you get when you open an audio file (the time-domain signal) has some, but not a lot of, information for this kind of processing. You need to go into the frequency domain using FFTs, look at the statistics of the frequency bins, and use those to build a classifier using SVMs or Random Forests.
With all due respect to @Karoly-Horvath, I would also not use any recordings that have undergone compression, such as mp3. Audio compression algorithms always distort the higher frequencies, which, as it turns out, are an important feature in detecting wind noise. If possible, get the raw PCM data from a mic.
You also need to make sure your recording is sampled at 24kHz or more, so that you have information about the signal up to 12kHz.
Finally, the shape of wind in the frequency domain is not a simple filtered white noise. Its characteristic is that it usually has high energy in the low frequencies (a rumbling type of sound), with shearing and flapping sounds in the high frequencies. The high-frequency energy is quite transient, so if your FFT size is too big, you will miss this important feature.
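As an example of the kind of per-frame statistic that could feed such a classifier, the sketch below computes the fraction of spectral energy below a split frequency, using short frames so brief high-frequency bursts are not averaged away. The function name, the 200 Hz split and the 256-point FFT are illustrative assumptions, not recommended values.

```python
import numpy as np


def band_energy_ratio(signal, sample_rate, split_hz=200.0, n_fft=256):
    """Per-frame fraction of spectral energy below split_hz.

    split_hz and n_fft are illustrative guesses; short frames are used so that
    brief high-frequency bursts are not averaged away.
    """
    x = np.asarray(signal, dtype=float)
    n_frames = len(x) // n_fft
    frames = x[: n_frames * n_fft].reshape(n_frames, n_fft) * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    low = power[:, freqs < split_hz].sum(axis=1)
    total = power.sum(axis=1) + 1e-12  # avoids division by zero in silence
    return low / total
```

The mean and variance of this ratio over a second or so of audio are the sort of features you could hand to an SVM or Random Forest alongside others.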
If you have data from 2 microphones, then this gets a little bit easier. Wind, when recorded, is a local phenomenon. Sure, in recordings, you can hear the rustling of leaves or the sound of chimes caused by the wind. But that is not wind noise and should not be filtered out.
The actual annoying wind noise you hear in a recording is the air hitting the membrane of your microphone. That effect is a local event, and it can be exploited if you have 2 microphones, because the event is local to each individual mic and is not correlated with the other mic. Of course, where the 2 mics are placed in relation to each other is also important. They have to be reasonably close to each other (say, within 8 inches).
A time-domain correlation can then be used to determine the presence of wind noise. (All the other recorded sounds are correlated with each other because the mics are fairly close together, so a high correlation means no wind and a low correlation means wind.) If you are going with this approach, your input audio file need not be uncompressed. A reasonable compression algorithm won't affect this.
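A minimal sketch of that two-microphone test follows. It computes a zero-lag normalised correlation per frame and turns it into a score. The function name and frame size are assumptions, and a fuller version would probably search over a few samples of lag to allow for the spacing between the mics.

```python
import numpy as np


def wind_likelihood(mic_a, mic_b, frame=1024):
    """Per-frame score in [0, 1]: near 0 when the two channels agree, near 1 when
    they are uncorrelated (as local wind buffeting would be). Silent frames score 0."""
    a = np.asarray(mic_a, dtype=float)
    b = np.asarray(mic_b, dtype=float)
    n_frames = min(len(a), len(b)) // frame
    scores = np.zeros(n_frames)
    for i in range(n_frames):
        fa = a[i * frame:(i + 1) * frame]
        fb = b[i * frame:(i + 1) * frame]
        fa = fa - fa.mean()
        fb = fb - fb.mean()
        denom = np.sqrt(np.sum(fa ** 2) * np.sum(fb ** 2))
        if denom > 0:
            scores[i] = 1.0 - abs(np.sum(fa * fb) / denom)
    return scores
```

A threshold on this score, possibly smoothed over several frames, then gives a wind / no-wind decision.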
I hope this overview helps.

Chord detection algorithms?

I am developing software that depends on musical chords detection. I know some algorithms for pitch detection, with techniques based on cepstral analysis or autocorrelation, but they are mainly focused on monophonic material recognition. But I need to work with some polyphonic recognition, that is, multiple pitches at the same time, like in a chord; does anyone know some good studies or solutions on that matter?
I am currently developing some algorithms based on the FFT, but if anyone has an idea on some algorithms or techniques that I can use, it would be of great help.
This is quite a good Open Source Project:
It detects chords based on a chromagram, which is a good solution: it folds a window of the whole spectrum onto an array of 12 pitch classes with float values. Chords can then be detected by a Hidden Markov Model.
.. should provide you with everything you need. :)
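To make the chromagram idea concrete, here is a small sketch that folds one FFT frame into 12 pitch classes and then picks the best-matching major or minor triad. Note that plain template matching is swapped in for the Hidden Markov Model to keep the example short, and that the function names, frequency limits and equal-temperament tuning (A4 = 440 Hz) are assumptions.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chromagram(signal, sample_rate, n_fft=8192, f_min=55.0, f_max=2000.0):
    """Fold one FFT frame into 12 pitch-class energies (index 0 is C)."""
    x = np.zeros(n_fft)
    chunk = np.asarray(signal, dtype=float)[:n_fft]
    x[: len(chunk)] = chunk
    mag = np.abs(np.fft.rfft(x * np.hanning(n_fft)))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    chroma = np.zeros(12)
    keep = (freqs >= f_min) & (freqs <= f_max)
    # MIDI note number of each bin, assuming equal temperament with A4 = 440 Hz.
    midi = np.round(69 + 12 * np.log2(freqs[keep] / 440.0)).astype(int)
    np.add.at(chroma, midi % 12, mag[keep] ** 2)
    return chroma / (chroma.sum() + 1e-12)


def detect_triad(chroma):
    """Best-matching major or minor triad by dot product with binary templates."""
    best_name, best_score = None, -1.0
    for root in range(12):
        for suffix, intervals in (("", (0, 4, 7)), ("m", (0, 3, 7))):
            template = np.zeros(12)
            template[[(root + i) % 12 for i in intervals]] = 1.0
            score = float(chroma @ template)
            if score > best_score:
                best_name, best_score = NOTE_NAMES[root] + suffix, score
    return best_name
```

On real recordings a single frame is noisy, which is presumably why a temporal model such as an HMM is used on the sequence of chroma vectors.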
The author of Capo, a transcription program for the Mac, has a pretty in-depth blog. The entry "A Note on Auto Tabbing" has some good jumping off points:
I started researching different methods of automatic transcription in mid-2009, because I was curious about how far along this technology was, and if it could be integrated into a future version of Capo.
Each of these automatic transcription algorithms start out with some kind of intermediate representation of the audio data, and then they transfer that into a symbolic form (i.e. note onsets, and durations).
This is where I encountered some computationally expensive spectral representations (The Continuous Wavelet Transform (CWT), Constant Q Transform (CQT), and others.) I implemented all of these spectral transforms so that I could also implement the algorithms presented by the papers I was reading. This would give me an idea of whether they would work in practice.
Capo has some impressive technology. The standout feature is that its main view is not a frequency spectrogram like most other audio programs. It presents the audio like a piano roll, with the notes visible to the naked eye.
(source: supermegaultragroovy.com)
(Note: The hard note bars were drawn by a user. The fuzzy spots underneath are what Capo displays.)
There's significant overlap between chord detection and key detection, and so you may find some of my previous answer to that question useful, as it has a few links to papers and theses. Getting a good polyphonic recogniser is incredibly difficult.
My own viewpoint on this is that applying polyphonic recognition to extract the notes and then trying to detect chords from the notes is the wrong way to go about it. The reason is that it's an ambiguous problem. If you have two complex tones exactly an octave apart then it's impossible to detect whether there are one or two notes playing (unless you have extra context such as knowing the harmonic profile). Every harmonic of C5 is also a harmonic of C4 (and of C3, C2, etc). So if you try a major chord in a polyphonic recogniser then you are likely to get out a whole sequence of notes that are harmonically related to your chord, but not necessarily the notes you played. If you use an autocorrelation-based pitch detection method then you'll see this effect quite clearly.
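The octave ambiguity is easy to demonstrate with ideal harmonic series. In this sketch the helper name and the rounded C4 frequency are illustrative, and real instruments are not perfectly harmonic:

```python
def harmonics(fundamental_hz, count):
    """First `count` harmonic frequencies of an ideal complex tone."""
    return [fundamental_hz * k for k in range(1, count + 1)]


c4 = 261.63            # equal-tempered C4 with A4 = 440 Hz (rounded)
c5 = 2 * c4            # exactly one octave above

c4_partials = set(harmonics(c4, 20))
c5_partials = set(harmonics(c5, 10))

# Every partial of C5 coincides with an even-numbered partial of C4,
# so the spectrum alone cannot show whether C5 is sounding on top of C4.
c5_hidden_in_c4 = c5_partials <= c4_partials
```

Only the relative strengths of the partials can hint at a second note, which is the "harmonic profile" context mentioned above.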
Instead, I think it's better to look for the patterns that are made by certain chord shapes (Major, Minor, 7th, etc).
See my answer to this question:
How can I do real-time pitch detection in .Net?
The reference to this IEEE paper is mainly what you're looking for: http://ieeexplore.ieee.org/Xplore/login.jsp?reload=true&url=/iel5/89/18967/00876309.pdf?arnumber=876309
The harmonics are throwing you off. Plus, humans can find fundamentals in sound even when the fundamental isn't present! Think of reading, but by covering half of the letters. The brain fills in the gaps.
The context of other sounds in the mix, and what came before, is very important to how we perceive notes.
This is a very difficult pattern matching problem, probably suitable for an AI technique such as training neural nets or genetic algorithms.
Basically, at every point in time, you guess the number of notes being played, the notes, the instruments that played the notes, the amplitudes, and the duration of each note. Then you sum the magnitudes of all the harmonics and overtones that all those instruments would generate when played at that volume at that point in their envelope (attack, decay, etc.). Subtract the sum of all those harmonics from the spectrum of your signal, then minimize the difference over all possibilities. Pattern recognition of the thump/squeak/pluck transient noise at the very onset of the note might also be important. Then do some decision analysis to make sure your choices make sense (e.g. a clarinet didn't suddenly change into a trumpet playing another note and back again 80 ms later), to minimize the error probability.
If you can constrain your choices (e.g. only 2 flutes playing only quarter notes, etc.), especially to instruments with very limited overtone energy, it makes the problem a lot easier.
Also http://www.schmittmachine.com/dywapitchtrack.html
The dywapitchtrack library computes the pitch of an audio stream in real time. The pitch is the main frequency of the waveform (the 'note' being played or sung). It is expressed as a float in Hz.
And http://clam-project.org/ may help a little.
This post is a bit old, but I thought I'd add the following paper to the discussion:
Klapuri, Anssi; Multipitch Analysis of Polyphonic Music and Speech Signals Using an Auditory Model; IEEE Transactions on Audio, Speech, and Language Processing, Vol. 16, No. 2, February 2008, p. 255
The paper acts somewhat like a literature review of multipitch analysis and discusses a method based on an auditory model:
(The image is from the paper. I don't know if I have to get permission to post it.)

Perceptual similarity between two audio sequences

I would like to get some sort of distance measure between two pieces of audio. For example, I want to compare the sound of an animal to the sound of a human mimicking that animal, and then return a score of how similar the sounds were.
It seems like a difficult problem. What would be the best way to approach it? I was thinking of extracting a couple of features from the audio signals and then computing a Euclidean distance or cosine similarity (or something like that) on those features. What kind of features would be easy to extract and useful for determining the perceptual difference between sounds?
(I saw somewhere that Shazam uses hashing, but that's a different problem because there the two pieces of audio being compared are fundamentally the same, but one has more noise. Here, the two pieces of audio are not the same, they are just perceptually similar.)
The process for comparing a set of sounds for similarities is called Content Based Audio Indexing, Retrieval, and Fingerprinting in computer science research.
One method of doing this is to:
Run several bits of signal processing on each audio file to extract features, such as pitch over time, frequency spectrum, autocorrelation, dynamic range, transients, etc.
Put all the features for each audio file into a multi-dimensional array and dump each multi-dimensional array into a database
Use optimization techniques (such as gradient descent) to find the best match for a given audio file in your database of multi-dimensional data.
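The first two steps and a simple pairwise score can be sketched as follows. The feature here (time-averaged energy in log-spaced frequency bands) is deliberately crude and chosen only for illustration; the function names and band settings are assumptions, and richer features such as pitch tracks or MFCCs would be used in practice.

```python
import numpy as np


def band_features(signal, sample_rate, n_bands=24, n_fft=1024):
    """Time-averaged energy in n_bands log-spaced frequency bands."""
    x = np.asarray(signal, dtype=float)
    n_frames = len(x) // n_fft
    frames = x[: n_frames * n_fft].reshape(n_frames, n_fft) * np.hanning(n_fft)
    power = (np.abs(np.fft.rfft(frames, axis=1)) ** 2).mean(axis=0)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    edges = np.geomspace(50.0, sample_rate / 2.0, n_bands + 1)
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Because the features are averaged over time, this ignores how the sound evolves, which matters a lot for something like an animal call; it is a baseline, not a perceptual model.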
The trick to making this work well is which features to pick. Doing this automatically and getting good results can be tricky. The guys at Pandora do this really well, and in my opinion they have the best similarity matching around. They encode their vectors by hand though, by having people listen to music and rate them in many different ways. See their Music Genome Project and List of Music Genome Project attributes for more info.
For automatic distance measurements, there are several projects that do stuff like this, including Marsyas, MusicBrainz, and EchoNest.
Echonest has one of the simplest APIs I've seen in this space. Very easy to get started.
I'd suggest looking into spectrum analysis. Whilst this isn't as straightforward as you're most likely wanting, I'd expect that decomposing the audio into its underlying frequencies would provide some very useful data to analyse. Check out this link
Your first step will definitely be taking a Fourier Transform (FT) of the sound waves. If you perform an FT on the data with respect to Frequency over Time [1], you'll be able to compare how often certain key frequencies are hit over the course of the noise.
Perhaps you could also subtract one wave from the other, to get a sort of stepwise difference function. Assuming the mock-noise follows the same frequency and pitch trends [2] as the original noise, you could calculate the line of best fit to the points of the difference function. Comparing the best fit line against a line of best fit taken of the original sound wave, you could average out a trend line to use as the basis of comparison. Granted, this would be a very loose comparison method.
- [1] Hz/ms, perhaps? I'm not familiar with the unit magnitude being worked with here; I generally work in the femto- to nano- range.
- [2] So long as ∀ΔT, ΔPitch/ΔT and ΔFrequency/ΔT are within some tolerance x.
- Edited for formatting, and because I actually forgot to finish writing the full answer.

How to detect the BPM of a song in php [closed]

How can the tempo/BPM of a song be determined programmatically? What algorithms are commonly used, and what considerations must be made?
This is challenging to explain in a single StackOverflow post. In general, the simplest beat-detection algorithms work by locating peaks in sound energy, which is easy to detect. More sophisticated methods use comb filters and other statistical/waveform methods. For a detailed explication including code samples, check this GameDev article out.
The keywords to search for are "Beat Detection", "Beat Tracking" and "Music Information Retrieval". There is lots of information here: http://www.music-ir.org/
There is a (maybe) annual contest called MIREX where different algorithms are tested on their beat detection performance.
That should give you a list of algorithms to test.
A classic algorithm is Beatroot (google it), which is nice and easy to understand. It works like this:
Short-time FFT the music to get a sonogram.
Sum the increases in magnitude over all frequencies for each time step (ignore the decreases). This gives you a 1D time-varying function called the "spectral flux".
Find the peaks using any old peak detection algorithm. These are called "onsets" and correspond to the start of sounds in the music (starts of notes, drum hits, etc).
Construct a histogram of inter-onset-intervals (IOIs). This can be used to find likely tempos.
Initialise a set of "agents" or "hypotheses" for the beat-tracking result. Feed these agents the onsets one at a time in order. Each agent tracks the list of onsets that are also beats, and the current tempo estimate. The agents can either accept the onsets, if they fit closely with their last tracked beat and tempo, ignore them if they are wildly different, or spawn a new agent if they are in-between. Not every beat requires an onset - agents can interpolate.
Each agent is given a score according to how neat its hypothesis is - if all its beat onsets are loud it gets a higher score. If they are all regular it gets a higher score.
The highest scoring agent is the answer.
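The front end of those steps (spectral flux, peak picking, and the inter-onset-interval histogram) can be sketched as below. This is not the Beatroot implementation: the agent-based tracking is left out entirely, and the function names, hop size and thresholds are assumptions for illustration.

```python
import numpy as np


def spectral_flux(signal, sample_rate, n_fft=1024, hop=128):
    """Sum of positive magnitude increases between consecutive STFT frames."""
    x = np.asarray(signal, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    mags = np.array([np.abs(np.fft.rfft(x[i * hop:i * hop + n_fft] * window))
                     for i in range(n_frames)])
    flux = np.maximum(np.diff(mags, axis=0), 0.0).sum(axis=1)
    times = (np.arange(1, n_frames) * hop + n_fft / 2) / sample_rate
    return times, flux


def pick_onsets(times, flux, threshold_ratio=0.3, min_gap_s=0.05):
    """Local maxima of the flux above a fraction of its global maximum."""
    threshold = threshold_ratio * flux.max()
    onsets = []
    for i in range(1, len(flux) - 1):
        is_peak = flux[i] > flux[i - 1] and flux[i] >= flux[i + 1]
        far_enough = not onsets or times[i] - onsets[-1] >= min_gap_s
        if is_peak and flux[i] > threshold and far_enough:
            onsets.append(times[i])
    return np.array(onsets)


def tempo_from_onsets(onset_times, bin_ms=10.0):
    """BPM from the most common inter-onset interval (histogram mode)."""
    if len(onset_times) < 2:
        raise ValueError("need at least two onsets")
    iois = np.diff(onset_times)
    bins = np.round(iois * 1000.0 / bin_ms).astype(int)
    mode_bin = np.bincount(bins).argmax()
    return 60_000.0 / (mode_bin * bin_ms)
```

Taking only the histogram mode is the crudest possible use of the IOIs; the agents exist precisely because real music has onsets off the beat and beats without onsets.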
Downsides to this algorithm in my experience:
The peak-detection is rather ad-hoc and sensitive to threshold parameters and whatnot.
Some music doesn't have obvious onsets on the beats. Obviously it won't work with those.
Difficult to know how to resolve the 60bpm-vs-120bpm issue, especially with live tracking!
Throws away a lot of information by only using a 1D spectral flux. I reckon you can do much better by having a few band-limited spectral fluxes (and maybe one broadband one for drums).
Here is a demo of a live version of this algorithm, showing the spectral flux (black line at the bottom) and onsets (green circles). It's worth considering the fact that the beat is extracted from only the green circles. I've played back the onsets just as clicks, and to be honest I don't think I could hear the beat from them, so in some ways this algorithm is better than people at beat detection. I think the reduction to such a low-dimensional signal is its weak step though.
Annoyingly I did find a very good site with many algorithms and code for beat detection a few years ago. I've totally failed to refind it though.
Edit: Found it!
Here are some great links that should get you started:
Beat extraction involves the identification of cognitive metric structures in music. Very often these do not correspond to physical sound energy - for example, in most music there is a level of syncopation, which means that the "foot-tapping" beat that we perceive does not correspond to the presence of a physical sound. This means that this is a quite different field to onset detection, which is the detection of the physical sounds, and is performed in a different way.
You could try the Aubio library, which is a plain C library offering both onset and beat extraction tools.
There is also the online Echonest API, although this involves uploading an MP3 to a website and retrieving XML, so might not be so suitable..
EDIT: I came across this last night - a very promising looking C/C++ library, although I haven't used it myself. Vamp Plugins
The general area of research you are interested in is called MUSIC INFORMATION RETRIEVAL
There are many different algorithms that do this but they all are fundamentally centered around ONSET DETECTION.
Onset detection measures the start of an event; the event in this case is a note being played. You can look for changes in the weighted Fourier transform (High Frequency Content), or you can look for large changes in spectral content (Spectral Difference). (There are a couple of papers further down that I recommend you look into.) Once you apply an onset detection algorithm, you pick off where the beats are via thresholding.
There are various algorithms that you can use once you've gotten that time localization of the beat. You can turn it into a pulse train (create a signal that is zero for all time and 1 only when your beat happens) then apply a FFT to that and BAM now you have a Frequency of Onsets at the largest peak.
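A sketch of the pulse-train idea: build the 0/1 signal from the onset times, take its FFT, and read off the strongest peak. One caveat is that an ideal pulse train has equally strong harmonics, so this sketch restricts the search to an assumed plausible tempo range. The function name, the 100 Hz pulse-train rate and the 40 to 200 BPM range are all assumptions.

```python
import numpy as np


def tempo_from_pulse_train(onset_times, duration_s, rate_hz=100.0,
                           min_bpm=40.0, max_bpm=200.0):
    """Estimate BPM from the strongest spectral peak of an onset pulse train."""
    n = int(round(duration_s * rate_hz))
    pulses = np.zeros(n)
    idx = np.round(np.asarray(onset_times) * rate_hz).astype(int)
    pulses[idx[(idx >= 0) & (idx < n)]] = 1.0
    # Removing the mean suppresses the DC component before the FFT.
    spectrum = np.abs(np.fft.rfft(pulses - pulses.mean()))
    freqs_bpm = np.fft.rfftfreq(n, d=1.0 / rate_hz) * 60.0
    in_range = (freqs_bpm >= min_bpm) & (freqs_bpm <= max_bpm)
    return float(freqs_bpm[in_range][np.argmax(spectrum[in_range])])
```

The tempo resolution is set by the length of the excerpt: with 30 seconds of audio the FFT bins are 2 BPM apart.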
Here are some papers to lead you in the right direction:
Here is an extension to what some people are discussing:
Someone mentioned looking into applying a machine learning algorithm: Basically collect a bunch of features from the onset detection functions (mentioned above) and combine them with the raw signal in a neural network/logistic regression and learn what makes a beat a beat.
Look into Dr Andrew Ng; he has free machine learning lectures from Stanford University online (not the long-winded video lectures; there is actually an online distance course).
If you can manage to interface with python code in your project, Echo Nest Remix API is a pretty slick API for python:
There's a method analysis.tempo which will give you the BPM. It can do a whole lot more than simple BPM, as you can see from the API docs or this tutorial
Perform a Fourier transform, and find peaks in the power spectrum. You're looking for peaks below the 20 Hz cutoff for human hearing. I'd guess typically in the 0.1-5ish Hz range to be generous.
SO question that might help: Bpm audio detection Library
Also, here is one of several "peak finding" questions on SO: Peak detection of measured signal
Edit: Not that I do audio processing. It's just a guess based on the fact that you're looking for a frequency domain property of the file...
another edit: It is worth noting that lossy compression formats like mp3, store Fourier domain data rather than time domain data in the first place. With a little cleverness, you can save yourself some heavy computation...but see the thoughtful comment by cobbal.
To repost my answer: The easy way to do it is to have the user tap a button in rhythm with the beat, and count the number of taps divided by the time.
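In code that is only a few lines. Note that it is the number of intervals between taps (taps minus one) that should be divided by the elapsed time. The function name is made up for illustration.

```python
def tap_tempo_bpm(tap_times_s):
    """BPM from tap timestamps in seconds: intervals between taps per minute."""
    if len(tap_times_s) < 2:
        raise ValueError("need at least two taps")
    elapsed = tap_times_s[-1] - tap_times_s[0]
    if elapsed <= 0:
        raise ValueError("tap times must be increasing")
    return 60.0 * (len(tap_times_s) - 1) / elapsed
```

For example, five taps spread evenly over two seconds give 120 BPM.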
Others have already described some beat-detection methods. I want to add that there are some libraries available that provide techniques and algorithms for this sort of task.
Aubio is one of them. It has a good reputation and it's written in C with a C++ wrapper, so you can integrate it easily with a Cocoa application (all the audio stuff in Apple's frameworks is also written in C/C++).
There are several methods to get the BPM but the one I find the most effective is the "beat spectrum" (described here).
This algorithm computes a similarity matrix by comparing each short sample of the music with every other. Once the similarity matrix is computed, it is possible to get the average similarity between all pairs of samples {S(t); S(t+T)} for each time lag T: this is the beat spectrum. The first high peak in the beat spectrum is most of the time the beat duration. The best part is that you can also do things like music structure or rhythm analysis.
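A compact sketch of that computation, assuming the music has already been turned into one feature vector per short frame (for example a magnitude spectrum). The function names are illustrative, cosine similarity is one common choice rather than the only one, and "first high peak" is simplified here to "first local maximum".

```python
import numpy as np


def beat_spectrum(features):
    """features: (n_frames, n_dims) array. Returns B[lag] = mean cosine similarity
    between frames that are `lag` frames apart."""
    f = np.asarray(features, dtype=float)
    norms = np.linalg.norm(f, axis=1, keepdims=True)
    unit = f / np.maximum(norms, 1e-12)
    similarity = unit @ unit.T                     # the similarity matrix
    n = len(f)
    return np.array([np.diagonal(similarity, offset=lag).mean()
                     for lag in range(n // 2)])


def beat_period_frames(spectrum):
    """Lag of the first local maximum after lag 0, or None if there is none."""
    for lag in range(1, len(spectrum) - 1):
        if spectrum[lag] > spectrum[lag - 1] and spectrum[lag] >= spectrum[lag + 1]:
            return lag
    return None
```

Multiplying the returned lag by the frame hop in seconds gives the beat duration, and 60 divided by that gives the BPM.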
I'd imagine this will be easiest in 4-4 dance music, as there should be a single low frequency thud about twice a second.
