I want to develop an app that detects wind from an audio stream.
I need some expert thoughts here, just to give me guidelines or some links. I know this is not an easy task, but I am planning to put a lot of effort into it.
My plan is to detect some common patterns in the stream, and if the values are close to these common patterns of wind noise I will report that a match was found. The closer the values are to the known pattern, the more confident I can be that wind was detected; if the values don't match the patterns, then I assume there is not much wind.
That is my plan for a start, but I need to learn how these things are done. Is there an open project already doing this? Or is there someone doing research on this topic?
The reason I am writing on this forum is that I do not know how to search for this; what I found was not what I was looking for. I really do not know how to start developing this kind of algorithm.
EDIT 1:
I tried to record wind, and when I opened the saved audio file it was just a bunch of numbers :). I do not even know what format I should save it in. Is WAV good enough? Should I use something else? And if I convert the wind noise audio file to MP3, will that help with parsing?
Well, I have many questions, and that is because I do not know where to read more about this kind of topic. I tagged my question with 'guidelines', so I hope someone will help me.
There must be something detectable, because wind noise is so common; there must be some way to detect it. We only need someone familiar with this topic to give some tips.
I just came across this post. I have recently made a library which can detect wind noise in recordings.
I made a model of wind noise, created a database of examples, and then trained a machine learning algorithm to detect and meter the wind level in a perceptually weighted way.
The C++/C code is here if it is of use to anyone!
The science for your problem is called "pattern classification", especially the subfield of "audio pattern classification". The task is abstracted as classifying a sound recording into two classes (wind and not wind). You seem to have no strong background in signal processing yet, so let me insert one central warning:
Pattern classification is not as easy as it looks at first. Humans excel at pattern classification. Computers don't.
A good first approach is often to compute the correlation of the Fourier transform of your signal and a sample. Don't know how much that will depend on wind speed, however.
You might want to have a look at the bag-of-frames approach, it was used successfully to classify ambient noise.
As @thiton mentioned, this is an example of audio pattern classification.
Main characteristics of wind: it's shaped (band-pass/high-pass filtered) white noise with small, semi-random fluctuations in amplitude and pitch. At least that's how most synthesizers reproduce it, and it sounds quite convincing.
You have to check the spectral content and how it changes over the course of the file, so you'll need an FFT. The input format doesn't really matter, but obviously raw material (WAV) is better.
Once you have that, you should check that the signal is close to some kind of colored noise, then perhaps extract a series of pitch and amplitude values and feed that data set to a classic pattern classification algorithm. I think supervised learning could work here.
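A minimal sketch of that kind of frame-by-frame spectral feature extraction; the band edge, frame size, and any thresholds you pick on top of it are illustrative assumptions, not values from the answer:

```python
import numpy as np
from scipy.io import wavfile

def wind_features(path, frame_len=2048, hop=1024):
    """Per-frame spectral features that tend to separate wind-like noise
    from tonal or speech-like content (decision thresholds are up to you)."""
    rate, x = wavfile.read(path)
    if x.ndim > 1:                      # mix down to mono
        x = x.mean(axis=1)
    x = x.astype(np.float64)
    feats = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        mag = np.abs(np.fft.rfft(frame)) + 1e-12
        freqs = np.fft.rfftfreq(frame_len, 1.0 / rate)
        low_ratio = mag[freqs < 500].sum() / mag.sum()           # wind: lots of low-band energy
        flatness = np.exp(np.mean(np.log(mag))) / np.mean(mag)   # close to 1 for noise-like frames
        feats.append((low_ratio, flatness))
    return np.array(feats)

# Example: inspect how wind_features("wind.wav") fluctuates over time
# before picking thresholds or feeding the values to a classifier.
```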
This is actually a hard problem to solve.
Assume you have only single-microphone data. The raw data you get when you open an audio file (the time-domain signal) has some, but not a lot of, information for this kind of processing. You need to go into the frequency domain using FFTs, look at the statistics of the frequency bins, and use those to build a classifier such as an SVM or random forest.
With all due respect to @Karoly-Horvath, I would also not use any recordings that have undergone compression, such as MP3. Audio compression algorithms always distort the higher frequencies, which, as it turns out, are an important feature in detecting wind noise. If possible, get the raw PCM data from a mic.
You also need to make sure your recording is sampled at 24 kHz or higher, so you have information about the signal up to 12 kHz.
Finally, the shape of wind in the frequency domain is not simply filtered white noise. It typically has high energy in the low frequencies (a rumbling type of sound) with shearing and flapping sounds in the high frequencies. The high-frequency energy is quite transient, so if your FFT size is too big, you will miss this important feature.
If you have data from two microphones, this gets a little easier. Wind, when recorded, is a local phenomenon. Sure, in recordings you can hear the rustling of leaves or the sound of chimes caused by the wind, but that is not wind noise and should not be filtered out.
The actual annoying wind noise you hear in a recording is the air hitting the membrane of your microphone. That effect is a local event, and it can be exploited if you have two microphones, because the event is local to each individual mic and is not correlated with the other mic. Of course, where the two mics are placed in relation to each other also matters: they have to be reasonably close to each other (say, within 8 inches).
A time-domain correlation can then be used to determine the presence of wind noise. (All the other recorded sounds are correlated between the mics because they are fairly close to each other, so a high correlation means no wind and a low correlation means wind.) If you go with this approach, your input audio file need not be uncompressed; a reasonable compression algorithm won't affect this.
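A rough sketch of that two-microphone check; the frame length and decision threshold are assumptions:

```python
import numpy as np
from scipy.io import wavfile

def wind_frames(path, frame_len=4096, corr_threshold=0.5):
    """Flag frames where the two channels are poorly correlated, which this
    answer uses as an indicator of wind hitting the microphone membranes."""
    rate, x = wavfile.read(path)
    assert x.ndim == 2 and x.shape[1] >= 2, "needs a 2-channel recording"
    left, right = x[:, 0].astype(float), x[:, 1].astype(float)
    flags = []
    for start in range(0, len(left) - frame_len, frame_len):
        a = left[start:start + frame_len]
        b = right[start:start + frame_len]
        denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
        corr = float((a * b).sum() / denom)      # normalized correlation at lag 0
        flags.append(corr < corr_threshold)      # low correlation -> likely wind
    return np.array(flags)
```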
I hope this overview helps.
Related
Consider a situation where you have multiple microphones, each capable of transmitting the audio they pick up over a wifi network (meaning that the audio can be time delayed by several milliseconds or more).
Is there an algorithm that can combine the audio from multiple microphones to produce a higher quality audio recording, detecting and correcting for any time delay?
For the detection/correction of time delay, you might want to look for "feature extraction". It determines key points in the audio to match up.
This works best if all the microphones are hearing (roughly) the same thing, though. For a studio-type environment, where each mic is directional and aimed at a different instrument, it may have a very hard time identifying common features.
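For the time-delay part specifically, a simple approach is to estimate each mic's lag against a reference channel with cross-correlation and shift before mixing. A minimal sketch, assuming the streams have already been resampled to a common rate:

```python
import numpy as np

def estimate_lag(reference, signal):
    """Lag (in samples) by which `signal` trails `reference`, taken from
    the peak of the full cross-correlation."""
    corr = np.correlate(signal, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)

def align_and_mix(channels):
    """Shift every channel so it lines up with channels[0], then average."""
    ref = channels[0]
    aligned = [ref]
    for ch in channels[1:]:
        lag = estimate_lag(ref, ch)
        shifted = ch[lag:] if lag > 0 else np.pad(ch, (-lag, 0))
        aligned.append(shifted[:len(ref)])
    n = min(len(a) for a in aligned)
    return np.mean([a[:n] for a in aligned], axis=0)
```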
I'm unsure of what "higher quality" means to you, though. I assume you mean the least amount of noise. If that's the case, you might be interested in this answer, which is about noise detection. You can calculate the signal/noise ratio of each input and weight them as you see fit when combining.
There are other ways to reduce noise as well. You could simply run one of many noise reduction techniques on each input, or on the mixed output.
If you mean something else by "quality", then you might be headed into tougher areas. There is a reason professional mixers get paid: computers aren't good at telling what sounds "better".
Of course, there may not be any need to reinvent the wheel at all. There are probably several open-source programs that do this kind of stuff. I'd think the Audacity source would have everything you want.
I have the following problem: I have 2 signals over time. They are from the same source so they should be the same. I want to check if they really are.
Complications:
they may be measured at different sample rates
the start/end times do not coincide; the measurements do not start and end at the same time
there may be a time offset between the two signals
My thoughts go towards Fourier transforms, convolution, and statistical methods for comparison. Can someone post some links where I can find more information on how to handle this?
You can easily correct for the phase by just shifting them so their centers of mass line up. (Or alternatively, in the Fourier domain just multiplying by the inverse of the phase of the first coefficient.)
Similarly, if you want to line up the signals given only partial data, you can just cross-correlate and take the maximal value (which is again easy to do in the Fourier domain).
That leaves the only tricky part of this process as dealing with the sampling rates. Now if you know a-priori what the sample rates are, (and if they are related by a rational number), you can just use sinc interpolation/downsampling to rescale them to a common sampling rate:
https://ccrma.stanford.edu/~jos/st/Bandlimited_Interpolation_Time_Limited_Signals.html
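If you do know both rates, a common-rate resampling step might look like this; a sketch using scipy's polyphase resampler rather than the exact sinc interpolation described at that link:

```python
from fractions import Fraction

from scipy.signal import resample_poly

def to_common_rate(x, fs_x, y, fs_y):
    """Resample both signals to the higher of the two sample rates,
    assuming the rates are related by a rational factor."""
    fs = max(fs_x, fs_y)

    def resample(sig, fs_in):
        ratio = Fraction(fs, fs_in).limit_denominator(1000)
        return resample_poly(sig, ratio.numerator, ratio.denominator)

    return resample(x, fs_x), resample(y, fs_y), fs
```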
If you don't know the sampling rate, you may be a bit screwed. Technically, you can try just brute forcing over all the different rescalings of your signal, but doing this tends to be either slow or else give mediocre results.
As a last suggestion, if you just want to match sounds exactly you can try using the cepstrum and verifying that the peaks of the signal are close enough to within some tolerance. This type of analysis is used a lot in sound and speech recognition, with some refinements to make it operate a bit more locally. It tends to work best with frequency modulated data like speech and music:
http://en.wikipedia.org/wiki/Cepstrum
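A bare-bones version of that cepstral comparison, assuming the two signals have already been brought to the same rate and length; the peak count and tolerance are assumptions, and real systems do this per short frame with liftering:

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(x)) + 1e-12
    return np.fft.irfft(np.log(spectrum))

def cepstral_peaks_match(x, y, n_peaks=5, tolerance=3):
    """Compare the locations of the strongest cepstral peaks of two signals."""
    def top_peaks(sig):
        c = real_cepstrum(sig)[1:len(sig) // 2]     # skip the zero-quefrency bin
        return np.sort(np.argsort(np.abs(c))[-n_peaks:])
    px, py = top_peaks(x), top_peaks(y)
    return bool(np.all(np.abs(px - py) <= tolerance))
```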
Fourier transformation does sound like the right way.
There is too much mathematical background to explain here, so if you really want to know what's going on (and I don't think you can just use the FT without understanding it), you should use this reference from MIT OpenCourseWare: http://ocw.mit.edu/courses/mathematics/18-103-fourier-analysis-theory-and-applications-spring-2004/lecture-notes/
Hope it helped.
If you are working with a linux box and the waveforms that need to be processed have already been recorded, you can try to use the file command to display details about the recording. It gives you the sampling rate when it is invoked on a wav file, though I am not sure what format you are recording in.
If the signals are time-shifted with respect to each other, you can try convolving one with a delta function at increasing delays and then comparing. In MATLAB, conv and related functions should be good enough.
These are just 'crude' attempts (almost like hacking at the problem). There may be shift-invariant algorithms that do a better job.
Hope that helps.
I'm not entirely sure this is the correct stack exchange subsite to post this question to, but...
I'm looking for an algorithm that I can use to determine with a decent amount of certainty whether a given piece of audio is music or not. Just a boolean result is fine; I don't need to know the key, BPM, or anything like that, I just need to be able to determine whether it appears to be music (as opposed to speech). Programming language is irrelevant, but I'll end up converting it to Python.
In a phrase, Fourier analysis. Look at the power of different frequencies over time. Here's speech, and here's violin playing. The former shows dramatic changes with every syllable; the 'flow' is very disjoint and could be picked up by an algorithm which took the derivative of the different frequency bands as a function of time. In paradigmatic music, on the other hand, the transitions are much smoother and the tones are purer (less 'blur' in the graph). See also the 'spectrogram' wikipedia page.
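A sketch of that idea: measure how "disjoint" the band energies are over time and compare the score against a threshold you tune on labelled examples. The frame size, band count, and any threshold value are assumptions:

```python
import numpy as np
from scipy.io import wavfile

def band_flux_score(path, frame_len=1024, hop=512, n_bands=16):
    """Average absolute frame-to-frame change in log band energies.
    Speech tends to score higher (abrupt syllables) than sustained music."""
    rate, x = wavfile.read(path)
    if x.ndim > 1:
        x = x.mean(axis=1)
    x = x.astype(np.float64)
    x /= (np.abs(x).max() + 1e-12)
    frames = []
    for start in range(0, len(x) - frame_len, hop):
        mag = np.abs(np.fft.rfft(x[start:start + frame_len] * np.hanning(frame_len)))
        bands = np.array_split(mag, n_bands)
        frames.append([np.log(b.sum() + 1e-12) for b in bands])
    frames = np.array(frames)
    return float(np.mean(np.abs(np.diff(frames, axis=0))))

# Tune a cutoff on known speech and music clips, then classify new clips
# by whether their score falls above or below it.
```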
What you could do is set up a few Karplus-Strong resonance rings, run the audio through them, and just monitor the level of energy in each ring.
If it is Western music, it is pretty much all tuned to 12-TET, i.e. a logarithmic 12-tone scale based around concert pitch A4 = 440 Hz.
So just pick 3 or 4 notes equally spaced through the octave, e.g. C5 (omit C#, D, D#), E5 (omit F, F#, G), G#5 (omit A, A#, B),
and at least one of those rings will be flaring regularly: whichever key the music is in, it's probably going to hit one of those notes quite a lot.
Ideally do it for a bunch of notes, but if you need this in real time it can get a bit heavy, feeding your audio simultaneously into 50 rings.
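Not the ring implementation itself, but a sketch of the "monitor energy at a few note frequencies" idea using the Goertzel algorithm; the note choice and block size are assumptions:

```python
import numpy as np

NOTE_FREQS = {"C5": 523.25, "E5": 659.26, "G#5": 830.61}   # 12-TET, A4 = 440 Hz

def goertzel_power(block, freq, rate):
    """Energy of `block` at a single frequency via the Goertzel recurrence."""
    n = len(block)
    k = int(round(n * freq / rate))
    coeff = 2.0 * np.cos(2.0 * np.pi * k / n)
    s_prev = s_prev2 = 0.0
    for sample in block:
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

def note_energies(x, rate, block_len=4096):
    """Per-block energy at each monitored note frequency."""
    out = []
    for start in range(0, len(x) - block_len, block_len):
        block = x[start:start + block_len]
        out.append({name: goertzel_power(block, f, rate)
                    for name, f in NOTE_FREQS.items()})
    return out
```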
Alternatively you could even use a pitch detector, catalogue the detected pitches, and look at ratios of log(noteAfreq):log(noteBfreq) to see whether they arrange themselves into low-order fractions like 3:4 ± 0.5%. But I don't think anyone has built a decent polyphonic pitch detector; it is practically impossible.
Melodyne might have pulled it off
If it's just a vocal signal you can e-mail me.
For some reason this question has attracted a large number of really bad answers.
Use pyAudioAnalysis. Also, Google "audio feature analysis".
On its surface, this sounds like a hard problem, but there's been an explosion of great work on classifiers in the past 20 years, so many well-documented solutions exist. Most classifiers today can figure this out with an error rate of only a few percent. Some classifiers can even figure out what genre of music it is.
Most current algorithms for doing this break down into two steps: extracting a set of statistical representations of the input audio (features), and then automatically classifying the input based on previous training data.
pyAudioAnalysis is one library for extracting these features and then training a kNN or other mixed model based on the detected features. There are many more comparable libraries, such as Essentia for C++. Essentia also has Python bindings.
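If you'd rather roll a minimal version of the same feature-plus-classifier pattern yourself, it looks roughly like this with NumPy and scikit-learn; the specific features and the kNN choice are illustrative, not what pyAudioAnalysis does internally:

```python
import numpy as np
from scipy.io import wavfile
from sklearn.neighbors import KNeighborsClassifier

def clip_features(path, frame_len=2048, hop=1024):
    """A few cheap per-clip features: zero-crossing rate, spectral centroid,
    and spectral flux, each averaged over the frames of the clip."""
    rate, x = wavfile.read(path)
    if x.ndim > 1:
        x = x.mean(axis=1)
    x = x.astype(np.float64)
    x /= (np.abs(x).max() + 1e-12)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / rate)
    zcr, centroid, flux, prev = [], [], [], None
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len]
        zcr.append(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)
        mag = np.abs(np.fft.rfft(frame * np.hanning(frame_len)))
        centroid.append(float((freqs * mag).sum() / (mag.sum() + 1e-12)))
        if prev is not None:
            flux.append(float(np.sum(np.maximum(mag - prev, 0.0))))
        prev = mag
    return [np.mean(zcr), np.mean(centroid), np.mean(flux)]

def train(music_paths, speech_paths):
    """Train a kNN on labelled example clips (paths are hypothetical)."""
    X = [clip_features(p) for p in music_paths + speech_paths]
    y = [1] * len(music_paths) + [0] * len(speech_paths)
    return KNeighborsClassifier(n_neighbors=3).fit(X, y)
```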
An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics is a good introductory book.
Look for a small "first differential" over a sequence of FFTs that are in the range of musical tones (i.e. 1024 samples per chunk run through an FFT, then plot chunk1-chunk0, chunk2-chunk1, ...). As a first approximation, this should be enough to detect simple things.
This is the sort of algorithm that could be tweaked forever, even in genre-specific ways. Music itself is generally periodic as well, so it may be worth coming up with a way to run FFTs over the FFTs. And the idea of looking for a consistent twelfth-root-of-two spread of outstanding frequencies sounds really plausible.
I bet you were hoping to find this sitting in a free Python library for you to simply drop a file into. :-)
How can you determine the best audio quality in a list of audio files of the same audio clip, without looking at the audio files' headers? The tricky part is that the files came from different formats and bit rates, and they were all transcoded to the same format and bit rate. How can this be done efficiently?
Many of the answers outlined here refer to common audio measurements such as THD+N, SNR, etc. However, these do not always correlate well with human hearing of audio artifacts. Lossy audio compression techniques typically increase THD+N and reduce SNR, but aim to do so in ways that are difficult for the human ear to detect. A more traditional audio measurement technique may find decreased SNR in a certain frequency band, but does that matter if there's so much energy in adjacent bands that no one would ever notice the difference?
The research paper titled "A Perceptual Audio Quality Measure Based on a Psychoacoustic Sound Representation" outlines an algorithm for quantifying the ability of the human ear to detect audible differences, based on a model of how the ear hears. It takes into account factors that do correlate with audio quality as perceived by humans. The paper includes a study comparing their algorithm's results to subjective double-blind testing, to give you an idea of how well their model works.
I could not find a free copy of this paper but a decent university library should have it on file.
Implementing the algorithm would require some knowledge of audio signal processing in the frequency domain. An undergraduate with DSP experience should be able to implement it. If you don't have the reference waveform, you could use information in this paper to quantify how objectionable artifacts may be.
The algorithm would work on PCM audio, preferably time-aligned, and certainly does not require knowledge of the file type or header.
I'm not a software developer (I'm an audio engineer), and what you hear when you compress with MP3 algorithms is:
- fewer high frequencies: so you can check for a loss of energy in the higher range
- distorted stereo: so you can build a Mid/Side matrix and check the THD in the Side channel
- less phase coherency: maybe you can check that with a correlation meter
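A minimal sketch of the first two checks, assuming stereo PCM; the 10 kHz cutoff is an arbitrary assumption, and the Side check here only looks at energy balance rather than proper THD:

```python
import numpy as np
from scipy.io import wavfile

def mp3_damage_hints(path, hf_cutoff=10000.0):
    """Two crude indicators of lossy-codec damage: high-frequency energy ratio
    and the Side/Mid energy ratio of a Mid/Side decomposition."""
    rate, x = wavfile.read(path)
    assert x.ndim == 2 and x.shape[1] == 2, "expects a stereo file"
    left, right = x[:, 0].astype(np.float64), x[:, 1].astype(np.float64)
    mid, side = (left + right) / 2.0, (left - right) / 2.0

    mag = np.abs(np.fft.rfft(mid))
    freqs = np.fft.rfftfreq(len(mid), 1.0 / rate)
    hf_ratio = mag[freqs >= hf_cutoff].sum() / (mag.sum() + 1e-12)

    side_ratio = (side ** 2).sum() / ((mid ** 2).sum() + 1e-12)
    return {"hf_energy_ratio": float(hf_ratio),
            "side_to_mid_energy": float(side_ratio)}
```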
Hope it helps, it's a difficult task for a computer!
First, I'm not an audio engineer, but I've been trying to keep in touch about audio compression in general because I have a big mp3 collection and I have some thoughts to share about the subject.
Is the best audio quality you're looking for from a human perspective? If so, you can't measure it by "objective means" like comparing spectrograms and such.
If a spectrogram is ugly, that doesn't necessarily mean the quality is terrible. What matters is whether someone can distinguish an encoded file from the original source in a blind test. Period. If you want to check the quality of an encoded audio track you have to conduct a blind ABX test.
LAME (and all other kinds of lossy MP3, AAC, AC3, DTS, ATRAC... compressors) is so called perceptual coder. It exploits certain facts about the nature of human audio perception. So, you cannot rely simply on spectrograms to evaluate its quality.
Source
Now, if your objectives are from objective manners/perspectives, you could use EAQUAL, which stands for Evaluation Of Audio Quality:
It's an objective measurement technique used to measure the quality of encoded/decoded audio files (very similar to PEAQ).
(...)
The results, however, when using objective testing methodologies are still inconclusive and mostly only used by codec developers and researchers.
...or Friedman statistical analysis tool.
(...) performs several statistical analysis on data sets, which is particularly suited for listening test data.
I'm not saying that spectrum analyzers are useless. That's why I posted some utilities. I'm just saying to be careful with all these statistical methods: as someone in the Hydrogenaudio community once said, "You don't listen with your eyes." (Check this thread I posted as well; it's a great resource.) To really prove audio quality from a human perspective, you should test ears, not graphs.
This is a complicated subject, and IMHO I suggest you to look for a specialized audio community like Hydrogenaudio.
If I understand correctly, you have a bunch of audio files that started in different formats with varying quality. They've all been converted to the same format, so you can't use the header to figure out which ones were originally high quality and which ones weren't.
This is a hard problem. There are potentially a few tricks that could catch some quality problems, but detecting, say, something that was converted from a low-bitrate compression algorithm like MP3 would be very hard.
Some easy tricks:
Check the maximum amplitude - if it's low, the quality won't be good.
Measure the highest frequency - if it's low, the original might have had a lower sample rate.
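A rough sketch of those two checks; the -60 dB floor used to call a frequency "significant" is an arbitrary assumption:

```python
import numpy as np
from scipy.io import wavfile

def quick_quality_hints(path, floor_db=-60.0):
    """Peak amplitude plus an estimate of the highest frequency that still
    carries energy above a chosen floor, relative to the spectral peak."""
    rate, x = wavfile.read(path)
    if x.ndim > 1:
        x = x.mean(axis=1)
    x = x.astype(np.float64)
    peak = np.abs(x).max() / np.iinfo(np.int16).max   # assumes a 16-bit source

    mag = np.abs(np.fft.rfft(x)) + 1e-12
    freqs = np.fft.rfftfreq(len(x), 1.0 / rate)
    db = 20.0 * np.log10(mag / mag.max())
    above = np.nonzero(db > floor_db)[0]
    highest_freq = float(freqs[above[-1]]) if len(above) else 0.0
    return {"peak_amplitude": float(peak),
            "highest_significant_freq_hz": highest_freq}
```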
If you have the original, you can estimate how the file was altered by estimating a transfer function. You will need to assume some model; maybe start with a low-pass filter, add some smudging (convolution), and then run an estimator to produce a measure of quality. You could look at the Wikipedia article on Estimation_theory.
I think that disown's answer is good, assuming that you are just trying to estimate a set of parameters. Unfortunately, you also have to define a comparison function for the parameters you have estimated.
What happens if two compressions have both applied a band-pass filter with equally large frequency ranges, but one of them admits higher frequencies than the other? Is one of them better? Which one?
The answer probably depends on which frequencies are being used more in the files you are working with.
An objective measure would be to see which file has lost less entropy. Unfortunately, this is not easy to do correctly.
I'm not too sure about this, but here's a good place to start:
http://en.wikipedia.org/wiki/Signal-to-noise_ratio
I don't think you can calculate SNR from one signal, but if you have a collection of signals then you might be able to work out the SNR comparing all of them.
There are some interesting links at the bottom of the page which could provide some routes of interest as well if that isn't possible.
Also, I'm not an audio engineer, but I know a little about signal processing, is there any way you can measure quantisation levels in audio signals? Perhaps something to look into.
If you do not have the original audio, this is probably a lot of work; it's almost certainly fundamentally impossible in an absolute sense since you can't tell which track's peculiarities are intentional and which bogus. You may even have encodings from different recordings or mixes, in which case plain comparison is fairly meaningless in any case.
Thus, assuming you do not have the original, the best you can probably do is a heuristic approach - which will probably work quite well, but be a lot of effort to implement.
Invest in some audio-processing software and skill; use this to build software that identifies common encoder defects heuristically, based solely on the output. Such defects might be poor temporal locality of sound hits (suggesting overlarge windows in the compression), high correlation between left and right channels, limited frequency range, etc. (a person with the right experience can probably list dozens).
Rate the quality of the audio on each heuristic on some sliding scale.
Use common sense and as much time and as many people for testing as you can to weigh the various factors for relevance. For example, while it might be nice to have frequency reproduction up to 24 kHz, it's not very important; on the other hand, lack of sharpness may be more annoying.
If you're lucky, someone's done the job before you, because this sounds like an expensive proposition.
A New Perceptual Quality Measure for Bit Rate Reduced Audio
http://citeseer.ist.psu.edu/cache/papers/cs/15888/http:zSzzSzwww-ft.ee.tu-berlin.dezSzPublikationenzSzpaperszSzAES1996Copenhagen.pdf/a-new-perceptual-quality.pdf
Perceptual audio coding algorithms perform a drastic irrelevancy reduction in order to achieve a high coding gain. Signal components that are assumed to be unperceivable are not transmitted and the coding noise is spectrally shaped according to the masking threshold of the audio signal. Simple quality measures (e.g. signal to noise ratio, harmonic distortions), which can not separate these inaudible artefacts from audible errors, can not be used to assess the performance of such coders.
For the quality evaluation of perceptual audio codecs, appropriate measurement algorithms are needed, which detect and assess audible artefacts by comparing the output of the codec with the uncoded reference. A filter bank based perceptual model is presented, which yields better temporal resolution than FFT-based approaches and thus allows a more precise modelling of pre- and post-masking and a refined analysis of the envelopes within each filter channel.
See Also
http://academic.research.microsoft.com/Paper/201987.aspx?viewType=1
How can the tempo/BPM of a song be determined programmatically? What algorithms are commonly used, and what considerations must be made?
This is challenging to explain in a single StackOverflow post. In general, the simplest beat-detection algorithms work by locating peaks in sound energy, which is easy to detect. More sophisticated methods use comb filters and other statistical/waveform methods. For a detailed explication including code samples, check this GameDev article out.
The keywords to search for are "Beat Detection", "Beat Tracking" and "Music Information Retrieval". There is lots of information here: http://www.music-ir.org/
There is a (maybe) annual contest called MIREX where different algorithms are tested on their beat detection performance.
http://nema.lis.illinois.edu/nema_out/mirex2010/results/abt/mck/
That should give you a list of algorithms to test.
A classic algorithm is Beatroot (google it), which is nice and easy to understand. It works like this:
Short-time FFT the music to get a sonogram.
Sum the increases in magnitude over all frequencies for each time step (ignore the decreases). This gives you a 1D time-varying function called the "spectral flux".
Find the peaks using any old peak detection algorithm. These are called "onsets" and correspond to the start of sounds in the music (starts of notes, drum hits, etc).
Construct a histogram of inter-onset-intervals (IOIs). This can be used to find likely tempos.
Initialise a set of "agents" or "hypotheses" for the beat-tracking result. Feed these agents the onsets one at a time in order. Each agent tracks the list of onsets that are also beats, and the current tempo estimate. The agents can either accept the onsets, if they fit closely with their last tracked beat and tempo, ignore them if they are wildly different, or spawn a new agent if they are in-between. Not every beat requires an onset - agents can interpolate.
Each agent is given a score according to how neat its hypothesis is - if all its beat onsets are loud it gets a higher score. If they are all regular it gets a higher score.
The highest scoring agent is the answer.
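A bare-bones sketch of steps 1-4 (spectral flux, naive peak picking, and an inter-onset-interval histogram); the window sizes, smoothing, and threshold are assumptions, not Beatroot's actual parameters:

```python
import numpy as np

def spectral_flux(x, rate, frame_len=1024, hop=512):
    """Steps 1-2: short-time FFT, then sum the positive magnitude increases."""
    flux, prev = [], None
    for start in range(0, len(x) - frame_len, hop):
        mag = np.abs(np.fft.rfft(x[start:start + frame_len] * np.hanning(frame_len)))
        if prev is not None:
            flux.append(np.sum(np.maximum(mag - prev, 0.0)))
        prev = mag
    return np.array(flux), hop / rate          # flux curve and seconds per step

def onsets_and_tempo_guess(flux, step_s, threshold_ratio=1.5):
    """Steps 3-4: local maxima above a running-mean threshold become onsets,
    then the histogram of inter-onset intervals suggests a tempo."""
    mean = np.convolve(flux, np.ones(16) / 16.0, mode="same")
    onset_idx = [i for i in range(1, len(flux) - 1)
                 if flux[i] > flux[i - 1] and flux[i] >= flux[i + 1]
                 and flux[i] > threshold_ratio * mean[i]]
    onset_times = np.array(onset_idx) * step_s
    iois = np.diff(onset_times)
    hist, edges = np.histogram(iois, bins=np.arange(0.2, 1.2, 0.02))
    likely_period = edges[np.argmax(hist)]     # candidate beat period in seconds
    return onset_times, 60.0 / likely_period   # onset times and a BPM guess
```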
Downsides to this algorithm in my experience:
The peak-detection is rather ad-hoc and sensitive to threshold parameters and whatnot.
Some music doesn't have obvious onsets on the beats. Obviously it won't work with those.
Difficult to know how to resolve the 60bpm-vs-120bpm issue, especially with live tracking!
Throws away a lot of information by only using a 1D spectral flux. I reckon you can do much better by having a few band-limited spectral fluxes (and maybe one broadband one for drums).
Here is a demo of a live version of this algorithm, showing the spectral flux (black line at the bottom) and onsets (green circles). It's worth considering the fact that the beat is extracted from only the green circles. I've played back the onsets just as clicks, and to be honest I don't think I could hear the beat from them, so in some ways this algorithm is better than people at beat detection. I think the reduction to such a low-dimensional signal is its weak step though.
Annoyingly, I found a very good site with many algorithms and code for beat detection a few years ago, but I've totally failed to find it again.
Edit: Found it!
Here are some great links that should get you started:
http://marsyasweb.appspot.com/
http://www.vamp-plugins.org/download.html
Beat extraction involves the identification of cognitive metric structures in music. Very often these do not correspond to physical sound energy - for example, in most music there is a level of syncopation, which means that the "foot-tapping" beat that we perceive does not correspond to the presence of a physical sound. This means that this is a quite different field to onset detection, which is the detection of the physical sounds, and is performed in a different way.
You could try the Aubio library, which is a plain C library offering both onset and beat extraction tools.
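If Python suits you, aubio also ships Python bindings; here is a BPM sketch along the lines of aubio's own demo scripts (function names may differ slightly between aubio versions, so treat this as a starting point):

```python
import numpy as np
from aubio import source, tempo

def file_bpm(path, win_s=1024, hop_s=512):
    """Estimate BPM from the median interval between detected beats."""
    src = source(path, 0, hop_s)           # 0 = use the file's own sample rate
    rate = src.samplerate
    beat_tracker = tempo("specdiff", win_s, hop_s, rate)
    beat_times = []
    while True:
        samples, read = src()
        if beat_tracker(samples):
            beat_times.append(beat_tracker.get_last_s())
        if read < hop_s:
            break
    if len(beat_times) < 2:
        return 0.0
    return float(60.0 / np.median(np.diff(beat_times)))
```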
There is also the online Echonest API, although this involves uploading an MP3 to a website and retrieving XML, so it might not be so suitable.
EDIT: I came across this last night - a very promising looking C/C++ library, although I haven't used it myself. Vamp Plugins
The general area of research you are interested in is called MUSIC INFORMATION RETRIEVAL
There are many different algorithms that do this but they all are fundamentally centered around ONSET DETECTION.
Onset detection measures the start of an event; the event in this case is a note being played. You can look for changes in the weighted Fourier transform (high frequency content), or you can look for large changes in spectral content (spectral difference). (There are a couple of papers further down that I recommend you look into.) Once you apply an onset detection algorithm, you pick off where the beats are via thresholding.
There are various algorithms that you can use once you've got that time localization of the beat. You can turn it into a pulse train (create a signal that is zero for all time and 1 only when your beat happens), then apply an FFT to that, and BAM, now you have the frequency of onsets at the largest peak.
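A quick sketch of that pulse-train idea, assuming you already have onset times in seconds from some onset detector; the frame rate and BPM range are assumptions:

```python
import numpy as np

def tempo_from_onsets(onset_times_s, frame_rate=100.0, bpm_range=(40.0, 240.0)):
    """Build a pulse train from onset times, FFT it, and read the tempo off
    the largest spectral peak inside a plausible BPM range."""
    onset_times_s = np.asarray(onset_times_s, dtype=float)
    pulses = np.zeros(int((onset_times_s[-1] + 1.0) * frame_rate))
    pulses[(onset_times_s * frame_rate).astype(int)] = 1.0

    spectrum = np.abs(np.fft.rfft(pulses))
    freqs = np.fft.rfftfreq(len(pulses), 1.0 / frame_rate)   # in Hz
    lo, hi = bpm_range[0] / 60.0, bpm_range[1] / 60.0
    band = (freqs >= lo) & (freqs <= hi)
    peak_hz = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_hz
```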
Here are some papers to lead you in the right direction:
https://web.archive.org/web/20120310151026/http://www.elec.qmul.ac.uk/people/juan/Documents/Bello-TSAP-2005.pdf
https://adamhess.github.io/Onset_Detection_Nov302011.pdf
Here is an extension to what some people are discussing:
Someone mentioned looking into applying a machine learning algorithm: basically, collect a bunch of features from the onset detection functions (mentioned above), combine them with the raw signal in a neural network/logistic regression, and learn what makes a beat a beat.
Look into Dr. Andrew Ng; he has free machine learning lectures from Stanford University online (not the long-winded video lectures, there is actually an online distance course).
If you can manage to interface with python code in your project, Echo Nest Remix API is a pretty slick API for python:
There's a method analysis.tempo which will give you the BPM. It can do a whole lot more than simple BPM, as you can see from the API docs or this tutorial
Perform a Fourier transform, and find peaks in the power spectrum. You're looking for peaks below the 20 Hz cutoff for human hearing. I'd guess typically in the 0.1-5ish Hz range to be generous.
SO question that might help: Bpm audio detection Library
Also, here is one of several "peak finding" questions on SO: Peak detection of measured signal
Edit: Not that I do audio processing. It's just a guess based on the fact that you're looking for a frequency domain property of the file...
another edit: It is worth noting that lossy compression formats like MP3 store Fourier-domain data rather than time-domain data in the first place. With a little cleverness, you can save yourself some heavy computation... but see the thoughtful comment by cobbal.
To repost my answer: The easy way to do it is to have the user tap a button in rhythm with the beat, and count the number of taps divided by the time.
Others have already described some beat-detection methods. I want to add that there are some libraries available that provide techniques and algorithms for this sort of task.
Aubio is one of them, it has a good reputation and it's written in C with a C++ wrapper so you can integrate it easily with a cocoa application (all the audio stuff in Apple's frameworks is also written in C/C++).
There are several methods to get the BPM but the one I find the most effective is the "beat spectrum" (described here).
This algorithm computes a similarity matrix by comparing each short sample of the music with every other. Once the similarity matrix is computed, you can get the average similarity between all sample pairs separated by a given time interval T: this is the beat spectrum. The first high peak in the beat spectrum is most of the time the beat duration. The best part is you can also do things like music structure or rhythm analysis.
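A rough sketch of that beat-spectrum computation; the frame size, the plain magnitude-spectrum features, and the lag search range are assumptions (the original work uses richer features and a fancier peak picker):

```python
import numpy as np

def beat_spectrum_bpm(x, rate, frame_len=2048, hop=512, bpm_range=(60.0, 180.0)):
    """Foote-style beat spectrum: cosine similarity between all frame pairs,
    averaged per lag; the strongest lag in the tempo range gives the BPM."""
    frames = []
    for start in range(0, len(x) - frame_len, hop):
        mag = np.abs(np.fft.rfft(x[start:start + frame_len] * np.hanning(frame_len)))
        frames.append(mag / (np.linalg.norm(mag) + 1e-12))
    F = np.array(frames)
    sim = F @ F.T                                   # cosine similarity matrix

    step_s = hop / rate
    min_lag = int(60.0 / bpm_range[1] / step_s)
    max_lag = int(60.0 / bpm_range[0] / step_s)
    spectrum = [np.mean(np.diag(sim, k=lag)) for lag in range(min_lag, max_lag + 1)]
    best_lag = min_lag + int(np.argmax(spectrum))
    return 60.0 / (best_lag * step_s)
```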
I'd imagine this will be easiest in 4-4 dance music, as there should be a single low frequency thud about twice a second.