I've read a lot of articles and discussions about the FFT and pitch-detection algorithms (autocorrelation, overlapping windows, HPS...). When you play the piano, there isn't just one frequency but many, both high and low, so what is the best method for piano?
I don't know much about what you're trying to do, but it seems to me you need to run some high-pass and low-pass filters to isolate your data set, so you can separate the pitch from the noise. A piano produces many frequencies that give its sound its tonal character, but if you only care about pitch (the exact note being hit), then the rest of the waveform is noise to you. Hope this helps.
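If it helps, here's a minimal SciPy sketch of that band-limiting idea; the function name and the A0..C8 cutoffs are my own choices, and `x`/`fs` stand in for whatever samples and sample rate you actually have:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(x, fs, low_hz=27.5, high_hz=4186.0, order=4):
    """Keep roughly the piano's fundamental range (A0..C8), drop the rest."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)  # zero-phase, so note onsets aren't shifted

fs = 44100.0
t = np.arange(int(fs)) / fs
x = np.sin(2 * np.pi * 440.0 * t)   # stand-in for one second of recorded audio
clean = bandpass(x, fs)
```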
My app processes samples from microphone audio streams. The task I'm asking about is to programmatically make a good guess at which ranges of the audio stream's samples should be considered signal versus noise. The "signal", in this case, is human speech.
The audio I'm getting from users comes from recording environments that I can't control or know much about. These users may be speaking into a professional microphone from a treated space or into a crummy laptop mic in their living room. Very noisy environments with excessive background noise, e.g. a busy restaurant, are outside of what I need to accommodate.
It would make the analysis simpler, but I don't want to ask the user to record a room-noise sample within my app. And I don't want the user to manually designate a range of audio as silence.
If I'm just looking at recorded audio within a DAW, it is simple and intuitive to spot where silence (nobody talking) is within the waveform. Basically, there's a bunch of relatively flat horizontal lines that are closer to negative infinity dB (absolute silence) than any of the other flat lines.
I have no problems using various sound APIs to access samples. I've omitted technical specifics because I'm just asking about a suitable algorithm, and even if an existing library were available, I'd prefer my own implementation.
What algorithm is good for classifying sample ranges as silence for the requirements I've described?
Here is one simple algorithm that has its pros and cons:
Perform the noise-floor calculation described below once. Or, to continuously respond to changing conditions during recording, you can perform this calculation at some reasonably responsive interval, like once every second.
Let K = a constant ratio of noise to signal, e.g. 10%, expected in worst cases.
Let A = the sample with the highest amplitude from the microphone audio stream.
Noise floor is A * K
Once the noise floor has been calculated, you can apply it to any range of the audio stream samples to classify values above the noise floor as signal and below the noise floor as noise.
With the above algorithm, samples are assumed to be stored in a typical computer audio format, e.g. PCM, where 0 is silence and a negative/positive sample value is air pressure creating sound. Negative samples can be negated to positive values when evaluating them.
K could be something like 10%. It's the noise/signal ratio expected in one of the poorest recording environments you'd like to support. Analyzing test recordings will show what the ratio should be. The higher the ratio, the more noise will be misclassified as signal by the algorithm.
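Sketched in Python (assuming the samples are already in a NumPy float array; the names are mine):

```python
import numpy as np

K = 0.10  # worst-case noise/signal ratio, the fudge factor described above

def classify(samples, k=K):
    """Return a boolean mask: True where a sample is above the noise floor."""
    amplitude = np.abs(samples)        # negate negative samples
    noise_floor = amplitude.max() * k  # A * K, one O(n) pass for the peak
    return amplitude > noise_floor
```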
Pros:
Easy to implement.
Computationally inexpensive. O(n) for a single pass over a sample array to find the highest peak value.
Cons:
It depends on the samples used to calculate the noise floor having signal (speech) in them. So there has to be some way, outside the algorithm, of knowing that the samples contain signal.
Any loud noise that isn't speech but has a higher amplitude, e.g. a hand clap, can raise the noise floor above the speech level, causing speech to be misclassified as noise.
The K value is a fudge factor. There are more direct ways to detect the actual noise floor from the samples.
How would I take a song input and output the same song without certain frequencies?
Based on my research so far, the song should be broken down into chunks, each chunk FFT'd, the target frequencies reduced, the result inverse-FFT'd, and the chunks stitched back together. However, I am unsure if this is the right approach to take, and if so, how I would convert from the audio to the FFT input (what seems to be a vector/matrix), how to reduce the target frequencies, how to convert back from the FFT output to audio, and how to restitch.
For background, my grandpa loves music. However, recently he cannot listen to it, as he has become hypersensitive to certain frequencies within songs. I'm a high school student who has some background in coding, and I'm just getting into algorithmic work, so I have very little experience using these algorithms. Please excuse me if these are basic questions; any pointers would be helpful.
EDIT: So far, I've understood the basic premise of the FFT (through basic 3blue1brown YouTube videos and the like) and that it is available through scipy/numpy, and I've figured out how to convert from YouTube to 0.25-second chunks in WAV format.
Your approach is right.
Concerning your subquestions:
from the audio to FFT input: assign the audio samples to the real part of a complex signal; the imaginary part is zero
how to reduce the target frequencies: multiply the FFT results near the target frequency by a smooth function (to diminish artifacts) that becomes zero at that frequency and reaches 1.0 some samples away
how to convert back: just apply the inverse FFT (don't forget the scale multiplier, like 1/N) and copy the real part into the audio channel
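As a rough NumPy sketch of those three steps (the function name and the raised-cosine dip are my own choices; `rfft` handles the real-signal bookkeeping, and `irfft` applies the 1/N scaling for you):

```python
import numpy as np

def notch_chunk(chunk, fs, f0, width_hz=50.0):
    """Attenuate frequencies around f0 with a smooth dip to reduce artifacts."""
    spectrum = np.fft.rfft(chunk)                    # audio -> frequency domain
    freqs = np.fft.rfftfreq(len(chunk), d=1.0 / fs)  # center frequency per bin
    # Smooth mask: 0 at f0, rising to 1.0 at +/- width_hz, 1.0 elsewhere.
    d = np.clip(np.abs(freqs - f0) / width_hz, 0.0, 1.0)
    mask = 0.5 - 0.5 * np.cos(np.pi * d)
    return np.fft.irfft(spectrum * mask, n=len(chunk))  # back to audio samples
```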
Also consider using simpler digital filtering: a band-stop or notch filter.
Two arbitrarily-found examples: example1 example2
(calculating the parameters probably requires some understanding of DSP)
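For instance, SciPy ships a ready-made notch design; a minimal sketch, with the sample rate and target frequency assumed:

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 44100.0                        # sample rate of the song (assumed)
f0 = 3000.0                         # frequency to suppress (hypothetical)
t = np.arange(int(fs)) / fs
samples = np.sin(2 * np.pi * 440.0 * t) + np.sin(2 * np.pi * f0 * t)

b, a = iirnotch(f0, Q=30.0, fs=fs)  # narrow band-stop centered on f0
filtered = filtfilt(b, a, samples)  # zero-phase filtering, no time shift
```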
Current situation:
I have implemented a particle filter for an indoor localisation system. It uses fingerprints of the magnetic field. The implementation of the particle filter is pretty straightforward:
1. Create all particles uniformly distributed over the entire area.
2. Give each particle a velocity (Gaussian distributed, with the mean at a normal walking speed) and a direction (uniformly distributed over all directions).
3. Change the velocity and direction (both Gaussian distributed).
4. Move all particles in the given direction by the velocity multiplied by the time difference between the last and the current measurement.
5. Find the closest fingerprint to each particle.
6. Calculate the new weight of each particle by comparing the closest fingerprint with the given measurement.
7. Normalize.
8. Resample.
9. Repeat #3 to #8 for every measurement (a rough sketch of one update is shown below).
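In NumPy-style code, one update (steps 3 to 8) looks roughly like this; all names, noise scales, and the Gaussian weight model are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def update(particles, dt, measurement, fingerprints, fp_positions, sigma=1.0):
    """One filter update for particles with columns [x, y, velocity, theta]."""
    n = len(particles)
    particles[:, 2] += rng.normal(0.0, 0.1, n)   # 3: perturb velocity
    particles[:, 3] += rng.normal(0.0, 0.2, n)   # 3: perturb direction
    particles[:, 0] += particles[:, 2] * dt * np.cos(particles[:, 3])  # 4: move
    particles[:, 1] += particles[:, 2] * dt * np.sin(particles[:, 3])
    # 5: closest fingerprint per particle (brute force over the fingerprint map)
    dist = np.linalg.norm(particles[:, None, :2] - fp_positions[None], axis=2)
    nearest = fingerprints[np.argmin(dist, axis=1)]
    # 6: Gaussian likelihood of the measurement given each nearest fingerprint
    w = np.exp(-0.5 * ((nearest - measurement) / sigma) ** 2)
    w /= w.sum()                                 # 7: normalize
    # 8: systematic resampling
    idx = np.searchsorted(np.cumsum(w), (rng.random() + np.arange(n)) / n)
    return particles[np.minimum(idx, n - 1)]
```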
The problem:
Now I would like to do basically the same thing but add another sensor to the system (namely WiFi measurements). If the measurements arrived at the same time, there wouldn't be a problem: I would just calculate the probability for the first sensor and multiply it by the probability of the second sensor to get my weight for the particle at #6.
But the magnetic field sensor has a very high sample rate (about 100 Hz), while a WiFi measurement arrives only roughly every second.
I don't know what would be the best way to handle the problem.
Possible solutions:
I could throw away (or average) all the magnetic field measurements until a WiFi measurement arrives and then use the last magnetic field measurement (or the average) together with the WiFi signal. So basically I reduce the sample rate of the magnetic field sensor to the rate of the WiFi sensor.
For every magnetic field measurement I use the last seen WiFi measurement
I use the sensors separately. That means if I get a measurement from one sensor, I do all the steps #3 to #8 without using any measurement data from the other sensor.
Any other solution I haven't thought about ;)
I'm not sure which would be the best solution; none of them seems really good.
With #1 I would say I'm losing information, although I'm not sure it makes sense to use a sample rate of about 100 Hz in a particle filter anyway.
With #2 I have to assume that the WiFi signal doesn't change quickly, which I can't guarantee.
If I use the sensors separately, the magnetic field measurements become more important than the WiFi measurements, since steps #3 to #8 will have run 100 times on magnetic data by the time one WiFi measurement arrives.
Do you know a good paper that deals with this problem?
Is there already a standard solution for handling multiple sensors with different sample rates in a particle filter?
Does a sample rate of 100 Hz make sense? Or what would be a proper time step for one iteration of the particle filter?
Thank you very much for any kind of hint or solution :)
In #2, instead of using sample-and-hold, you could delay the filter by 1 s and interpolate between WiFi measurements in order to up-sample, so that you have both signals at 100 Hz.
If you know more about the WiFi behavior, you could use something more advanced than linear interpolation to model it between updates. These folks use a more advanced asynchronous hold to up-sample the slower sensor signal, but something like a Kalman filter might also work.
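A minimal sketch of the linear-interpolation variant, with hypothetical timestamp arrays:

```python
import numpy as np

# Timestamps (s) and values of the ~1 Hz WiFi measurements (hypothetical data).
wifi_t = np.array([0.0, 1.0, 2.0])
wifi_rssi = np.array([-60.0, -58.0, -63.0])
# Timestamps of the 100 Hz magnetic field samples over the same interval.
mag_t = np.arange(0.0, 2.0, 0.01)

# Delay processing until the next WiFi sample exists, then interpolate so
# both sensors contribute a value at every 100 Hz filter step.
wifi_upsampled = np.interp(mag_t, wifi_t, wifi_rssi)
```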
With regard to update speed, I think 100 Hz sounds high for your application (assuming you are doing positioning of a human walking indoors), since you are likely to take a lot of noise into account. Lowering the sampling frequency is a cheap way to filter out high-frequency noise.
Can anybody explain to me (in simplified terms) what happens when I do an image comparison with an FFT? I somehow don't understand how it's possible to convert a picture into frequencies, and how this is used to differentiate between two images. Via Google I cannot find a simple description that I (as a non-mathematician/non-computer-scientist) could understand.
Any help would be very appreciated!
Thanks!
Alas, a good description of an FFT might involve subjects such as the calculus of complex variables and the computational theory of recursive algorithms. So a simple description may not be very accurate.
Think about sound. Looking at the waveforms of the sound produced by two singers might not tell you much; the two waveforms would just be complicated, long, messy-looking squiggles. But a frequency meter could quickly tell you that one person was singing way off pitch and whether they were a soprano or a bass. So you might be able to determine from the frequency meter readings that certain waveforms did not indicate a good match for who was singing.
An FFT is like a big bunch of frequency meters. And each scan line of a photo is a waveform.
Around two centuries ago, a guy named Fourier proved that any reasonable-looking waveform squiggle could be matched by an appropriate bunch of plain sine waves, each at a single frequency. Several decades ago, other people figured out a very clever way of very quickly calculating exactly which bunch of sine waves that was: the FFT.
A discrete FFT transforms a 2D matrix of, let's say, pixel values into a 2D matrix in the frequency domain. You can use a library like FFTW to convert an image from its ordinary form to the spectral one. The result of your comparison depends on what you really compare.
The Fourier transform works in dimensions other than 2D as well, but for images you'll be interested in the 2D FFT.
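One simple way to compare (a NumPy sketch; the distance measure is my own choice, and what you should actually compare depends on your application):

```python
import numpy as np

def spectral_distance(img_a, img_b):
    """Compare two same-sized grayscale images by their magnitude spectra."""
    # The magnitude spectrum ignores phase, so small translations of the
    # image content barely affect this distance.
    mag_a = np.abs(np.fft.fft2(img_a))
    mag_b = np.abs(np.fft.fft2(img_b))
    return np.linalg.norm(mag_a - mag_b) / mag_a.size
```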
I'd like to know if there is any good (and freely available) text on how to obtain motion vectors of macroblocks in a raw video stream. This is often used in video compression, although my application of it is not video encoding.
Code that does this is available in OSS codecs, but understanding the method by reading the code is kinda hard.
My actual goal is to determine camera motion in 2D projection space, assuming the camera is only changing its orientation (NOT its position). What I'd like to do is divide the frames into macroblocks, obtain their motion vectors, and get the camera motion by averaging those vectors.
I guess OpenCV could help with this problem, but it's not available on my target platform.
The usual way is simple brute force: compare a macroblock to each candidate block from the reference frame and use the one that gives the smallest residual error. The code gets complex primarily because this is usually the slowest part of MV-based compression, so a lot of work goes into optimizing it, often at the expense of anything even approaching readability.
Especially for real-time compression, some reduce the workload a bit by (for example) restricting the search to the original position +/- some maximum delta. This can often gain quite a bit of compression speed in exchange for a fairly small loss of compression.
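A bare-bones sketch of that restricted search (the names and the 16x16/±8 choices are mine; real encoders use much faster search patterns):

```python
import numpy as np

def best_motion_vector(ref, cur, bx, by, bs=16, delta=8):
    """Brute-force best (dx, dy) for the bs x bs block at (bx, by) in `cur`."""
    block = cur[by:by + bs, bx:bx + bs].astype(np.int32)
    best, best_sad = (0, 0), np.inf
    for dy in range(-delta, delta + 1):        # search original position +/- delta
        for dx in range(-delta, delta + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bs > ref.shape[0] or x + bs > ref.shape[1]:
                continue                       # candidate block leaves the frame
            cand = ref[y:y + bs, x:x + bs].astype(np.int32)
            sad = np.abs(block - cand).sum()   # sum of absolute differences
            if sad < best_sad:
                best, best_sad = (dx, dy), sad
    return best
```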
If you assume only camera motion, I suspect something is possible with analysis of the FFT of successive images. For frequencies whose amplitudes have not changed much, the phase information will indicate the camera motion. I'm not sure if this will help with camera rotation, but lateral and vertical motion can probably be computed. There will be difficulties due to new information appearing on one edge and disappearing on the other, and I'm not sure how much that will hurt. This is speculative thinking in response to your question, so I have no proof or references :-)
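One concrete realization of this idea is phase correlation, which recovers a global translation (not rotation) between two frames; a sketch:

```python
import numpy as np

def phase_correlation_shift(prev, cur):
    """Estimate the global (dy, dx) translation between two same-sized frames."""
    F1, F2 = np.fft.fft2(prev), np.fft.fft2(cur)
    cross = F1 * np.conj(F2)
    cross /= np.abs(cross) + 1e-9          # keep only the phase information
    peak = np.abs(np.fft.ifft2(cross))     # sharp peak at the shift offset
    dy, dx = np.unravel_index(np.argmax(peak), peak.shape)
    # Peaks past the midpoint correspond to negative shifts (FFT wrap-around).
    if dy > prev.shape[0] // 2:
        dy -= prev.shape[0]
    if dx > prev.shape[1] // 2:
        dx -= prev.shape[1]
    return dy, dx
```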
Sounds like you're doing a very limited SLAM project?
There's lots of reading matter at Bristol University, Imperial College, and Oxford University, for example. You might find their approaches to finding and matching candidate features from frame to frame of interest; they are much more robust than simple sums of absolute differences.
For the most low-level algorithms of this type, the term you are looking for is optical flow, and one of the easiest algorithms of that class is the Lucas-Kanade algorithm.
This is a pretty good overview presentation that should give you plenty of ideas for an algorithm that does what you need
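Since OpenCV isn't available on your platform: the single-window Lucas-Kanade step is small enough to write directly in NumPy. A minimal sketch (the function name and window size are my own choices):

```python
import numpy as np

def lucas_kanade(prev, cur, y, x, win=15):
    """Least-squares flow (dx, dy) for the win x win window centered at (y, x)."""
    h = win // 2
    p = prev[y - h:y + h + 1, x - h:x + h + 1].astype(np.float64)
    c = cur[y - h:y + h + 1, x - h:x + h + 1].astype(np.float64)
    Iy, Ix = np.gradient(p)                  # spatial image derivatives
    It = c - p                               # temporal derivative
    # Solve [Ix Iy] * [dx dy]^T = -It in the least-squares sense.
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (dx, dy), *_ = np.linalg.lstsq(A, b, rcond=None)
    return dx, dy
```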