I would like to find the time instant at which a certain value is reached in a time-series data with noise. If there are no peaks in the data, I could do the following in MATLAB.
Code from here
% create example data
d=1:100;
t=d/100;
ts = timeseries(d,t);
% define threshold
thr = 55;
data = ts.data(:);
time = ts.time(:);
ind = find(data>thr,1,'first');
time(ind) %time where data>threshold
But when there is noise, I am not sure what has to be done.
In the time-series data plotted in the above image, I want to find the time instant at which the y-axis value 5 is reached. The data actually stabilizes to 5 at t >= 100 s. But due to the presence of noise in the data, we see a peak that reaches 5 somewhere around 20 s. I would like to know how to detect e.g. 100 s as the right time and not 20 s. The code posted above will only give 20 s as the answer. I saw a post here that explains using a sliding window to find when the data equilibrates. However, I am not sure how to implement the same. Suggestions will be really helpful.
The sample data plotted in the above image can be found here
Suggestions on how to implement in Python or MATLAB code will be really helpful.
EDIT:
I don't want to capture when the peak (/noise/overshoot) occurs. I want to find the time when equilibrium is reached. For example, around 20 s the curve rises and dips below 5. After ~100 s the curve equilibrates to a steady-state value 5 and never dips or peaks.
Precise data analysis is a serious business (and my passion) that involves a lot of understanding of the system you are studying. Here are some comments. Unfortunately, I doubt there is a simple nice answer to your problem at all -- you will have to think about it. Data analysis basically always requires "discussion".
First to your data and problem in general:
When you talk about noise, in data analysis this means a statistical random fluctuation. Most often it is Gaussian (sometimes also other distributions, e.g. Poisson). Gaussian noise is a) random in each bin and b) symmetric in negative and positive direction. Thus, what you observe in the peak at ~20s is not noise. It has very different, very systematic and extended characteristics compared to random noise. This is an "artifact" that must have an origin, but about which we can only speculate here. In real-world applications, studying and removing such artifacts is the most expensive and time-consuming task.
Looking at your data, the random noise is negligible. This is very precise data. For example, after ~150s there are no visible random fluctuations up to the fourth decimal place.
After concluding that this is not noise in the common sense, it could be at least two things: a) a feature of the system you are studying, thus, something for which you could develop a model/formula and which you could "fit" to the data. b) a characteristic of limited bandwidth somewhere in the measurement chain, thus, here a high-frequency cutoff. See e.g. https://en.wikipedia.org/wiki/Ringing_artifacts . Unfortunately, for both a and b there are no catch-all generic solutions. And your problem description (even with code and data) is not sufficient to propose an ideal approach.
After spending ~one hour on your data and making some plots, I believe (speculate) that the extremely sharp feature at ~10s cannot be a "physical" property of the data. It simply is too extreme/steep. Something fundamental happened here. A guess of mine could be that some device was just switched on (was off before). Thus, the data before is meaningless, and there is a short period of time afterwards to stabilize the system. There is not really an alternative in this scenario but to entirely discard the data until the system has stabilized at around 40s. This also makes your problem trivial. Just delete the first 40s, then the maximum becomes evident.
So what are the technical solutions you could use? Please don't be too upset that you have to think about this yourself and assemble the best possible solution for your case. I copied your data into two numpy arrays x and y and ran the following tests in python:
Remove unstable time
This is the trivial solution -- I prefer it.
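import numpy as np
import matplotlib.pyplot as plt

plt.figure()
plt.xlabel('time')
plt.ylabel('signal')
plt.plot(x, y, label="original")
y_cut = y.copy()   # copy, so that the original y is not overwritten
y_cut[x < 40] = 0  # zero everything before t = 40 s (x holds the time values)
plt.plot(x, y_cut, label="cut 40s")
plt.legend()
plt.grid()
plt.show()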
Note: carry on reading below only if you are a bit crazy (about data).
Sliding window
You mentioned "sliding window" which is best suited for random noise (which you don't have) or periodic fluctuations (which you also don't really have). Sliding window just averages over consecutive bins, averaging out random fluctuations. Mathematically this is a convolution.
Technically, you can actually solve your problem like this (try even larger values of Nwindow yourself):
Nwindow=10
y_slide_10 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
Nwindow=20
y_slide_20 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
Nwindow=30
y_slide_30 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
plt.xlabel('time')
plt.ylabel('signal')
plt.plot(x,y, label="original")
plt.plot(x,y_slide_10, label="window=10")
plt.plot(x,y_slide_20, label='window=20')
plt.plot(x,y_slide_30, label='window=30')
plt.legend()
#plt.xscale('log') # useful
plt.grid()
plt.show()
Thus, technically you can succeed in suppressing the initial "hump". But don't forget this is a hand-tuned and not general solution...
Another caveat of any sliding window solution: this always distorts your timing. Since you average over an interval in time, your convolved trace is shifted back or forth in time depending on whether the signal is rising or falling (slightly, but significantly). In your particular case this is not a problem since the main signal region has basically no time-dependence (very flat).
Frequency domain
This should be the silver bullet, but it also does not work well/easily for your example. The fact that this doesn't work better is the main hint to me that the first 40s of data are better discarded.... (i.e. in a scientific work)
You can use fast Fourier transform to inspect your data in frequency-domain.
import numpy as np
import scipy.fft
y_fft = scipy.fft.rfft(y)
# original frequency domain plot (magnitude, since y_fft is complex)
plt.plot(np.abs(y_fft), label="original")
plt.xlabel('frequency')
plt.ylabel('signal')
plt.yscale('log')
plt.show()
The structure in frequency represents the features of your data. The peak at zero is the stabilized region after ~100s; the humps are associated with (rapid) changes in time. You can now play around and change the frequency spectrum (--> filter) but I think the spectrum is so artificial that this doesn't yield great results here. Try it with other data and you may be very impressed! I tried two things: first, cut high-frequency regions out (set to zero), and second, apply a sliding-window filter in frequency domain (sparing the peak at 0, since this cannot be touched. Try and you know why).
# cut high-frequency by setting to zero
y_fft_2 = np.array(y_fft)
y_fft_2[50:70] = 0
# sliding window in frequency
Nwindow = 15
Start = 10
y_fft_slide = np.array(y_fft)
y_fft_slide[Start:] = np.convolve(y_fft[Start:], np.ones((Nwindow,))/Nwindow, mode='same')
# frequency-domain plot (magnitudes, since the FFT values are complex)
plt.plot(np.abs(y_fft), label="original")
plt.plot(np.abs(y_fft_2), label="high-frequency, filter")
plt.plot(np.abs(y_fft_slide), label="frequency sliding window")
plt.xlabel('frequency')
plt.ylabel('signal')
plt.yscale('log')
plt.legend()
plt.show()
Converting this back into time-domain:
# reverse FFT into time-domain for plotting
y_filtered = scipy.fft.irfft(y_fft_2, n=len(y))  # n keeps the original length, also for odd len(y)
y_filtered_slide = scipy.fft.irfft(y_fft_slide, n=len(y))
# time-domain plot
plt.plot(x[:500], y[:500], label="original")
plt.plot(x[:500], y_filtered[:500], label="high-f filtered")
plt.plot(x[:500], y_filtered_slide[:500], label="frequency sliding window")
# plt.xscale('log') # useful
plt.grid()
plt.legend()
plt.show()
yields
There are apparent oscillations in those solutions which make them essentially useless for your purpose. This leads me to my final exercise: to again apply a sliding-window filter, this time on the "frequency sliding window" time-domain trace:
# extra time-domain sliding window
Nwindow=90
y_fft_90 = np.convolve(y_filtered_slide, np.ones((Nwindow,))/Nwindow, mode='same')
# final time-domain plot
plt.plot(x[:500], y[:500], label="original")
plt.plot(x[:500], y_fft_90[:500], label="frequency-sliding window, slide")
# plt.xscale('log') # useful
plt.legend()
plt.show()
I am quite happy with this result, but it still has very small oscillations and thus does not solve your original problem.
Conclusion
How much fun. One hour well wasted. Maybe it is useful to someone. Maybe even to you, Natasha. Please don't be mad at me...
Let's assume your data is in data variable and time indices are in time. Then
import numpy as np
threshold = 0.025
stable_index = np.where(np.abs(data[-1] - data) > threshold)[0][-1] + 1
print('Stabilizes after', time[stable_index], 'sec')
Stabilizes after 96.6 sec
Here data[-1] - data is a difference between last value of data and all the data values. The assumption here is that the last value of data represents the equilibrium point.
np.where( * > threshold )[0] gives all the indices at which data differs from the final value by more than the threshold, that is, where it has still not stabilized. We take only the last such index. The next one is where the time series is considered stabilized, hence the + 1.
If you're dealing with deterministic data which is eventually converging monotonically to some fixed value, the problem is pretty straightforward. Your last observation should be the closest to the limit, so you can define an acceptable tolerance threshold relative to that last data point and scan your data from back to front to find where you exceeded your threshold.
Things get a lot nastier once you add random noise into the picture, particularly if there is serial correlation. This problem is common in simulation modeling (see (*) below), and is known as the issue of initial bias. It was first identified by Conway in 1963, and has been an active area of research since then with no universally accepted definitive answer on how to deal with it. As with the deterministic case, the most widely accepted answers approach the problem starting from the right-hand side of the data set since this is where the data are most likely to be in steady state. Techniques based on this approach use the end of the dataset to establish some sort of statistical yardstick or baseline to measure where the data start looking significantly different as observations get added by moving towards the front of the dataset. This is greatly complicated by the presence of serial correlation.
If a time series is in steady state, in the sense of being covariance stationary, then a simple average of the data is an unbiased estimate of its expected value, but the standard error of the estimated mean depends heavily on the serial correlation. The correct standard error squared is no longer s^2/n, but instead it is (s^2/n)*W, where W is a properly weighted sum of the autocorrelation values. A method called MSER was developed in the 1990s, and avoids the issue of trying to correctly estimate W by trying to determine where the standard error is minimized. It treats W as a de facto constant given a sufficiently large sample size, so if you consider the ratio of two standard error estimates the W's cancel out and the minimum occurs where s^2/n is minimized. MSER proceeds as follows:
1) Starting from the end, calculate s^2 for half of the data set to establish a baseline.
2) Now update the estimate of s^2 one observation at a time using an efficient technique such as Welford's online algorithm, and calculate s^2/n, where n is the number of observations tallied so far. Track which value of n yields the smallest s^2/n. Lather, rinse, repeat.
3) Once you've traversed the entire data set from back to front, the n which yielded the smallest s^2/n is the number of observations from the end of the data set which are not detectable as being biased by the starting conditions.
Justification - with a sufficiently large baseline (half your data), s^2/n should be relatively stable as long as the time series remains in steady state. Since n is monotonically increasing, s^2/n should continue decreasing subject to the limitations of its variability as an estimate. However, once you start acquiring observations which are not in steady state, the drift in mean and variance will inflate the numerator of s^2/n. Hence the minimal value corresponds to the last observation where there was no indication of non-stationarity. More details can be found in this proceedings paper. A Ruby implementation is available on BitBucket.
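For readers who prefer Python, here is a minimal sketch of one reading of the steps above. The function name mser_truncation and the synthetic test series are my own illustration, not taken from the paper or the Ruby implementation, so treat it as a starting point rather than a reference implementation.

```python
import numpy as np

def mser_truncation(data):
    """Estimate the index at which the steady-state tail of `data` starts.

    Scans from the end towards the front, updating mean and variance with
    Welford's online algorithm, and tracks the tail length n that
    minimizes s^2 / n.
    """
    data = np.asarray(data, dtype=float)
    total = len(data)
    baseline = total // 2          # observations tallied before tracking starts
    mean, m2 = 0.0, 0.0            # running mean and sum of squared deviations
    best_n, best_score = total, np.inf
    for n, value in enumerate(data[::-1], start=1):
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        if n >= max(baseline, 2):
            score = (m2 / (n - 1)) / n     # s^2 / n for the last n observations
            if score < best_score:
                best_n, best_score = n, score
    return total - best_n          # first index of the retained tail

# Hypothetical example: a 200-sample transient followed by noisy steady state
rng = np.random.default_rng(0)
series = np.concatenate([np.linspace(10, 0, 200), np.zeros(800)])
series = series + rng.normal(0, 0.1, series.size)
start = mser_truncation(series)
print("steady state estimated to start at index", start)
```

The returned index can be used to slice both the data and the time array, e.g. time[start] would be the estimated settling time.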
Your data has such a small amount of variation that MSER concludes that it is still converging to steady state. As such, I'd advise going with the deterministic approach outlined in the first paragraph. If you have noisy data in the future, I'd definitely suggest giving MSER a shot.
(*) - In a nutshell, a simulation model is a computer program and hence has to have its state set to some set of initial values. We generally don't know what the system state will look like in the long run, so we initialize it to an arbitrary but convenient set of values and then let the system "warm up". The problem is that the initial results of the simulation are not typical of the steady state behaviors, so including that data in your analyses will bias them. The solution is to remove the biased portion of the data, but how much should that be?
I have sampled sensor data for 1 minute with 5kHz sampling.
So, one sampled data file includes 5,000 x 60 = 300,000 data points.
Note that the sensor measures periodic data such as 60Hz AC current.
Now, I would like to apply FFT (using the Python numpy.fft.rfft function) to the one data file.
As far as I know, the number of FFT results is about half of the number of input data points, i.e., about 150,000 FFT results in the case of 300,000 data points.
However, the number of FFT results is too large to analyze them.
So, I would like to reduce the number of FFT results.
Regarding that, my question is: is the following method valid, given the one sampled data file?
1) Segment the one sampled data file into M segments
2) Apply FFT to each segment
3) Average the M FFT results to get one averaged FFT result
4) Use the averaged FFT result as the FFT result of the given sampled data file
Thank you in advance.
It depends on your purposes.
If the source signal is sampled at 5 kHz, then the last output element corresponds to 2.5 kHz. So for an output length of 150K the frequency resolution will be about 0.017 Hz. If you apply the transform to 3000 data points, you'll get a frequency resolution of about 1.7 Hz.
Is this important for you? Do you need to register all possible frequency components of AC current?
AC quality (magnitude, frequency, noise) might vary during one-minute interval. Do you need to register such instability?
Perhaps high frequency resolution and short-range temporal stability are not necessary for AC control; in this case your approach is quite good.
Edit: A longer interval also diminishes the finite-duration signal windowing effect that gives false peaks.
P.S. Note that fast Fourier transform implementations often (not always; I don't see such a requirement in the rfft description) work with interval length = 2^N; if the input were padded to such a length, the output here might contain 256K points. NumPy's rfft accepts any input length and returns n/2 + 1 values.
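A minimal sketch of the segment-and-average idea, assuming a synthetic 60 Hz tone as a stand-in for the sensor data. The helper name averaged_spectrum is hypothetical. Note one choice I am making that the question does not specify: the magnitudes are averaged, not the complex FFT values, because complex values from segments with different phases can cancel each other.

```python
import numpy as np

fs = 5000                      # sampling rate in Hz (from the question)
duration = 60                  # seconds of data
t = np.arange(fs * duration) / fs
# Hypothetical stand-in for the sensor data: a 60 Hz tone plus a little noise
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 60 * t) + 0.1 * rng.standard_normal(t.size)

def averaged_spectrum(x, fs, n_segments):
    """Split x into equal segments and average the FFT magnitudes."""
    seg_len = len(x) // n_segments
    segments = x[: seg_len * n_segments].reshape(n_segments, seg_len)
    magnitudes = np.abs(np.fft.rfft(segments, axis=1)) / seg_len
    freqs = np.fft.rfftfreq(seg_len, d=1 / fs)
    return freqs, magnitudes.mean(axis=0)

freqs, spectrum = averaged_spectrum(signal, fs, n_segments=100)
resolution = freqs[1] - freqs[0]          # equals fs / seg_len
peak_freq = freqs[np.argmax(spectrum)]
print(len(freqs), resolution, peak_freq)
```

With 100 segments of 3000 points each, the spectrum shrinks to 1501 values at the cost of the coarser frequency resolution discussed above.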
I am doing some interesting experiments with audio and image files and Fast-Fourier Transforms (FFTs).
Fast Fourier Transforms are used in signal processing rather than other Fourier Transform algorithms because for large quantities of data they are the only (or one of the only) viable algorithm variants to use, as they scale as O(n log(n)), rather than n^2 as the naive implementation does.
The disadvantage is that the data must be stored in an array which has 2^n elements, for n integer.
When processing some data which does not have 2^n elements, the simple approach is to extend the array to be length 2^n and fill the "empty" elements with zero. (Assuming the mean value of the input signal is zero.)
I wrote a program to process some audio samples taken from WAV files. I tried implementing things such as a low-cut filter. In this case I found that my output signal (after doing the reverse transform) cuts to zero amplitude after a certain period of time. This is obviously not what one would expect of a low-pass filter.
I could dump my code at this point, but that is neither useful, nor legal as the source of my algorithm is a text-book with closed source code.
Instead I shall ask the following question.
Is packing out the array with zeros the best possible thing to do? Could this be causing my program to produce the unexpected results I am seeing? If I understand Fourier mathematics correctly, having a bunch of zeros at the end of my array will introduce a large amount of low and high-frequency content as this essentially looks like a step-function (low frequency square wave). Should I be doing something else such as implementing my band-pass filter in a different way, for example, splitting the data into smaller groups of say 1024 samples and applying the FT, filter and IFT (inverse FT) to those small groups?
This question has been tagged with theory as it is not related to any specific programming language. (I assume that is the correct tag to use?)
Edit: It's now working beautifully, thanks all, I was able to pinpoint the 2 mistakes I made using the information below.
All finite-length DFTs and FFTs effectively multiply longer data (source data or a wav file longer than the FFT) by a rectangular window, which convolves the spectrum with a (periodic) sinc function. Zero padding uses a shorter rectangular window, which results in the convolution of the spectrum with a wider sinc function.
Filtering by multiplication of FFTs results in circular convolution, which wraps the impulse response of the filter around the FFT/IFFT result (e.g. the end of your filtered signal will interfere with the beginning of the filtered signal within the IFFT result). So you want to zero-pad your data before the FFT, and then see the impulse response of your filter go to zero at or before the very end of the filtered result (e.g. not wrap around). Look up the overlap-add and overlap-save algorithms, for using short FFTs for fast convolution filtering of longer signals, which take care of the filter impulse response extending into the zero-padded portion.
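To make the wrap-around concrete, here is a small sketch with NumPy. The random input and the Hann-shaped low-pass kernel are made up for illustration. Padding to len(signal) + len(kernel) - 1 makes the FFT product match ordinary linear convolution, while the unpadded version lets the filter tail wrap into the start of the result.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)        # hypothetical input samples
kernel = np.hanning(31)
kernel /= kernel.sum()                    # simple low-pass FIR used as the filter

# Pad both to len(signal) + len(kernel) - 1 so that nothing wraps around
n_fft = len(signal) + len(kernel) - 1
filtered = np.fft.irfft(
    np.fft.rfft(signal, n_fft) * np.fft.rfft(kernel, n_fft), n_fft
)

# Without padding, the filter tail wraps into the start of the result
wrapped = np.fft.irfft(
    np.fft.rfft(signal) * np.fft.rfft(kernel, len(signal)), len(signal)
)

direct = np.convolve(signal, kernel)      # time-domain reference (full length)
print(np.allclose(filtered, direct))
print(np.allclose(wrapped, direct[: len(signal)]))
```

For long signals, scipy.signal.oaconvolve implements the overlap-add approach mentioned above.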
You can also use FFTs that are not a power of 2 in length. Any length that can be factored into small primes will work with most modern FFT libraries.
It depends what you are interested in.
If you are just interested in spectrum magnitude, then place the real data in the middle of the window to be processed. Just know that this time shift will put a phase shift into the spectrum result.
Regardless of the number of points, do not forget to place a window on your data. Wikipedia has a good write up on the windowing functions at https://en.wikipedia.org/wiki/Window_function.
If you do not perform some sort of windowing on your real world data, the padded signal will appear to have a step up and a step down at the end of the valid data (which puts a lot of noise into your spectrum giving you the false impression that you have a noise floor).
So, my recommendation, if you primarily care about magnitude:
- develop a Hamming window for the number of points of valid data you have.
- apply the Hamming window to the data you have
After that you have OPTIONS:
A) if your samples are slightly above a base two number, use the lower base two number (i.e. if you have 1400 points, do two 1024 point FFTs with overlap). The results of these two FFTs can be "smartly" combined for an aggregate spectrum. Depending on your fidelity needs, you can do this with more FFTs with a larger portion of overlapped data. Try to keep the overlap less than 10% to account for your window edges that will get attenuated by the start and end of the windowing functions.
B) place your windowed data anywhere in the FFT input vector (beginning, middle or end, it should only impact your phase results - which is why I asked if phase is important).
If it turns out phase is important, start your valid windowed data at the beginning of the FFT vector.
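A short sketch of option B with the data placed at the start of the FFT vector. The sample rate and the 440 Hz test tone are assumptions made up for the example. Only the valid samples are windowed, and the zeros are appended afterwards.

```python
import numpy as np

fs = 8000                                   # hypothetical sample rate in Hz
n_valid = 1400                              # number of valid samples
t = np.arange(n_valid) / fs
data = np.sin(2 * np.pi * 440 * t)          # hypothetical test tone

windowed = data * np.hamming(n_valid)       # taper the valid data only
n_fft = 2048                                # next power of two above 1400
padded = np.zeros(n_fft)
padded[:n_valid] = windowed                 # data at the start of the FFT vector

spectrum = np.abs(np.fft.rfft(padded))
freqs = np.fft.rfftfreq(n_fft, d=1 / fs)
peak_freq = freqs[np.argmax(spectrum)]
print(peak_freq)
```

The peak lands on the FFT bin closest to the tone, so it is only accurate to within one bin spacing (fs / n_fft).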
Regarding your spectrum observations (I just went through the same thing two weeks ago). If you are looking at a wave file converted from a lossy compression, you are going to be starting with a band limited signal, so expect the spectrum to do an abrupt drop. My first lossless wave file plot had a huge bald spot from Fs/10 -> 9Fs/10 (which is expected). For your plots - also display your data in logarithmic bins (linear bins will give you misleading info and squish the lower frequency elements which are the bulk of the signal in compressed music files).
FYI - I recommended Hamming (because I did the same thing). A decoded compressed audio signal will only use a portion of your spectrum (decoding a 320kbps stream is sampled at 10 kHz), even when decoded to a 44.1 kHz representation, all of the interesting data should be below 5 kHz.
Best of luck
J.R.
P.S. this is my first post here, chime back if you want some pretty pictures from TeraPlot.
This is a question for http://dsp.stackexchange.com but yes, zero-padding is perfectly legitimate here.
Here’s why the filtered signal (once it’s back in the time-domain) goes to zero after some time: imagine linearly-convolving the zero-padded signal with your low-pass filter’s impulse response (using the slow O(N^2) time-domain filter implementation). The output will go to zero after the original signal is done, when the filter is just being fed with zeros, right? That result will be the same as the output of FFT-based fast convolution. It’s perfectly normal. Just crop the output signal to the same length of the input and move on with your life.
Caveat on FFT orders: just because power-of-two FFT lengths are “the fastest” in terms of operation count, while FFTs of lengths with low prime factors (3, 5, 7) have slightly higher operation counts, you may find that zero-padding to a low-prime-factor is faster in terms of real-world runtime because of memory costs. A pathological example: if you have a 1025-long signal, you probably don’t want to zero-pad to 2048 and eat the cost of allocating a nearly 2x memory buffer, and running a nearly 2x longer FFT. You’d try 1080-length FFT or something (1080 = 2^3 * 3^3 * 5: nextprod is your friend) and wouldn’t be surprised if it completed much faster than power-of-two.
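In Python, SciPy offers a helper in the same spirit as nextprod. A minimal sketch, using the 1025-sample example from above:

```python
import scipy.fft

n_signal = 1025                                   # the pathological length from above
n_fast = scipy.fft.next_fast_len(n_signal)        # a nearby length with small prime factors
n_pow2 = 1 << (n_signal - 1).bit_length()         # next power of two
print(n_fast, n_pow2)
```

Which length is actually faster depends on the FFT library and the machine, so it is worth timing both.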
I'm trying to find the pitch of a guitar string. Sound is coming in through a mic at a sample rate of 44100 Hz. I'm using 2048 bytes for the buffer size. Considering the Nyquist rate, there is no point in using a bigger buffer size. After receiving the data, I apply a Hanning window... and this is the point where I get confused. Should I use a low-pass filter in the time domain or take the FFT first? If I take the FFT first, wouldn't it be easier to use just the first half of the samples, disregarding the other half, because I need frequencies in the range of 50-1000 Hz? After the FFT I will use the Harmonic Product Spectrum to find the fundamental frequency.
What you suggest makes some sense: if you don't need low frequencies you don't need to use long samples. With long samples you gain frequency resolution, which might be useful in some circumstances, but you lose time resolution (in the sense that successive samples are further apart).
A few things that don't make sense:
1) using a low-pass digital filter in the computation prior to the FFT (I'm assuming this is what you mean) just takes extra computation time and doesn't really gain you anything.
2) "Considering the Nyquist rate there is no point for using bigger buffer size": these aren't really related. The Nyquist rate determines the maximum frequency of the FFT, and the buffer size determines the frequency resolution, and therefore also the lowest frequency.
It really depends on your pitch detection algorithm, but why would you use a low-pass filter in the first place?
In addition, a guitar usually produces spectral information way beyond 1000Hz. Notes on the high E string easily produce harmonics at 4-5kHz and beyond, and these harmonics are exactly what will make your HPS nice and clear.
The less data used or the shorter your FFT, the lower the resulting FFT frequency resolution.
From what I read here, a guitar ranges from 82.4 Hz (open 6th string) to 659.2 Hz (12th fret on 1st string), and the difference between the lowest 2 notes is about 5 Hz.
If possible, I would apply an analog filter after the mic, but before the sampling circuit. Failing that, you would normally apply an FIR filter before shaping everything with the Hanning function. You could also use Decimation to reduce the sample rate, or simply choose a lower sample rate to start with.
Since you are doing an FFT anyway, simply throw away results above 1000 Hz. Sadly, lowering the sample rate does not let you get away with a shorter recording: frequency resolution is the sample rate divided by the number of samples, so it depends only on the duration of the buffer.
2048 samples at 44100 Hz will give the same resolution as 1024 samples at 22050 Hz.
Which is the same as 512 samples at 11025 Hz.
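The arithmetic can be checked in a few lines. The helper name bin_spacing is made up for this sketch, and the 5 Hz target is the gap between the two lowest notes quoted above.

```python
def bin_spacing(sample_rate_hz, n_samples):
    """Frequency distance between neighbouring FFT bins, in Hz."""
    return sample_rate_hz / n_samples

spacings = [bin_spacing(rate, n) for rate, n in
            [(44100, 2048), (22050, 1024), (11025, 512)]]
print(spacings)                       # all three are the same

# Samples needed at 44100 Hz for bins no wider than about 5 Hz
samples_needed = 44100 / 5
print(samples_needed)
```

So a 2048-sample buffer gives bins roughly 21.5 Hz wide, much coarser than the spacing of the lowest guitar notes.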
We use a data acquisition card to take readings from a device that increases its signal to a peak and then falls back to near the original value. To find the peak value we currently search the array for the highest reading and use the index to determine the timing of the peak value which is used in our calculations.
This works well if the highest value is the peak we are looking for but if the device is not working correctly we can see a second peak which can be higher than the initial peak. We take 10 readings a second from 16 devices over a 90 second period.
My initial thoughts are to cycle through the readings checking to see if the previous and next points are less than the current to find a peak and construct an array of peaks. Maybe we should be looking at an average of a number of points either side of the current position to allow for noise in the system. Is this the best way to proceed or are there better techniques?
We do use LabVIEW and I have checked the LAVA forums and there are a number of interesting examples. This is part of our test software and we are trying to avoid using too many non-standard VI libraries so I was hoping for feedback on the process/algorithms involved rather than specific code.
There are lots and lots of classic peak detection methods, any of which might work. You'll have to see what, in particular, bounds the quality of your data. Here are basic descriptions:
1) Between any two points in your data, (x(0), y(0)) and (x(n), y(n)), add up |y(i + 1) - y(i)| for 0 <= i < n and call this T ("travel") and set R ("rise") to y(n) - y(0) + k for suitably small k. T/R > 1 indicates a peak. This works OK if large travel due to noise is unlikely or if noise distributes symmetrically around a base curve shape. For your application, accept the earliest peak with a score above a given threshold, or analyze the curve of travel per rise values for more interesting properties.
2) Use matched filters to score similarity to a standard peak shape (essentially, use a normalized dot-product against some shape to get a cosine-metric of similarity).
3) Deconvolve against a standard peak shape and check for high values (though I often find 2 to be less sensitive to noise for simple instrumentation output).
4) Smooth the data and check for triplets of equally spaced points where, if x0 < x1 < x2, y1 > 0.5 * (y0 + y2), or check Euclidean distances like this: D((x0, y0), (x1, y1)) + D((x1, y1), (x2, y2)) > D((x0, y0),(x2, y2)), which relies on the triangle inequality. Using simple ratios will again provide you a scoring mechanism.
5) Fit a very simple 2-gaussian mixture model to your data (for example, Numerical Recipes has a nice ready-made chunk of code). Take the earlier peak. This will deal correctly with overlapping peaks.
6) Find the best match in the data to a simple Gaussian, Cauchy, Poisson, or what-have-you curve. Evaluate this curve over a broad range and subtract it from a copy of the data after noting its peak location. Repeat. Take the earliest peak whose model parameters (standard deviation probably, but some applications might care about kurtosis or other features) meet some criterion. Watch out for artifacts left behind when peaks are subtracted from the data.
Best match might be determined by the kind of match scoring suggested in #2 above.
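As an illustration of the first method, here is a rough Python sketch of the travel-per-rise score, taking travel as the sum of absolute step sizes. The function name and the use of the absolute rise (so that falling segments also give a positive score) are my assumptions, not part of the description above.

```python
import numpy as np

def travel_over_rise(y, k=1e-9):
    """Score a segment: about 1 for a monotone run, larger if it contains a peak."""
    y = np.asarray(y, dtype=float)
    travel = np.abs(np.diff(y)).sum()     # total up-and-down movement
    rise = abs(y[-1] - y[0]) + k          # net change, k avoids division by zero
    return travel / rise

ramp_score = travel_over_rise([0, 1, 2, 3, 4])
peak_score = travel_over_rise([0, 1, 2, 3, 2, 1, 0.5])
print(ramp_score, peak_score)
```

Sliding this score over windows of the readings and taking the earliest window above a chosen threshold would give the first peak rather than the highest one.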
I've done what you're doing before: finding peaks in DNA sequence data, finding peaks in derivatives estimated from measured curves, and finding peaks in histograms.
I encourage you to attend carefully to proper baselining. Wiener filtering or other filtering or simple histogram analysis is often an easy way to baseline in the presence of noise.
Finally, if your data is typically noisy and you're getting data off the card as unreferenced single-ended output (or even referenced, just not differential), and if you're averaging lots of observations into each data point, try sorting those observations and throwing away the first and last quartile and averaging what remains. There are a host of such outlier elimination tactics that can be really useful.
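The quartile-trimming idea in the last paragraph is only a few lines in Python. This is a sketch under the assumption that each data point is built from a small batch of raw observations. The function name is hypothetical.

```python
import numpy as np

def interquartile_mean(observations):
    """Sort, drop the lowest and highest quarter, and average the rest."""
    obs = np.sort(np.asarray(observations, dtype=float))
    q = len(obs) // 4
    return obs[q: len(obs) - q].mean()

# One wild reading barely affects the result
value = interquartile_mean([1, 2, 3, 4, 5, 6, 7, 1000])
print(value)
```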
You could try signal averaging, i.e. for each point, average the value with the surrounding 3 or more points. If the noise blips are huge, then even this may not help.
I realise that this was language agnostic, but guessing that you are using LabView, there are lots of pre-packaged signal processing VIs that come with LabView that you can use to do smoothing and noise reduction. The NI forums are a great place to get more specialised help on this sort of thing.
This problem has been studied in some detail.
There are a set of very up-to-date implementations in the TSpectrum* classes of ROOT (a nuclear/particle physics analysis tool). The code works in one- to three-dimensional data.
The ROOT source code is available, so you can grab this implementation if you want.
From the TSpectrum class documentation:
The algorithms used in this class have been published in the following references:
[1] M. Morhac et al.: Background elimination methods for multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 401 (1997) 113-132.
[2] M. Morhac et al.: Efficient one- and two-dimensional Gold deconvolution and its application to gamma-ray spectra decomposition. Nuclear Instruments and Methods in Physics Research A 401 (1997) 385-408.
[3] M. Morhac et al.: Identification of peaks in multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 443 (2000) 108-125.
The papers are linked from the class documentation for those of you who don't have a NIM online subscription.
The short version of what is done is that the histogram is flattened to eliminate noise, and then local maxima are detected by brute force in the flattened histogram.
I would like to contribute to this thread an algorithm that I have developed myself:
It is based on the principle of dispersion: if a new datapoint is a given x number of standard deviations away from some moving mean, the algorithm signals (also called z-score). The algorithm is very robust because it constructs a separate moving mean and deviation, such that signals do not corrupt the threshold. Future signals are therefore identified with approximately the same accuracy, regardless of the amount of previous signals. The algorithm takes 3 inputs: lag = the lag of the moving window, threshold = the z-score at which the algorithm signals and influence = the influence (between 0 and 1) of new signals on the mean and standard deviation. For example, a lag of 5 will use the last 5 observations to smooth the data. A threshold of 3.5 will signal if a datapoint is 3.5 standard deviations away from the moving mean. And an influence of 0.5 gives signals half of the influence that normal datapoints have. Likewise, an influence of 0 ignores signals completely for recalculating the new threshold: an influence of 0 is therefore the most robust option.
It works as follows:
Pseudocode
# Let y be a vector of timeseries data of at least length lag+2
# Let mean() be a function that calculates the mean
# Let std() be a function that calculates the standard deviation
# Let absolute() be the absolute value function
# Settings (the ones below are examples: choose what is best for your data)
set lag to 5; # lag 5 for the smoothing functions
set threshold to 3.5; # 3.5 standard deviations for signal
set influence to 0.5; # between 0 and 1, where 1 is normal influence, 0.5 is half
# Initialise variables
set signals to vector 0,...,0 of length of y; # Initialise signal results
set filteredY to y(1,...,lag) # Initialise filtered series
set avgFilter to null; # Initialise average filter
set stdFilter to null; # Initialise std. filter
set avgFilter(lag) to mean(y(1,...,lag)); # Initialise first value
set stdFilter(lag) to std(y(1,...,lag)); # Initialise first value
for i=lag+1,...,t do
if absolute(y(i) - avgFilter(i-1)) > threshold*stdFilter(i-1) then
if y(i) > avgFilter(i-1)
set signals(i) to +1; # Positive signal
else
set signals(i) to -1; # Negative signal
end
# Adjust the filters
set filteredY(i) to influence*y(i) + (1-influence)*filteredY(i-1);
set avgFilter(i) to mean(filteredY(i-lag,i),lag);
set stdFilter(i) to std(filteredY(i-lag,i),lag);
else
set signals(i) to 0; # No signal
# Adjust the filters
set filteredY(i) to y(i);
set avgFilter(i) to mean(filteredY(i-lag,i),lag);
set stdFilter(i) to std(filteredY(i-lag,i),lag);
end
end
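A possible Python translation of the pseudocode, as a sketch. The function name zscore_peaks and the sample series are made up for illustration. I am assuming that the moving window holds the last lag filtered values and that the population standard deviation (NumPy's default) is intended; adjust both if your reading of the pseudocode differs.

```python
import numpy as np

def zscore_peaks(y, lag, threshold, influence):
    """Return +1/-1/0 signals for each point of y, following the pseudocode."""
    y = np.asarray(y, dtype=float)
    signals = np.zeros(len(y), dtype=int)
    filtered = y.copy()                    # series used for the moving statistics
    avg = np.zeros(len(y))
    std = np.zeros(len(y))
    avg[lag - 1] = np.mean(y[:lag])
    std[lag - 1] = np.std(y[:lag])
    for i in range(lag, len(y)):
        if abs(y[i] - avg[i - 1]) > threshold * std[i - 1]:
            signals[i] = 1 if y[i] > avg[i - 1] else -1
            # Signals only partially enter the filtered series
            filtered[i] = influence * y[i] + (1 - influence) * filtered[i - 1]
        else:
            filtered[i] = y[i]
        window = filtered[i - lag + 1: i + 1]
        avg[i] = np.mean(window)
        std[i] = np.std(window)
    return signals

# Hypothetical readings with one spike at index 10
y = [1, 1.1, 0.9, 1, 1.05, 0.95, 1, 1.1, 0.9, 1, 5, 1, 1.05, 0.95]
signals = zscore_peaks(y, lag=5, threshold=3.5, influence=0)
print(signals)
```

With influence set to 0 the spike does not shift the moving mean, so the points after it are judged against the same baseline as the points before it.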
Demo
> For more information, see original answer
This method is basically from David Marr's book "Vision"
Gaussian blur your signal with the expected width of your peaks.
This gets rid of noise spikes and your phase data is undamaged.
Then edge detect (LoG, i.e. Laplacian of Gaussian, will do).
Then your edges are the edges of features (like peaks).
Look between edges for peaks, sort peaks by size, and you're done.
I have used variations on this and they work very well.
I think you want to cross-correlate your signal with an expected, exemplar signal. But, it has been such a long time since I studied signal processing and even then I didn't take much notice.
I don't know very much about instrumentation, so this might be totally impractical, but then again it might be a helpful different direction. If you know how the readings can fail, and there is a certain interval between peaks given such failures, why not do gradient descent at each interval. If the descent brings you back to an area you've searched before, you can abandon it. Depending upon the shape of the sampled surface, this also might help you find peaks faster than search.
Is there a qualitative difference between the desired peak and the unwanted second peak? If both peaks are "sharp" -- i.e. short in time duration -- when looking at the signal in the frequency domain (by doing FFT) you'll get energy at most bands. But if the "good" peak reliably has energy present at frequencies not existing in the "bad" peak, or vice versa, you may be able to automatically differentiate them that way.
You could apply some Standard Deviation to your logic and take notice of peaks over x%.