Finding the time in which a specific value is reached in time-series data when peaks are found - algorithm

I would like to find the time instant at which a certain value is reached in a time-series data with noise. If there are no peaks in the data, I could do the following in MATLAB.
Code from here
% create example data
d=1:100;
t=d/100;
ts = timeseries(d,t);
% define threshold
thr = 55;
data = ts.data(:);
time = ts.time(:);
ind = find(data>thr,1,'first');
time(ind) %time where data>threshold
But when there is noise, I am not sure what has to be done.
In the time-series data plotted in the above image I want to find the time instant at which the y-axis value 5 is reached. The data actually stabilizes to 5 at t>=100 s. But due to the presence of noise in the data, we see a peak that reaches 5 somewhere around 20 s . I would like to know how to detect e.g 100 seconds as the right time and not 20 s . The code posted above will only give 20 s as the answer. I
saw a post here that explains using a sliding window to find when the data equilibrates. However, I am not sure how to implement the same. Suggestions will be really helpful.
The sample data plotted in the above image can be found here
Suggestions on how to implement in Python or MATLAB code will be really helpful.
EDIT:
I don't want to capture when the peak (/noise/overshoot) occurs. I want to find the time when equilibrium is reached. For example, around 20 s the curve rises and dips below 5. After ~100 s the curve equilibrates to a steady-state value 5 and never dips or peaks.

Precise data analysis is a serious business (and my passion) that involves a lot of understanding of the system you are studying. Here are comments, unfortunately I doubt there is a simple nice answer to your problem at all -- you will have to think about it. Data analysis basically always requires "discussion".
First to your data and problem in general:
When you talk about noise, in data analysis this means a statistical random fluctuation. Most often Gaussian (sometimes also other distributions, e.g. Poission). Gaussian noise is a) random in each bin and b) symmetric in negative and positive direction. Thus, what you observe in the peak at ~20s is not noise. It has a very different, very systematic and extended characteristics compared to random noise. This is an "artifact" that must have a origin, but of which we can only speculate here. In real-world applications, studying and removing such artifacts is the most expensive and time-consuming task.
Looking at your data, the random noise is negligible. This is very precise data. For example, after ~150s and later there are no visible random fluctuations up to fourth decimal number.
After concluding that this is not noise in the common sense it could be a least two things: a) a feature of the system you are studying, thus, something where you could develop a model/formula for and which you could "fit" to the data. b) a characteristics of limited bandwidth somewhere in the measurement chain, thus, here a high-frequency cutoff. See e.g. https://en.wikipedia.org/wiki/Ringing_artifacts . Unfortunately, for both, a and b, there are no catch-all generic solutions. And your problem description (even with code and data) is not sufficient to propose an ideal approach.
After spending now ~one hour on your data and making some plots. I believe (speculate) that the extremely sharp feature at ~10s cannot be a "physical" property of the data. It simply is too extreme/steep. Something fundamentally happened here. A guess of mine could be that some device was just switched on (was off before). Thus, the data before is meaningless, and there is a short period of time afterwards to stabilize the system. There is not really an alternative in this scenario but to entirely discard the data until the system has stabilized at around 40s. This also makes your problem trivial. Just delete the first 40s, then the maximum becomes evident.
So what are technical solutions you could use, please don't be too upset that you have to think about this yourself and assemble the best possible solution for your case. I copied your data in two numpy arrays x and y and ran the following test in python:
Remove unstable time
This is the trivial solution -- I prefer it.
plt.figure()
plt.xlabel('time')
plt.ylabel('signal')
plt.plot(x, y, label="original")
y_cut = y
y_cut[:40] = 0
plt.plot(x, y_cut, label="cut 40s")
plt.legend()
plt.grid()
plt.show()
Note carry on reading below only if you are a bit crazy (about data).
Sliding window
You mentioned "sliding window" which is best suited for random noise (which you don't have) or periodic fluctuations (which you also don't really have). Sliding window just averages over consecutive bins, averaging out random fluctuations. Mathematically this is a convolution.
Technically, you can actually solve your problem like this (try even larger values of Nwindow yourself):
Nwindow=10
y_slide_10 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
Nwindow=20
y_slide_20 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
Nwindow=30
y_slide_30 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
plt.xlabel('time')
plt.ylabel('signal')
plt.plot(x,y, label="original")
plt.plot(x,y_slide_10, label="window=10")
plt.plot(x,y_slide_20, label='window=20')
plt.plot(x,y_slide_30, label='window=30')
plt.legend()
#plt.xscale('log') # useful
plt.grid()
plt.show()
Thus, technically you can succeed to suppress the initial "hump". But don't forget this is a hand-tuned and not general solution...
Another caveat of any sliding window solution: this always distorts your timing. Since you average over an interval in time depending on rising or falling signals your convoluted trace is shifted back/forth in time (slightly, but significantly). In your particular case this is not a problem since the main signal region has basically no time-dependence (very flat).
Frequency domain
This should be the silver bullet, but it also does not work well/easily for your example. The fact that this doesn't work better is the main hint to me that the first 40s of data are better discarded.... (i.e. in a scientific work)
You can use fast Fourier transform to inspect your data in frequency-domain.
import scipy.fft
y_fft = scipy.fft.rfft(y)
# original frequency domain plot
plt.plot(y_fft, label="original")
plt.xlabel('frequency')
plt.ylabel('signal')
plt.yscale('log')
plt.show()
The structure in frequency represent the features of your data. The peak a zero is the stabilized region after ~100s, the humps are associated to (rapid) changes in time. You can now play around and change the frequency spectrum (--> filter) but I think the spectrum is so artificial that this doesn't yield great results here. Try it with other data and you may be very impressed! I tried two things, first cut high-frequency regions out (set to zero), and second, apply a sliding-window filter in frequency domain (sparing the peak at 0, since this cannot be touched. Try and you know why).
# cut high-frequency by setting to zero
y_fft_2 = np.array(y_fft)
y_fft_2[50:70] = 0
# sliding window in frequency
Nwindow = 15
Start = 10
y_fft_slide = np.array(y_fft)
y_fft_slide[Start:] = np.convolve(y_fft[Start:], np.ones((Nwindow,))/Nwindow, mode='same')
# frequency-domain plot
plt.plot(y_fft, label="original")
plt.plot(y_fft_2, label="high-frequency, filter")
plt.plot(y_fft_slide, label="frequency sliding window")
plt.xlabel('frequency')
plt.ylabel('signal')
plt.yscale('log')
plt.legend()
plt.show()
Converting this back into time-domain:
# reverse FFT into time-domain for plotting
y_filtered = scipy.fft.irfft(y_fft_2)
y_filtered_slide = scipy.fft.irfft(y_fft_slide)
# time-domain plot
plt.plot(x[:500], y[:500], label="original")
plt.plot(x[:500], y_filtered[:500], label="high-f filtered")
plt.plot(x[:500], y_filtered_slide[:500], label="frequency sliding window")
# plt.xscale('log') # useful
plt.grid()
plt.legend()
plt.show()
yields
There are apparent oscillations in those solutions which make them essentially useless for your purpose. This leads me to my final exercise to again apply a sliding-window filter on the "frequency sliding window" time-domain
# extra time-domain sliding window
Nwindow=90
y_fft_90 = np.convolve(y_filtered_slide, np.ones((Nwindow,))/Nwindow, mode='same')
# final time-domain plot
plt.plot(x[:500], y[:500], label="original")
plt.plot(x[:500], y_fft_90[:500], label="frequency-sliding window, slide")
# plt.xscale('log') # useful
plt.legend()
plt.show()
I am quite happy with this result, but it still has very small oscillations and thus does not solve your original problem.
Conclusion
How much fun. One hour well wasted. Maybe it is useful to someone. Maybe even to you Natasha. Please be not mad a me...

Let's assume your data is in data variable and time indices are in time. Then
import numpy as np
threshold = 0.025
stable_index = np.where(np.abs(data[-1] - data) > threshold)[0][-1] + 1
print('Stabilizes after', time[stable_index], 'sec')
Stabilizes after 96.6 sec
Here data[-1] - data is a difference between last value of data and all the data values. The assumption here is that the last value of data represents the equilibrium point.
np.where( * > threshold )[0] are all the indices of values of data which are greater than the threshold, that is still not stabilized. We take only the last index. The next one is where time series is considered stabilized, hence the + 1.

If you're dealing with deterministic data which is eventually converging monotonically to some fixed value, the problem is pretty straightforward. Your last observation should be the closest to the limit, so you can define an acceptable tolerance threshold relative to that last data point and scan your data from back to front to find where you exceeded your threshold.
Things get a lot nastier once you add random noise into the picture, particularly if there is serial correlation. This problem is common in simulation modeling(see (*) below), and is known as the issue of initial bias. It was first identified by Conway in 1963, and has been an active area of research since then with no universally accepted definitive answer on how to deal with it. As with the deterministic case, the most widely accepted answers approach the problem starting from the right-hand side of the data set since this is where the data are most likely to be in steady state. Techniques based on this approach use the end of the dataset to establish some sort of statistical yardstick or baseline to measure where the data start looking significantly different as observations get added by moving towards the front of the dataset. This is greatly complicated by the presence of serial correlation.
If a time series is in steady state, in the sense of being covariance stationary then a simple average of the data is an unbiased estimate of its expected value, but the standard error of the estimated mean depends heavily on the serial correlation. The correct standard error squared is no longer s2/n, but instead it is (s2/n)*W where W is a properly weighted sum of the autocorrelation values. A method called MSER was developed in the 1990's, and avoids the issue of trying to correctly estimate W by trying to determine where the standard error is minimized. It treats W as a de-facto constant given a sufficiently large sample size, so if you consider the ratio of two standard error estimates the W's cancel out and the minimum occurs where s2/n is minimized. MSER proceeds as follows:
Starting from the end, calculate s2 for half of the data set to establish a baseline.
Now update the estimate of s2 one observation at a time using an efficient technique such as Welford's online algorithm, calculate s2/n where n is the number of observations tallied so far. Track which value of n yields the smallest s2/n. Lather, rinse, repeat.
Once you've traversed the entire data set from back to front, the n which yielded the smallest s2/n is the number of observations from the end of the data set which are not detectable as being biased by the starting conditions.
Justification - with a sufficiently large baseline (half your data), s2/n should be relatively stable as long as the time series remains in steady state. Since n is monotonically increasing, s2/n should continue decreasing subject to the limitations of its variability as an estimate. However, once you start acquiring observations which are not in steady state the drift in mean and variance will inflate the numerator of s2/n. Hence the minimal value corresponds to the last observation where there was no indication of non-stationarity. More details can be found in this proceedings paper. A Ruby implementation is available on BitBucket.
Your data has such a small amount of variation that MSER concludes that it is still converging to steady state. As such, I'd advise going with the deterministic approach outlined in the first paragraph. If you have noisy data in the future, I'd definitely suggest giving MSER a shot.
(*) - In a nutshell, a simulation model is a computer program and hence has to have its state set to some set of initial values. We generally don't know what the system state will look like in the long run, so we initialize it to an arbitrary but convenient set of values and then let the system "warm up". The problem is that the initial results of the simulation are not typical of the steady state behaviors, so including that data in your analyses will bias them. The solution is to remove the biased portion of the data, but how much should that be?

Related

How to measure "homogeneity" of time series?

I have two time series, see this pic:
I need to measure the level of "homogeneity" of the series. So the first one looks very fragmented, so it should have low value close to zero and the second one should have a high value.
Any ideas of an algorithm I could use?
I'm not sure what is meant by homogeneity, but there is a well-established notion of stationarity of a time series. Basically, a time series is stationary if its rolling mean and standard deviation are constant across time. Both of your time series seem to have roughly constant mean, but the top one has a standard deviation that changes wildly across time; sometimes it's almost zero, and at other times, it's very large. Perhaps you could take the standard deviation of the rolling standard deviation, which will be far higher for the top series than for the bottom. If you can load them into pandas as top and bottom, it might look like
top_nonstationarity = np.std(top.rolling(window_size).std())
bottom_nonstationarity = np.std(bottom.rolling(window_size).std())
It might help to know more about the underlying difference between the series, or what you care about, but here goes...
I would subtract constants, if required, to give both series mean zero, and then square them to get something resembling power and filter this enough to smooth away what seems to be noise in the case of the lower filter. Then compute and compare the variances of the two filtered powers, which for the lower time series I would now expect to be a fairly constant line with a few drops down and for the upper series something spending about half of its time near zero and about half of its time away from it.
Possible filters include a simple moving average, whatever your time series toolkit provides, and those described at https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter

In matlab, speed up cross correlation

I have a long time series with some repeating and similar looking signals in it (not entirely periodical). The length of the time series is about 60000 samples. To identify the signals, I take out one of them, having a length of around 1000 samples and move it along my timeseries data sample by sample, and compute cross-correlation coefficient (in Matlab: corrcoef). If this value is above some threshold, then there is a match.
But this is excruciatingly slow (using 'for loop' to move the window).
Is there a way to speed this up, or maybe there is already some mechanism in Matlab for this ?
Many thanks
Edited: added information, regarding using 'xcorr' instead:
If I use 'xcorr', or at least the way I have used it, I get the wrong picture. Looking at the data (first plot), there are two types of repeating signals. One marked by red rectangles, whereas the other and having much larger amplitudes (this is coherent noise) is marked by a black rectangle. I am interested in the first type. Second plot shows the signal I am looking for, blown up.
If I use 'xcorr', I get the third plot. As you see, 'xcorr' gives me the wrong signal (there is in fact high cross correlation between my signal and coherent noise).
But using "'corrcoef' and moving the window, I get the last plot which is the correct one.
There maybe a problem of normalization when using 'xcorr', but I don't know.
I can think of two ways to speed things up.
1) make your template 1024 elements long. Suddenly, correlation can be done using FFT, which is significantly faster than DFT or element-by-element multiplication for every position.
2) Ask yourself what it is about your template shape that you really care about. Do you really need the very high frequencies, or are you really after lower frequencies? If you could re-sample your template and signal so it no longer contains any frequencies you don't care about, it will make the processing very significantly faster. Steps to take would include
determine the highest frequency you care about
filter your data so higher frequencies are blocked
resample the resulting data at a lower sampling frequency
Now combine that with a template whose size is a power of 2
You might find this link interesting reading.
Let us know if any of the above helps!
Your problem seems like a textbook example of cross-correlation. Therefore, there's no good reason using any solution other than xcorr. A few technical comments:
xcorr assumes that the mean was removed from the two cross-correlated signals. Furthermore, by default it does not scale the signals' standard deviations. Both of these issues can be solved by z-scoring your two signals: c=xcorr(zscore(longSig,1),zscore(shortSig,1)); c=c/n; where n is the length of the shorter signal should produce results equivalent with your sliding window method.
xcorr's output is ordered according to lags, which can obtained as in a second output argument ([c,lags]=xcorr(..). Always plot xcorr results by plot(lags,c). I recommend trying a synthetic signal to verify that you understand how to interpret this chart.
xcorr's implementation already uses Discere Fourier Transform, so unless you have unusual conditions it will be a waste of time to code a frequency-domain cross-correlation again.
Finally, a comment about terminology: Correlating corresponding time points between two signals is plain correlation. That's what corrcoef does (it name stands for correlation coefficient, no 'cross-correlation' there). Cross-correlation is the result of shifting one of the signals and calculating the correlation coefficient for each lag.

How can I detect these audio abnormalities?

iOS has an issue recording through some USB audio devices. It cannot be reliably reproduced (happens every 1 in ~2000-3000 records in batches and silently disappears), and we currently manually check our audio for any recording issues. It results in small numbers of samples (1-20) being shifted by a small number that sounds like a sort of 'crackle'.
They look like this:
closer:
closer:
another, single sample error elsewhere in the same audio file:
The question is, how can these be algorithmically be detected (assuming direct access to samples) whilst not triggering false positives on high frequency audio with waveforms like this:
Bonus points: after determining as many errors as possible, how can the audio be 'fixed'?
Dirty audio file - pictured
Another dirty audio file
Clean audio with valid high frequency - pictured
More bonus points: what could be causing this issue in the iOS USB audio drivers/hardware (assuming it is there).
I do not think there is an out of the box solution to find the disturbances, but here is one (non standard) way of tackling the problem. Using this, I could find most intervals and I only got a small number of false positives, but the algorithm could certainly use some fine tuning.
My idea is to find the start and end point of the deviating samples. The first step should be to make these points stand out more clearly. This can be done by taking the logarithm of the data and taking the differences between consecutive values.
In MATLAB I load the data (in this example I use dirty-sample-other.wav)
y1 = wavread('dirty-sample-pictured.wav');
y2 = wavread('dirty-sample-other.wav');
y3 = wavread('clean-highfreq.wav');
data = y2;
and use the following code:
logdata = log(1+data);
difflogdata = diff(logdata);
So instead of this plot of the original data:
we get:
where the intervals we are looking for stand out as a positive and negative spike. For example zooming in on the largest positive value in the plot of logarithm differences we get the following two figures. One for the original data:
and one for the difference of logarithms:
This plot could help with finding the areas manually but ideally we want to find them using an algorithm. The way I did this was to take a moving window of size 6, computing the mean value of the window (of all points except the minimum value), and compare this to the maximum value. If the maximum point is the only point that is above the mean value and at least twice as large as the mean it is counted as a positive extreme value.
I then used a threshold of counts, at least half of the windows moving over the value should detect it as an extreme value in order for it to be accepted.
Multiplying all points with (-1) this algorithm is then run again to detect the minimum values.
Marking the positive extremes with "o" and negative extremes with "*" we get the following two plots. One for the differences of logarithms:
and one for the original data:
Zooming in on the left part of the figure showing the logarithmic differences we can see that most extreme values are found:
It seems like most intervals are found and there are only a small number of false positives. For example running the algorithm on 'clean-highfreq.wav' I only find one positive and one negative extreme value.
Single values that are falsely classified as extreme values could perhaps be weeded out by matching start and end-points. And if you want to replace the lost data you could use some kind of interpolation using the surrounding data-points, perhaps even a linear interpolation will be good enough.
Here is the MATLAB-code I used:
function test20()
clc
clear all
y1 = wavread('dirty-sample-pictured.wav');
y2 = wavread('dirty-sample-other.wav');
y3 = wavread('clean-highfreq.wav');
data = y2;
logdata = log(1+data);
difflogdata = diff(logdata);
figure,plot(data),hold on,plot(data,'.')
figure,plot(difflogdata),hold on,plot(difflogdata,'.')
figure,plot(data),hold on,plot(data,'.'),xlim([68000,68200])
figure,plot(difflogdata),hold on,plot(difflogdata,'.'),xlim([68000,68200])
k = 6;
myData = difflogdata;
myPoints = findPoints(myData,k);
myData2 = -difflogdata;
myPoints2 = findPoints(myData2,k);
figure
plotterFunction(difflogdata,myPoints>=k,'or')
hold on
plotterFunction(difflogdata,myPoints2>=k,'*r')
figure
plotterFunction(data,myPoints>=k,'or')
hold on
plotterFunction(data,myPoints2>=k,'*r')
end
function myPoints = findPoints(myData,k)
iterationVector = k+1:length(myData);
myPoints = zeros(size(myData));
for i = iterationVector
subVector = myData(i-k:i);
meanSubVector = mean(subVector(subVector>min(subVector)));
[maxSubVector, maxIndex] = max(subVector);
if (sum(subVector>meanSubVector) == 1 && maxSubVector>2*meanSubVector)
myPoints(i-k-1+maxIndex) = myPoints(i-k-1+maxIndex) +1;
end
end
end
function plotterFunction(allPoints,extremeIndices,markerType)
extremePoints = NaN(size(allPoints));
extremePoints(extremeIndices) = allPoints(extremeIndices);
plot(extremePoints,markerType,'MarkerSize',15),
hold on
plot(allPoints,'.')
plot(allPoints)
end
Edit - comments on recovering the original data
Here is a slightly zoomed out view of figure three above: (the disturbance is between 6.8 and 6.82)
When I examine the values, your theory about the data being mirrored to negative values does not seem to fit the pattern exactly. But in any case, my thought about just removing the differences is certainly not correct. Since the surrounding points do not seem to be altered by the disturbance, I would probably go back to the original idea of not trusting the points within the affected region and instead using some sort of interpolation using the surrounding data. It seems like a simple linear interpolation would be a quite good approximation in most cases.
To answer the question of why it happens -
A USB audio device and host are not clock synchronous - that is to say that the host cannot accurately recover the relationship between the host's local clock and the word-clock of the ADC/DAC on the audio interface. Various techniques do exist for clock-recovery with various degrees of effectiveness. To add to the problem, the bus clock is likely to be unrelated to either of the two audio clocks.
Whilst you might imagine this not to be too much of a concern for audio receive - audio capture callbacks could happen when there is data - audio interfaces are usually bi-directional and the host will be rendering audio at regular interval, which the other end is potentially consuming at a slightly different rate.
In-between are several sets of buffers, which can over- or under-run, which is what looks to be happening here; the interval between it happening certainly seems about right.
You might find that changing USB audio device to one built around a different chip-set (or, simply a different local oscillator) helps.
As an aside both IEEE1394 audio and MPEG transport streams have the same clock recovery requirement. Both of them solve the problem with by embedding a local clock reference packet into the serial bitstream in a very predictable way which allows accurate clock recovery on the other end.
I think the following algorithm can be applied to samples in order to determine a potential false positive:
First, scan for high amount of high frequency, either via FFT'ing the sound block by block (256 values maybe), or by counting the consecutive samples above and below zero. The latter should keep track of maximum consecutive above zero, maximum consecutive below zero, the amount of small transitions around zero and the current volume of the block (0..1 as Audacity displays it). Then, if the maximum consecutive is below 5 (sampling at 44100, and zeroes be consecutive, while outstsanding samples are single, 5 responds to 4410Hz frequency, which is pretty high), or the sum of small transitions' lengths is above a certain value depending on maximum consecutive (I believe the first approximation would be 3*5*block size/distance between two maximums, which roughly equates to period of the loudest FFT frequency. Also it should be measured both above and below threshold, as we can end up with an erroneous peak, which will likely be detected by difference between main tempo measured on below-zero or above-zero maximums, also by std-dev of peaks. If high frequency is dominant, this block is eligible only for zero-value testing, and a special means to repair the data will be needed. If high frequency is significant, that is, there is a dominant low frequency detected, we can search for peaks bigger than 3.0*high frequency volume, as well as abnormal zeroes in this block.
Also, your gaps seem to be either highly extending or plain zero, with high extends to be single errors, and zero errors range from 1-20. So, if there is a zero range with values under 0.02 absolute value, which is directly surrounded by values of 0.15 (a variable to be finetuned) or higher absolute value AND of the same sign, count this point as an error. Single values that stand out can be detected if you calculate 2.0*(current sample)-(previous sample)-(next sample) and if it's above a certain threshold (0.1+high frequency volume, or 3.0*high frequency volume, whichever is bigger), count this as an error and average.
What to do with zero gaps found - we can copy values from 1 period backwards and 1 period forwards (averaging), where "period" is of the most significant frequency of the FFT of the block. If the "period" is smaller than the gap (say we've detected a gap of zeroes in a high-pitched part of the sound), use two or more periods, so the source data will all be valid (in this case, no averaging can be done, as it's possible that the signal 2 periods forward from the gap and 2 periods back will be in counterphase). If there are more than one frequency of about equal amplitude, we can plain sample these with correct phases, cutting the rest of less significant frequencies altogether.
The outstanding sample should IMO just be averaged by 2-4 surrounding samples, as there seems to be only a single sample ever encountered in your sound files.
The discrete wavelet transform (DWT) may be the solution to your problem.
A FFT calculation is not very useful in your case since its an average representation of relative frequency content over the entire duration of the signal, and thus impossible to detect momentary changes. The dicrete short time frequency transform (STFT) tries to tackle this by computing the DFT for short consecutive time-blocks of the signal, the length of which is determine by the length (and shape) of a window, but since the resolution of the DFT is dependent on the data/block-length, there is a trade-off between resolution in freqency OR in time, and finding this magical fixed window-size can be tricky!
What you want is a time-frequency analysis method with good time resolution for high-frequency events, and good frequency resolution for low-frequency events... Enter the discrete wavelet transform!
There are numerous wavelet transforms for different applications and as you might expect, it's computationally heavy. The DWT may not be practical solution to your problem, but it's worth considering. Good luck with your problem. Some friday-evening reading:
http://klapetek.cz/wdwt.html
http://etd.lib.fsu.edu/theses/available/etd-11242003-185039/unrestricted/09_ds_chapter2.pdf
http://en.wikipedia.org/wiki/Wavelet_transform
http://en.wikipedia.org/wiki/Discrete_wavelet_transform
You can try the following super-simple approach (maybe it's enough):
Take each point in your wave-form and subtract its predecessor (look at the changes from one point to the next).
Look at the distribution of these changes and find their standard deviation.
If any given difference is beyond X times this standard deviation (either above or below), flag it as a problem.
Determine the best value for X by playing with it and seeing how well it performs.
Most "problems" should come as a pair of two differences beyond your cutoff, one going up, and one going back down.
To stick with the super-simple approach, you can then fix the data by just interpolating linearly between the last good point before your problem-section and the first good point after. (Make sure you don't just delete the points as this will influence (raise) the pitch of your audio.)

What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:
ETC = currTime + currAvg * (totalSize - sizeDone)
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of prediction is important, the way to go about about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
An excellent resource for studying statistical learning methods is The Elements of
Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing
noise (random variations) and other inaccuracies, and produce values
that tend to be closer to the true values of the measurements and
their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and then it is hard to distinguish between meaningful ETCs and meaningless ETCs, especially for inexperienced users.
I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Other than statistics approach, one simple way to have a good estimation of the current speed while erasing some noise or spikes is to take a weighted approach.
You already experimented with the sliding window, the idea here is to take a fairly large sliding window, but instead of a plain average, giving more weight to more recent measures, since they are more indicative of the evolution (a bit like a derivative).
Example: Suppose you have 10 previous windows (most recent x0, least recent x9), then you could compute the speed:
Speed = (10 * x0 + 9 * x1 + 8 * x2 + ... + x9) / (10 * window-time) / 55
When you have a good assessment of the likely speed, then you are close to get a good estimated time.
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have demonstrated that users reacted very badly to slow-down and very positively to speed-up. Therefore, a good progress bar / estimated time should be conservative in the estimates presented (reserving time for a potential slow-down) at first.
A simple way to get that is to have a factor that is a percentage of the completion, that you use to tweak the estimated remaining time. For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, the cubic function produces the nice speed-up toward the completion time. Other functions could use an exponential form 1 - e^x, etc...

Peak detection of measured signal

We use a data acquisition card to take readings from a device that increases its signal to a peak and then falls back to near the original value. To find the peak value we currently search the array for the highest reading and use the index to determine the timing of the peak value which is used in our calculations.
This works well if the highest value is the peak we are looking for but if the device is not working correctly we can see a second peak which can be higher than the initial peak. We take 10 readings a second from 16 devices over a 90 second period.
My initial thoughts are to cycle through the readings checking to see if the previous and next points are less than the current to find a peak and construct an array of peaks. Maybe we should be looking at a average of a number of points either side of the current position to allow for noise in the system. Is this the best way to proceed or are there better techniques?
We do use LabVIEW and I have checked the LAVA forums and there are a number of interesting examples. This is part of our test software and we are trying to avoid using too many non-standard VI libraries so I was hoping for feedback on the process/algorithms involved rather than specific code.
There are lots and lots of classic peak detection methods, any of which might work. You'll have to see what, in particular, bounds the quality of your data. Here are basic descriptions:
Between any two points in your data, (x(0), y(0)) and (x(n), y(n)), add up y(i + 1) - y(i) for 0 <= i < n and call this T ("travel") and set R ("rise") to y(n) - y(0) + k for suitably small k. T/R > 1 indicates a peak. This works OK if large travel due to noise is unlikely or if noise distributes symmetrically around a base curve shape. For your application, accept the earliest peak with a score above a given threshold, or analyze the curve of travel per rise values for more interesting properties.
Use matched filters to score similarity to a standard peak shape (essentially, use a normalized dot-product against some shape to get a cosine-metric of similarity)
Deconvolve against a standard peak shape and check for high values (though I often find 2 to be less sensitive to noise for simple instrumentation output).
Smooth the data and check for triplets of equally spaced points where, if x0 < x1 < x2, y1 > 0.5 * (y0 + y2), or check Euclidean distances like this: D((x0, y0), (x1, y1)) + D((x1, y1), (x2, y2)) > D((x0, y0),(x2, y2)), which relies on the triangle inequality. Using simple ratios will again provide you a scoring mechanism.
Fit a very simple 2-gaussian mixture model to your data (for example, Numerical Recipes has a nice ready-made chunk of code). Take the earlier peak. This will deal correctly with overlapping peaks.
Find the best match in the data to a simple Gaussian, Cauchy, Poisson, or what-have-you curve. Evaluate this curve over a broad range and subtract it from a copy of the data after noting it's peak location. Repeat. Take the earliest peak whose model parameters (standard deviation probably, but some applications might care about kurtosis or other features) meet some criterion. Watch out for artifacts left behind when peaks are subtracted from the data.
Best match might be determined by the kind of match scoring suggested in #2 above.
I've done what you're doing before: finding peaks in DNA sequence data, finding peaks in derivatives estimated from measured curves, and finding peaks in histograms.
I encourage you to attend carefully to proper baselining. Wiener filtering or other filtering or simple histogram analysis is often an easy way to baseline in the presence of noise.
Finally, if your data is typically noisy and you're getting data off the card as unreferenced single-ended output (or even referenced, just not differential), and if you're averaging lots of observations into each data point, try sorting those observations and throwing away the first and last quartile and averaging what remains. There are a host of such outlier elimination tactics that can be really useful.
You could try signal averaging, i.e. for each point, average the value with the surrounding 3 or more points. If the noise blips are huge, then even this may not help.
I realise that this was language agnostic, but guessing that you are using LabView, there are lots of pre-packaged signal processing VIs that come with LabView that you can use to do smoothing and noise reduction. The NI forums are a great place to get more specialised help on this sort of thing.
This problem has been studied in some detail.
There are a set of very up-to-date implementations in the TSpectrum* classes of ROOT (a nuclear/particle physics analysis tool). The code works in one- to three-dimensional data.
The ROOT source code is available, so you can grab this implementation if you want.
From the TSpectrum class documentation:
The algorithms used in this class have been published in the following references:
[1] M.Morhac et al.: Background
elimination methods for
multidimensional coincidence gamma-ray
spectra. Nuclear Instruments and
Methods in Physics Research A 401
(1997) 113-
132.
[2] M.Morhac et al.: Efficient one- and two-dimensional Gold
deconvolution and its application to
gamma-ray spectra decomposition.
Nuclear Instruments and Methods in
Physics Research A 401 (1997) 385-408.
[3] M.Morhac et al.: Identification of peaks in
multidimensional coincidence gamma-ray
spectra. Nuclear Instruments and
Methods in Research Physics A
443(2000), 108-125.
The papers are linked from the class documentation for those of you who don't have a NIM online subscription.
The short version of what is done is that the histogram flattened to eliminate noise, and then local maxima are detected by brute force in the flattened histogram.
I would like to contribute to this thread an algorithm that I have developed myself:
It is based on the principle of dispersion: if a new datapoint is a given x number of standard deviations away from some moving mean, the algorithm signals (also called z-score). The algorithm is very robust because it constructs a separate moving mean and deviation, such that signals do not corrupt the threshold. Future signals are therefore identified with approximately the same accuracy, regardless of the amount of previous signals. The algorithm takes 3 inputs: lag = the lag of the moving window, threshold = the z-score at which the algorithm signals and influence = the influence (between 0 and 1) of new signals on the mean and standard deviation. For example, a lag of 5 will use the last 5 observations to smooth the data. A threshold of 3.5 will signal if a datapoint is 3.5 standard deviations away from the moving mean. And an influence of 0.5 gives signals half of the influence that normal datapoints have. Likewise, an influence of 0 ignores signals completely for recalculating the new threshold: an influence of 0 is therefore the most robust option.
It works as follows:
Pseudocode
# Let y be a vector of timeseries data of at least length lag+2
# Let mean() be a function that calculates the mean
# Let std() be a function that calculates the standard deviaton
# Let absolute() be the absolute value function
# Settings (the ones below are examples: choose what is best for your data)
set lag to 5; # lag 5 for the smoothing functions
set threshold to 3.5; # 3.5 standard deviations for signal
set influence to 0.5; # between 0 and 1, where 1 is normal influence, 0.5 is half
# Initialise variables
set signals to vector 0,...,0 of length of y; # Initialise signal results
set filteredY to y(1,...,lag) # Initialise filtered series
set avgFilter to null; # Initialise average filter
set stdFilter to null; # Initialise std. filter
set avgFilter(lag) to mean(y(1,...,lag)); # Initialise first value
set stdFilter(lag) to std(y(1,...,lag)); # Initialise first value
for i=lag+1,...,t do
if absolute(y(i) - avgFilter(i-1)) > threshold*stdFilter(i-1) then
if y(i) > avgFilter(i-1)
set signals(i) to +1; # Positive signal
else
set signals(i) to -1; # Negative signal
end
# Adjust the filters
set filteredY(i) to influence*y(i) + (1-influence)*filteredY(i-1);
set avgFilter(i) to mean(filteredY(i-lag,i),lag);
set stdFilter(i) to std(filteredY(i-lag,i),lag);
else
set signals(i) to 0; # No signal
# Adjust the filters
set filteredY(i) to y(i);
set avgFilter(i) to mean(filteredY(i-lag,i),lag);
set stdFilter(i) to std(filteredY(i-lag,i),lag);
end
end
Demo
> For more information, see original answer
This method is basically from David Marr's book "Vision"
Gaussian blur your signal with the expected width of your peaks.
this gets rid of noise spikes and your phase data is undamaged.
Then edge detect (LOG will do)
Then your edges were the edges of features (like peaks).
look between edges for peaks, sort peaks by size, and you're done.
I have used variations on this and they work very well.
I think you want to cross-correlate your signal with an expected, exemplar signal. But, it has been such a long time since I studied signal processing and even then I didn't take much notice.
I don't know very much about instrumentation, so this might be totally impractical, but then again it might be a helpful different direction. If you know how the readings can fail, and there is a certain interval between peaks given such failures, why not do gradient descent at each interval. If the descent brings you back to an area you've searched before, you can abandon it. Depending upon the shape of the sampled surface, this also might help you find peaks faster than search.
Is there a qualitative difference between the desired peak and the unwanted second peak? If both peaks are "sharp" -- i.e. short in time duration -- when looking at the signal in the frequency domain (by doing FFT) you'll get energy at most bands. But if the "good" peak reliably has energy present at frequencies not existing in the "bad" peak, or vice versa, you may be able to automatically differentiate them that way.
You could apply some Standard Deviation to your logic and take notice of peaks over x%.

Resources