In matlab, speed up cross correlation - performance

I have a long time series with some repeating and similar looking signals in it (not entirely periodical). The length of the time series is about 60000 samples. To identify the signals, I take out one of them, having a length of around 1000 samples and move it along my timeseries data sample by sample, and compute cross-correlation coefficient (in Matlab: corrcoef). If this value is above some threshold, then there is a match.
But this is excruciatingly slow (using 'for loop' to move the window).
Is there a way to speed this up, or maybe there is already some mechanism in Matlab for this ?
Many thanks
Edited: added information, regarding using 'xcorr' instead:
If I use 'xcorr', or at least the way I have used it, I get the wrong picture. Looking at the data (first plot), there are two types of repeating signals. One marked by red rectangles, whereas the other and having much larger amplitudes (this is coherent noise) is marked by a black rectangle. I am interested in the first type. Second plot shows the signal I am looking for, blown up.
If I use 'xcorr', I get the third plot. As you see, 'xcorr' gives me the wrong signal (there is in fact high cross correlation between my signal and coherent noise).
But using "'corrcoef' and moving the window, I get the last plot which is the correct one.
There maybe a problem of normalization when using 'xcorr', but I don't know.

I can think of two ways to speed things up.
1) make your template 1024 elements long. Suddenly, correlation can be done using FFT, which is significantly faster than DFT or element-by-element multiplication for every position.
2) Ask yourself what it is about your template shape that you really care about. Do you really need the very high frequencies, or are you really after lower frequencies? If you could re-sample your template and signal so it no longer contains any frequencies you don't care about, it will make the processing very significantly faster. Steps to take would include
determine the highest frequency you care about
filter your data so higher frequencies are blocked
resample the resulting data at a lower sampling frequency
Now combine that with a template whose size is a power of 2
You might find this link interesting reading.
Let us know if any of the above helps!

Your problem seems like a textbook example of cross-correlation. Therefore, there's no good reason using any solution other than xcorr. A few technical comments:
xcorr assumes that the mean was removed from the two cross-correlated signals. Furthermore, by default it does not scale the signals' standard deviations. Both of these issues can be solved by z-scoring your two signals: c=xcorr(zscore(longSig,1),zscore(shortSig,1)); c=c/n; where n is the length of the shorter signal should produce results equivalent with your sliding window method.
xcorr's output is ordered according to lags, which can obtained as in a second output argument ([c,lags]=xcorr(..). Always plot xcorr results by plot(lags,c). I recommend trying a synthetic signal to verify that you understand how to interpret this chart.
xcorr's implementation already uses Discere Fourier Transform, so unless you have unusual conditions it will be a waste of time to code a frequency-domain cross-correlation again.
Finally, a comment about terminology: Correlating corresponding time points between two signals is plain correlation. That's what corrcoef does (it name stands for correlation coefficient, no 'cross-correlation' there). Cross-correlation is the result of shifting one of the signals and calculating the correlation coefficient for each lag.

Related

How to measure "homogeneity" of time series?

I have two time series, see this pic:
I need to measure the level of "homogeneity" of the series. So the first one looks very fragmented, so it should have low value close to zero and the second one should have a high value.
Any ideas of an algorithm I could use?
I'm not sure what is meant by homogeneity, but there is a well-established notion of stationarity of a time series. Basically, a time series is stationary if its rolling mean and standard deviation are constant across time. Both of your time series seem to have roughly constant mean, but the top one has a standard deviation that changes wildly across time; sometimes it's almost zero, and at other times, it's very large. Perhaps you could take the standard deviation of the rolling standard deviation, which will be far higher for the top series than for the bottom. If you can load them into pandas as top and bottom, it might look like
top_nonstationarity = np.std(top.rolling(window_size).std())
bottom_nonstationarity = np.std(bottom.rolling(window_size).std())
It might help to know more about the underlying difference between the series, or what you care about, but here goes...
I would subtract constants, if required, to give both series mean zero, and then square them to get something resembling power and filter this enough to smooth away what seems to be noise in the case of the lower filter. Then compute and compare the variances of the two filtered powers, which for the lower time series I would now expect to be a fairly constant line with a few drops down and for the upper series something spending about half of its time near zero and about half of its time away from it.
Possible filters include a simple moving average, whatever your time series toolkit provides, and those described at https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter

How do fourier processing algorithms deal with "data edges"

I am doing some interesting experiments with audio and image files and Fast-Fourier Transforms (FFTs).
Fast Fourier Transforms are used in signal processing rather than other Fourier Transform algorithms because for large quantities of data they are the only (or one of the only) viable algorithm variants to use, as they scale as O(n log(n)), rather than n^2 as the naive implementation does.
The disadvantage is that the data must be stored in an array which has 2^n elements, for n integer.
When processing some data which does not have 2^n elements, the simple approach is to extend the array to be length 2^n and fill the "empty" elements with zero. (Assuming the mean value of the input signal is zero.)
I wrote a program to process some audio samples taken from WAV files. I tried implementing things such as a low-cut filter. In this case I found that my output signal (after doing the reverse transform) cuts to zero amplitude after a certain period of time. This is obviously not what one would expect of a low-pass filter.
I could dump my code at this point, but that is neither useful, nor legal as the source of my algorithm is a text-book with closed source code.
Instead I shall ask the following question.
Is packing out the array with zeros the best possible thing to do? Could this be causing my program to produce the unexpected results I am seeing? if I understand fourier mathematics correctly, having a bunch of zeros at the end of my array will introduce a large amount of low and high-frequency content as this essentially looks like a step-function (low frequency square wave). Should I be doing something else such as implementing my band-pass filter in a different way, for example, splitting the data into smaller groups of say 1024 samples and applying the FT, filter and IFT (inverse FT) to those small groups?
This question has been tagged with theory as it is not related to any specific programming language. (I assume that is the correct tag to use?)
Edit: It's now working beautifully, thanks all, I was able to pinpoint the 2 mistakes I made using the information below.
All finite length DFTs and FFT multiply longer data (longer source data or wav file than the FFT) with a rectangular window, which convolves the spectrum with a (periodic) Sinc function. Zero padding uses a shorter rectangular window, which results in the convolution of the spectrum with a wider Sinc function.
Filtering by multiplication of FFTs results in circular convolution, which wraps the impulse response of the filter around the FFT/IFFT result (e.g. the end of your filtered signal will interfere with the beginning of the filtered signal within the IFFT result). So you want to zero-pad your data before the FFT, and then see the impulse response of your filter go to zero at or before the very end of the filtered result (e.g. not wrap around). Look up the overlap-add and overlap-save algorithms, for using short FFTs for fast convolution filtering of longer signals, which take care of the filter impulse response extending into the zero-padded portion.
You can also use FFTs that are not a power of 2 in length. Any length that can be factored into small primes will work with most modern FFT libraries.
It depends what you are interested in.
If you are just interested in spectrum magnitude, then place the real data in the middle of the window to be processed. Just know that this time shift will put a phase shift into the spectrum result.
Regardless of the number of points, do not forget to place a window on your data. Wikipedia has a good write up on the windowing functions at https://en.wikipedia.org/wiki/Window_function.
If you do not perform some sort of windowing on your real world data, the padded signal will appear to have a step up and a step down at the end of the valid data (which puts a lot of noise into your spectrum giving you the false impression that you have a noise floor).
So, my recommendation, if you primarily care about magnitude:
- develop a hamming window for the number of points of valid data you have.
- apply the hamming window to the data you have
After that you have OPTIONS:
A) if your samples are slightly above a base two number, use the lower base two number (i.e. if you have 1400 points, do two 1024 point FFTs with overlap). The results of these two FFTs can be "smartly" combined for an aggregate spectrum. Depending on your fidelity needs, you can do this with more FFTs with a larger portion of overlapped data. Try to keep the overlap less that 10% to account for your window edges that will get attenuated by the start and end of the windowing functions.
B) place your windowed data anywhere in the FFT input vector (beginning, middle or end, it should only impact your phase results - which is why I asked if phase is important).
If it turns out phase is important, start your valid windowed data at the beginning of the FFT vector.
Regarding your spectrum observations (I just went through the same thing two weeks ago). If you are looking at a wave file converted from a lossy compression, you are going to be starting with a band limited signal, so expect the spectrum to do an abrupt drop. My first lossless wave file plot had a huge bald spot from Fs/10 -> 9Fs/10 (which is expected). For your plots - also display your data in logarithmic bins (linear bins will give you misleading info and squish the lower frequency elements which are the bulk of the signal in compressed music files).
FYI - I recommended hamming (because I did the same thing). A decoded compressed audio signal will only use a portion of your spectrum (decoding a 320kbps stream is sampled at 10Khz), even when decoded to 44.1Khz representation, all of the interesting data should be below 5Khz.
Best of luck
J.R.
P.S. this is my first post here, chime back if you want some pretty pictures from TeraPlot.
This is a question for http://dsp.stackexchange.com but yes, zero-padding is perfectly legitimate here.
Here’s why the filtered signal (once it’s back in the time-domain) goes to zero after some time: imagine linearly-convolving the zero-padded signal with your low-pass filter’s impulse response (using the slow O(N^2) time-domain filter implementation). The output will go to zero after the original signal is done, when the filter is just being fed with zeros, right? That result will be the same as the output of FFT-based fast convolution. It’s perfectly normal. Just crop the output signal to the same length of the input and move on with your life.
Caveat on FFT orders: just because power-of-two FFT lengths are “the fastest” in terms of operation count, while FFTs of lengths with low prime factors (3, 5, 7) have slightly higher operation counts, you may find that zero-padding to a low-prime-factor is faster in terms of real-world runtime because of memory costs. A pathological example: if you have a 1025-long signal, you probably don’t want to zero-pad to 2048 and eat the cost of allocating a nearly 2x memory buffer, and running a nearly 2x longer FFT. You’d try 1080-length FFT or something (1080 = 2^3 * 3^3 * 5: nextprod is your friend) and wouldn’t be surprised if it completed much faster than power-of-two.

How can I detect these audio abnormalities?

iOS has an issue recording through some USB audio devices. It cannot be reliably reproduced (happens every 1 in ~2000-3000 records in batches and silently disappears), and we currently manually check our audio for any recording issues. It results in small numbers of samples (1-20) being shifted by a small number that sounds like a sort of 'crackle'.
They look like this:
closer:
closer:
another, single sample error elsewhere in the same audio file:
The question is, how can these be algorithmically be detected (assuming direct access to samples) whilst not triggering false positives on high frequency audio with waveforms like this:
Bonus points: after determining as many errors as possible, how can the audio be 'fixed'?
Dirty audio file - pictured
Another dirty audio file
Clean audio with valid high frequency - pictured
More bonus points: what could be causing this issue in the iOS USB audio drivers/hardware (assuming it is there).
I do not think there is an out of the box solution to find the disturbances, but here is one (non standard) way of tackling the problem. Using this, I could find most intervals and I only got a small number of false positives, but the algorithm could certainly use some fine tuning.
My idea is to find the start and end point of the deviating samples. The first step should be to make these points stand out more clearly. This can be done by taking the logarithm of the data and taking the differences between consecutive values.
In MATLAB I load the data (in this example I use dirty-sample-other.wav)
y1 = wavread('dirty-sample-pictured.wav');
y2 = wavread('dirty-sample-other.wav');
y3 = wavread('clean-highfreq.wav');
data = y2;
and use the following code:
logdata = log(1+data);
difflogdata = diff(logdata);
So instead of this plot of the original data:
we get:
where the intervals we are looking for stand out as a positive and negative spike. For example zooming in on the largest positive value in the plot of logarithm differences we get the following two figures. One for the original data:
and one for the difference of logarithms:
This plot could help with finding the areas manually but ideally we want to find them using an algorithm. The way I did this was to take a moving window of size 6, computing the mean value of the window (of all points except the minimum value), and compare this to the maximum value. If the maximum point is the only point that is above the mean value and at least twice as large as the mean it is counted as a positive extreme value.
I then used a threshold of counts, at least half of the windows moving over the value should detect it as an extreme value in order for it to be accepted.
Multiplying all points with (-1) this algorithm is then run again to detect the minimum values.
Marking the positive extremes with "o" and negative extremes with "*" we get the following two plots. One for the differences of logarithms:
and one for the original data:
Zooming in on the left part of the figure showing the logarithmic differences we can see that most extreme values are found:
It seems like most intervals are found and there are only a small number of false positives. For example running the algorithm on 'clean-highfreq.wav' I only find one positive and one negative extreme value.
Single values that are falsely classified as extreme values could perhaps be weeded out by matching start and end-points. And if you want to replace the lost data you could use some kind of interpolation using the surrounding data-points, perhaps even a linear interpolation will be good enough.
Here is the MATLAB-code I used:
function test20()
clc
clear all
y1 = wavread('dirty-sample-pictured.wav');
y2 = wavread('dirty-sample-other.wav');
y3 = wavread('clean-highfreq.wav');
data = y2;
logdata = log(1+data);
difflogdata = diff(logdata);
figure,plot(data),hold on,plot(data,'.')
figure,plot(difflogdata),hold on,plot(difflogdata,'.')
figure,plot(data),hold on,plot(data,'.'),xlim([68000,68200])
figure,plot(difflogdata),hold on,plot(difflogdata,'.'),xlim([68000,68200])
k = 6;
myData = difflogdata;
myPoints = findPoints(myData,k);
myData2 = -difflogdata;
myPoints2 = findPoints(myData2,k);
figure
plotterFunction(difflogdata,myPoints>=k,'or')
hold on
plotterFunction(difflogdata,myPoints2>=k,'*r')
figure
plotterFunction(data,myPoints>=k,'or')
hold on
plotterFunction(data,myPoints2>=k,'*r')
end
function myPoints = findPoints(myData,k)
iterationVector = k+1:length(myData);
myPoints = zeros(size(myData));
for i = iterationVector
subVector = myData(i-k:i);
meanSubVector = mean(subVector(subVector>min(subVector)));
[maxSubVector, maxIndex] = max(subVector);
if (sum(subVector>meanSubVector) == 1 && maxSubVector>2*meanSubVector)
myPoints(i-k-1+maxIndex) = myPoints(i-k-1+maxIndex) +1;
end
end
end
function plotterFunction(allPoints,extremeIndices,markerType)
extremePoints = NaN(size(allPoints));
extremePoints(extremeIndices) = allPoints(extremeIndices);
plot(extremePoints,markerType,'MarkerSize',15),
hold on
plot(allPoints,'.')
plot(allPoints)
end
Edit - comments on recovering the original data
Here is a slightly zoomed out view of figure three above: (the disturbance is between 6.8 and 6.82)
When I examine the values, your theory about the data being mirrored to negative values does not seem to fit the pattern exactly. But in any case, my thought about just removing the differences is certainly not correct. Since the surrounding points do not seem to be altered by the disturbance, I would probably go back to the original idea of not trusting the points within the affected region and instead using some sort of interpolation using the surrounding data. It seems like a simple linear interpolation would be a quite good approximation in most cases.
To answer the question of why it happens -
A USB audio device and host are not clock synchronous - that is to say that the host cannot accurately recover the relationship between the host's local clock and the word-clock of the ADC/DAC on the audio interface. Various techniques do exist for clock-recovery with various degrees of effectiveness. To add to the problem, the bus clock is likely to be unrelated to either of the two audio clocks.
Whilst you might imagine this not to be too much of a concern for audio receive - audio capture callbacks could happen when there is data - audio interfaces are usually bi-directional and the host will be rendering audio at regular interval, which the other end is potentially consuming at a slightly different rate.
In-between are several sets of buffers, which can over- or under-run, which is what looks to be happening here; the interval between it happening certainly seems about right.
You might find that changing USB audio device to one built around a different chip-set (or, simply a different local oscillator) helps.
As an aside both IEEE1394 audio and MPEG transport streams have the same clock recovery requirement. Both of them solve the problem with by embedding a local clock reference packet into the serial bitstream in a very predictable way which allows accurate clock recovery on the other end.
I think the following algorithm can be applied to samples in order to determine a potential false positive:
First, scan for high amount of high frequency, either via FFT'ing the sound block by block (256 values maybe), or by counting the consecutive samples above and below zero. The latter should keep track of maximum consecutive above zero, maximum consecutive below zero, the amount of small transitions around zero and the current volume of the block (0..1 as Audacity displays it). Then, if the maximum consecutive is below 5 (sampling at 44100, and zeroes be consecutive, while outstsanding samples are single, 5 responds to 4410Hz frequency, which is pretty high), or the sum of small transitions' lengths is above a certain value depending on maximum consecutive (I believe the first approximation would be 3*5*block size/distance between two maximums, which roughly equates to period of the loudest FFT frequency. Also it should be measured both above and below threshold, as we can end up with an erroneous peak, which will likely be detected by difference between main tempo measured on below-zero or above-zero maximums, also by std-dev of peaks. If high frequency is dominant, this block is eligible only for zero-value testing, and a special means to repair the data will be needed. If high frequency is significant, that is, there is a dominant low frequency detected, we can search for peaks bigger than 3.0*high frequency volume, as well as abnormal zeroes in this block.
Also, your gaps seem to be either highly extending or plain zero, with high extends to be single errors, and zero errors range from 1-20. So, if there is a zero range with values under 0.02 absolute value, which is directly surrounded by values of 0.15 (a variable to be finetuned) or higher absolute value AND of the same sign, count this point as an error. Single values that stand out can be detected if you calculate 2.0*(current sample)-(previous sample)-(next sample) and if it's above a certain threshold (0.1+high frequency volume, or 3.0*high frequency volume, whichever is bigger), count this as an error and average.
What to do with zero gaps found - we can copy values from 1 period backwards and 1 period forwards (averaging), where "period" is of the most significant frequency of the FFT of the block. If the "period" is smaller than the gap (say we've detected a gap of zeroes in a high-pitched part of the sound), use two or more periods, so the source data will all be valid (in this case, no averaging can be done, as it's possible that the signal 2 periods forward from the gap and 2 periods back will be in counterphase). If there are more than one frequency of about equal amplitude, we can plain sample these with correct phases, cutting the rest of less significant frequencies altogether.
The outstanding sample should IMO just be averaged by 2-4 surrounding samples, as there seems to be only a single sample ever encountered in your sound files.
The discrete wavelet transform (DWT) may be the solution to your problem.
A FFT calculation is not very useful in your case since its an average representation of relative frequency content over the entire duration of the signal, and thus impossible to detect momentary changes. The dicrete short time frequency transform (STFT) tries to tackle this by computing the DFT for short consecutive time-blocks of the signal, the length of which is determine by the length (and shape) of a window, but since the resolution of the DFT is dependent on the data/block-length, there is a trade-off between resolution in freqency OR in time, and finding this magical fixed window-size can be tricky!
What you want is a time-frequency analysis method with good time resolution for high-frequency events, and good frequency resolution for low-frequency events... Enter the discrete wavelet transform!
There are numerous wavelet transforms for different applications and as you might expect, it's computationally heavy. The DWT may not be practical solution to your problem, but it's worth considering. Good luck with your problem. Some friday-evening reading:
http://klapetek.cz/wdwt.html
http://etd.lib.fsu.edu/theses/available/etd-11242003-185039/unrestricted/09_ds_chapter2.pdf
http://en.wikipedia.org/wiki/Wavelet_transform
http://en.wikipedia.org/wiki/Discrete_wavelet_transform
You can try the following super-simple approach (maybe it's enough):
Take each point in your wave-form and subtract its predecessor (look at the changes from one point to the next).
Look at the distribution of these changes and find their standard deviation.
If any given difference is beyond X times this standard deviation (either above or below), flag it as a problem.
Determine the best value for X by playing with it and seeing how well it performs.
Most "problems" should come as a pair of two differences beyond your cutoff, one going up, and one going back down.
To stick with the super-simple approach, you can then fix the data by just interpolating linearly between the last good point before your problem-section and the first good point after. (Make sure you don't just delete the points as this will influence (raise) the pitch of your audio.)

Second order low-pass filter algorithm

I need to filter some noise from a signal and a simple RC first order filter seems not to be enough. I've been looking around but I haven't found algorithms for other filters (although many examples of how to do it with analogue circuits). Can somebody pinpoint where can I find such algorithms? Or at least write one here?
For clarification: I take the signal from an oscilloscope, and I only have one cycle. This cycle looks a little bit like:
125 * (x > 3 ? exp(-(x - 3) / 2) : exp(5*(x - 3)))
Now, the signal not always have that shape and I need to compute the derivate of the signal, which is easy if not because when one zooms the signal enough (each point is 160 nano seconds appart) you can see a lot of noise. So, before computing derivatives I need to flattern the signal.
If you are asking for how to design a higher order filter than a simple first order, how about choosing a filter from here:wiki on Filter_(signal_processing)
Just hypothesizing about your question, so here are a couple of design points.
1) You probably don't want to have ripple (varying gain) in your pass band, as that would distort your signal.
2) You probably don't care about having ripple in your stop band, as the signal should be close to 0 there anyway.
3) The higher the order of the filter, the more it looks like a ideal square shaped filter.
4) The higher the rolloff the better, you want to cut down on the noise outside of your passband as quickly as possible.
5) You may or may not care about "group delay", which is a measure of the distortion caused by different frequencies taking different times to pass through the filter. For audio, you probably want a not too high group delay, as you can imagine having different frequency components undergoing different time (and thus phase) shifts will cause some distortion.
Once you select the filter you want based on these (and possibly other) considerations, then simply implement it using some topology, like those mentioned here
With only a vague description of your requirements it's hard to give any specific suggestions.
You need to specify the parameters of your filter: sample rate, cut-off frequency, width of transition band, pass-band ripple, minimum stop-band rejection, whether phase and group delay are an issue, etc. Once you have at least some of these parameters pinned down then you can start the process of selecting an appropriate filter design, i.e. basic filter type, number of stages, etc.
It would also be helpful to know what kind of signal you want to filter - is it audio, or something else ? How many bits per sample ?
You need a good definition of your signal, a good analysis of your noise, and a clear understanding of the difference between the two, in order to determine what algorithms might be appropriate for removing one and not eliminating information in the other. Then you need to define the computational environment (integer or float ALU, add and multiply cycles?), and set a computational budget. There's a big difference between a second-order IIR and a giga-point FFT.
Some very commonly used 2nd-order digital filters are described in RBJ's biquad cookbook.

Looking for ideas for a simple pattern matching algorithm to run on a microcontroller

I'm working on a project to recognize simple audio patterns. I have two data sets, each made up of between 4 and 32 note/duration pairs. One set is predefined, the other is from an incoming data stream. The length of the two strongly correlated data sets is often different, but roughly the same "shape". My goal is to come up with some sort of ranking as to how well the two data sets correlate/match.
I have converted the incoming frequencies to pitch and shifted the incoming data stream's pitch so that it's average pitch matches that of the predefined data set. I also stretch/compress the incoming data set's durations to match the overall duration of the predefined set. Here are two graphical examples of data that should be ranked as strongly correlated:
http://s2.postimage.org/FVeG0-ee3c23ecc094a55b15e538c3a0d83dd5.gif
(Sorry, as a new user I couldn't directly post images)
I'm doing this on a 8-bit microcontroller so resources are minimal. Speed is less an issue, a second or two of processing isn't a deal breaker.
It wouldn't surprise me if there is an obvious solution, I've just been staring at the problem too long. Any ideas?
Thanks in advance...
Couldn't see the graphic, but... Divide the spectrum into bins. You've probably already done this already , but they may be too fine. Depending on your application, consider dividing the spectrum into, say 16 or 32 bins, maybe logarithmically, since that is how we hear. Then, compare the ratios of the power in each bin. E.g, compare the ratio of 500 Hz to 1000 Hz in the first sample with that same ratio in the 2nd sample. That gets rid of any problem with unequal amplitudes of the samples.
1D signal matching is often done with using the convolution function. However, this may be processor intensive.
A simpler algorithm that could be used is to first check if the durations of each note the two signals are roughly equal. Then if check the next-frequency pattern of the two signals are the same. What I mean by next-frequency pattern is to decompose the ordered list of frequencies to an ordered list of whether or not the next frequency is higher or lower. So something that goes 500Hz to 1000Hz to 700Hz to 400Hz would simply become Higher-Lower-Lower. This may be good enough, depending on your purposes.

Resources