time series data change point detection - algorithm

I am trying to segment time series data into different zones.
In each time period, the pressure runs under an allowed maximum stress level, which I was not told beforehand. Please see the pictures below.
Edit: each time period is longer than a week.
How can I detect the start and end of the different time periods? Could anyone point me in some direction?
Once the time zones are divided, I guess I could average several of the maximum readings in each zone to get the maximum allowed stress.

I would take enough values to cover, say, 1h, and calculate their average.
After that, you compare each average with the one before it.
Some pseudocode to make it visual:
class Chunk {
    private double[] values; // Samples for one hour, for example.

    double average() {
        return java.util.Arrays.stream(values).average().orElse(Double.NaN);
    }
}

enum Relation { FALLING, RISING, EQUAL }

static Relation[] algorithm(Chunk[] chunks, double tolerance) {
    double[] averages = new double[chunks.length];
    for (int i = 0; i < chunks.length; i++)
        averages[i] = chunks[i].average();
    // Got the averages; now classify each step as rising, falling or equal.
    // relations[0] stays null because the first chunk has no predecessor.
    Relation[] relations = new Relation[chunks.length];
    for (int i = 1; i < chunks.length; i++) {
        double diff = averages[i] - averages[i - 1];
        if (Math.abs(diff) <= tolerance) // A bit of difference is allowed (like deviations of +-3).
            relations[i] = Relation.EQUAL;
        else
            relations[i] = diff > 0 ? Relation.RISING : Relation.FALLING;
    }
    // After that, you have to find sequences of many FALLING or RISING, followed by many EQUAL.
    return relations;
}
To proceed with this array of relations, you could divide it into smaller arrays and calculate the average of each (for example with FALLING=0, RISING=1, EQUAL=2). After that you simply "merge" consecutive equal entries like this:
F=FALLING
R=RISING
E=EQUAL
//Before merging
[RREEEEFFEEEEERRREEEE]
//After merging
[REFERE]
And there you can see the mountains and valleys.
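The chunk-averaging and merging steps above can be sketched in Python. This is only a minimal sketch: the function names, the `tolerance` parameter and the (label, start, end) output shape are illustrative assumptions, not part of the pseudocode above.

```python
from itertools import groupby
from statistics import mean

def chunk_averages(values, chunk_size):
    """Average consecutive, non-overlapping chunks of `chunk_size` samples."""
    return [mean(values[i:i + chunk_size])
            for i in range(0, len(values) - chunk_size + 1, chunk_size)]

def relations(averages, tolerance):
    """Label each step between neighbouring averages as R, F or E."""
    labels = []
    for prev, cur in zip(averages, averages[1:]):
        diff = cur - prev
        if abs(diff) <= tolerance:   # small differences count as "equal"
            labels.append("E")
        else:
            labels.append("R" if diff > 0 else "F")
    return labels

def merge_runs(labels):
    """Collapse runs of identical labels into (label, start, end) triples."""
    runs, pos = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((label, pos, pos + length - 1))
        pos += length
    return runs

runs = merge_runs("RREEEEFFEEEEERRREEEE")
print("".join(label for label, _, _ in runs))  # REFERE
```

Keeping the start and end index of every run makes it possible to map each mountain or valley back to a time range later on.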
Now, to get the exact values, when a mountain or valley starts, you have to extend Chunk a bit.
class Chunk {
    // Each value on the x-axis (time) together with its value on the y-axis.
    private Tuple<Time, Double>[] values;
    // The time range this chunk covers, together with the average value over that range.
    Tuple<Tuple<Time, Time>, Double> average();
}
Furthermore, you can't use the raw Relation anymore; you have to wrap it together with the range from where it starts to where it ends.

Related

How to measure "homogeneity" of time series?

I have two time series, see this pic:
I need to measure the level of "homogeneity" of the series. So the first one looks very fragmented, so it should have low value close to zero and the second one should have a high value.
Any ideas of an algorithm I could use?
I'm not sure what is meant by homogeneity, but there is a well-established notion of stationarity of a time series. Basically, a time series is stationary if its rolling mean and standard deviation are constant across time. Both of your time series seem to have roughly constant mean, but the top one has a standard deviation that changes wildly across time; sometimes it's almost zero, and at other times, it's very large. Perhaps you could take the standard deviation of the rolling standard deviation, which will be far higher for the top series than for the bottom. If you can load them into pandas as top and bottom, it might look like
import numpy as np
top_nonstationarity = np.std(top.rolling(window_size).std())
bottom_nonstationarity = np.std(bottom.rolling(window_size).std())
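For readers without pandas at hand, the same idea (the standard deviation of a rolling standard deviation) can be sketched with the standard library only. The function names, the window length and the two toy series are assumptions made for illustration.

```python
from statistics import pstdev

def rolling_std(series, window):
    """Population standard deviation over every full sliding window."""
    return [pstdev(series[i:i + window])
            for i in range(len(series) - window + 1)]

def nonstationarity(series, window):
    """Spread of the rolling standard deviation; 0 means constant volatility."""
    return pstdev(rolling_std(series, window))

# Toy data (hypothetical): same spread everywhere vs. quiet first half, loud second half.
steady = [(-1) ** i for i in range(40)]
bursty = [0.0] * 20 + [3 * (-1) ** i for i in range(20)]

print(nonstationarity(steady, 4), nonstationarity(bursty, 4))
```

Note that this score is high for the fragmented series and low for the even one, so it would need to be inverted to match the scale asked for in the question.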
It might help to know more about the underlying difference between the series, or what you care about, but here goes...
I would subtract constants, if required, to give both series mean zero, then square them to get something resembling power, and filter this enough to smooth away what seems to be noise in the case of the lower series. Then compute and compare the variances of the two filtered powers: for the lower time series I would now expect a fairly constant line with a few drops down, and for the upper series something spending about half of its time near zero and about half of its time away from it.
Possible filters include a simple moving average, whatever your time series toolkit provides, and those described at https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter

Produce a similarity ranking for a set of time series signals

Given a set of 24 hour signal with hourly data points which represent energy consumption patterns, provide each with a similarity score? The peaks vary in their height, width and placement in the signal. The aim is to rank the signals such that those with closer scores are more similar than those with further scores (like how age or income works). I.e., there should be a lower gap between these two signals' scores
than these.
Finding the correlation between each one with a base case was not adequate since if one signal had a high peak in the morning and a low one in the afternoon, a signal with peaks in the opposite pattern would be classified as similar to the first signal. Returns correlation was also not suitable. The same issue was produced when using RMSE between signals and a base case.
After some thought, I attempted to find the peaks of a signal and then score a peak in the following way:
public double score() {
    int b1 = max - start;
    int b2 = end - max;
    double h1 = maxHeight - startHeight;
    double h2 = maxHeight - endHeight;
    double a1 = 0.5 * h1 * b1;
    double a2 = 0.5 * h2 * b2;
    return Math.sqrt(Math.pow(h1, 2) + Math.pow(h2, 2));
}
Where start, max, and end represent the start, max, and end times of the peak, respectively.
I think this could be a working method; however, I'm having difficulty finding the peaks themselves. All the methods I've tried have some flaws.
I have tried the method in this post: Peak signal detection in realtime timeseries data
Some peaks were defined as starting too early. Since some peaks could persist for several hours, I tried making lag longer. However, if the lag was too long, peaks beginning before time=lag were missed.
I also tried to use the standard deviation of the gradient as a signal that a peak was beginning, i.e., if the gradient at a given point exceeds factor*stdev(all gradients), then a peak is beginning. The factor was 0.6.
This failed when certain signals had one very steep peak in the evening and a shallower one in the morning (or vice versa). The stdev of the gradient would be too high and the algorithm missed the shallower peak. If I made the factor low enough to pick up the shallow peak as well, false peaks were detected.
Inspired by the method in the post above, I tried using a moving stdev of the gradient. However, this algorithm still misses some peaks.

How to detect the precise sampling interval from samples stored in a database?

A hardware sensor is sampled precisely (precise period of sampling) using a real-time unit. However, the time value is not sent to the database together with the sampled value. Instead, time of insertion of the record to the database is stored for the sample in the database. The DATETIME type is used, and the GETDATE() function is used to get current time (Microsoft SQL Server).
How can I reconstruct the precise sampling times?
As the sampling interval is (should be) 60 seconds exactly, there was no need earlier for more precise solution. (This is an old solution, third party, with a lot of historical samples. This way it is not possible to fix the design.)
For processing of the samples, I need to reconstruct the correct time instances for the samples. There is no problem with shifting the time of the whole sequence (that is, it does not matter whether the start time is rather off, not absolute). On the other hand, the sampling interval should be detected as precisely as possible. I also cannot be sure, that the sampling interval was exactly 60 seconds (as mentioned above). I also cannot be sure, that the sampling interval was really constant (say, slight differences based on temperature of the device).
When processing the samples, I want to get:
start time
the sampling interval
the sequence of the sample values
When reconstructing the samples, I need to convert it back to tuples:
time of the sample
value of the sample
Because of that, for the sequence with n samples, the time of the last sample should be equal to start_time + sampling_interval * (n - 1), and it should be reasonably close to the original end time stored in the database.
Think of the stored sample times as oscillating slightly around the real sample times (a constant delay between the sampling and the insertion into the database is not a problem here).
I was thinking about calculating the mean value and the corrected standard deviation of the interval calculated from the previous and current sample times.
Discontinuity detection: If the calculated interval is more than 3 sigma off the mean value, I would consider it a discontinuity of the sampled curve (say, the machine was switched off, or some outside event led to missing samples). In that case, I want to start processing a new sequence. (The sampling frequency could also have changed.)
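The 3-sigma discontinuity test described above might look roughly like this in Python. It is a sketch under assumptions: timestamps are already converted to seconds, `split_on_gaps` and `n_sigma` are hypothetical names, and the mean and corrected (sample) standard deviation are computed over the whole record, so one very long outage inflates sigma and can hide smaller gaps.

```python
from statistics import mean, stdev

def split_on_gaps(times, n_sigma=3.0):
    """Split timestamps (in seconds) into sequences at outlying intervals."""
    intervals = [b - a for a, b in zip(times, times[1:])]
    mu, sigma = mean(intervals), stdev(intervals)  # stdev = corrected std-dev
    sequences, current = [], [times[0]]
    for t, dt in zip(times[1:], intervals):
        if sigma > 0 and abs(dt - mu) > n_sigma * sigma:
            sequences.append(current)   # discontinuity: close the running sequence
            current = []
        current.append(t)
    sequences.append(current)
    return sequences

# Hypothetical record: 50 samples at 60 s, a one-hour outage, 50 more samples.
times = [60.0 * i for i in range(50)]
times += [times[-1] + 3600.0 + 60.0 * i for i in range(50)]
print([len(seq) for seq in split_on_gaps(times)])
```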
Is there any well-known approach to the problem? If yes, can you point me to the article(s)? Or can you give me the name or acronym of the algorithm?
+1 to looking at the difference sequence. We can model the difference sequence as the sum of a low frequency truth (the true rate of the samples, slowly varying over time) and high frequency noise (the random delay to get the sample into the database). You want a low-pass filter to remove the latter.
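As a rough illustration of that suggestion, a moving average is about the simplest low-pass filter for the difference sequence, and a least-squares line gives a single start time and interval when the period is assumed constant within one sequence. The function names, the window length and the synthetic timestamps below are assumptions for the sake of the example.

```python
from statistics import mean

def moving_average(xs, window):
    """Simple low-pass filter: mean over a sliding window."""
    return [mean(xs[i:i + window]) for i in range(len(xs) - window + 1)]

def fit_interval(times):
    """Least-squares fit of times[k] ~ start + interval * k."""
    n = len(times)
    k_mean, t_mean = (n - 1) / 2, mean(times)
    num = sum((k - k_mean) * (t - t_mean) for k, t in enumerate(times))
    den = sum((k - k_mean) ** 2 for k in range(n))
    interval = num / den
    return t_mean - interval * k_mean, interval

# Hypothetical insertion times: a true 60 s period plus a small alternating delay.
stored = [60.0 * k + (0.3 if k % 2 else 0.0) for k in range(100)]
diffs = [b - a for a, b in zip(stored, stored[1:])]
smoothed = moving_average(diffs, 10)        # slowly varying estimate of the period
start, interval = fit_interval(stored)      # one period for the whole sequence
reconstructed = [start + interval * k for k in range(len(stored))]
```

The fitted line satisfies the constraint from the question by construction: the last reconstructed time is start + interval * (n - 1).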

How can I detect these audio abnormalities?

iOS has an issue recording through some USB audio devices. It cannot be reliably reproduced (happens every 1 in ~2000-3000 records in batches and silently disappears), and we currently manually check our audio for any recording issues. It results in small numbers of samples (1-20) being shifted by a small number that sounds like a sort of 'crackle'.
They look like this:
closer:
closer:
another, single sample error elsewhere in the same audio file:
The question is, how can these be detected algorithmically (assuming direct access to samples) whilst not triggering false positives on high frequency audio with waveforms like this:
Bonus points: after determining as many errors as possible, how can the audio be 'fixed'?
Dirty audio file - pictured
Another dirty audio file
Clean audio with valid high frequency - pictured
More bonus points: what could be causing this issue in the iOS USB audio drivers/hardware (assuming it is there).
I do not think there is an out of the box solution to find the disturbances, but here is one (non standard) way of tackling the problem. Using this, I could find most intervals and I only got a small number of false positives, but the algorithm could certainly use some fine tuning.
My idea is to find the start and end point of the deviating samples. The first step should be to make these points stand out more clearly. This can be done by taking the logarithm of the data and taking the differences between consecutive values.
In MATLAB I load the data (in this example I use dirty-sample-other.wav)
y1 = wavread('dirty-sample-pictured.wav');
y2 = wavread('dirty-sample-other.wav');
y3 = wavread('clean-highfreq.wav');
data = y2;
and use the following code:
logdata = log(1+data);
difflogdata = diff(logdata);
So instead of this plot of the original data:
we get:
where the intervals we are looking for stand out as a positive and negative spike. For example zooming in on the largest positive value in the plot of logarithm differences we get the following two figures. One for the original data:
and one for the difference of logarithms:
This plot could help with finding the areas manually but ideally we want to find them using an algorithm. The way I did this was to take a moving window of size 6, computing the mean value of the window (of all points except the minimum value), and compare this to the maximum value. If the maximum point is the only point that is above the mean value and at least twice as large as the mean it is counted as a positive extreme value.
I then used a threshold of counts, at least half of the windows moving over the value should detect it as an extreme value in order for it to be accepted.
Multiplying all points with (-1) this algorithm is then run again to detect the minimum values.
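For readers without MATLAB, the windowed test described above can be ported to Python roughly as follows. This is a sketch, not a verified translation: it assumes samples strictly greater than -1 so that log(1 + x) is defined, and the names `log_diff` and `count_extremes` are made up for illustration.

```python
import math

def log_diff(samples):
    """Difference of log(1 + x) between consecutive samples (needs x > -1)."""
    logs = [math.log1p(x) for x in samples]
    return [b - a for a, b in zip(logs, logs[1:])]

def count_extremes(data, k=6):
    """For each point, count the windows of k + 1 points in which it is the lone spike."""
    counts = [0] * len(data)
    for end in range(k, len(data)):
        window = data[end - k:end + 1]
        lowest = min(window)
        above_min = [v for v in window if v > lowest]
        if not above_min:              # flat window: nothing stands out
            continue
        mean_rest = sum(above_min) / len(above_min)
        peak = max(window)
        # Spike = the only value above the mean, and at least twice the mean.
        if sum(v > mean_rest for v in window) == 1 and peak > 2 * mean_rest:
            counts[end - k + window.index(peak)] += 1
    return counts

# Toy data (hypothetical): a small repeating pattern with one large spike at index 10.
data = [[0.01, 0.02, 0.03][i % 3] for i in range(20)]
data[10] = 1.0
counts = count_extremes(data)
```

Negative extremes can be found by running `count_extremes` on the negated data, and a point is accepted once its count reaches the chosen threshold.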
Marking the positive extremes with "o" and negative extremes with "*" we get the following two plots. One for the differences of logarithms:
and one for the original data:
Zooming in on the left part of the figure showing the logarithmic differences we can see that most extreme values are found:
It seems like most intervals are found and there are only a small number of false positives. For example running the algorithm on 'clean-highfreq.wav' I only find one positive and one negative extreme value.
Single values that are falsely classified as extreme values could perhaps be weeded out by matching start and end-points. And if you want to replace the lost data you could use some kind of interpolation using the surrounding data-points, perhaps even a linear interpolation will be good enough.
Here is the MATLAB-code I used:
function test20()
    clc
    clear all
    y1 = wavread('dirty-sample-pictured.wav');
    y2 = wavread('dirty-sample-other.wav');
    y3 = wavread('clean-highfreq.wav');
    data = y2;
    logdata = log(1+data);
    difflogdata = diff(logdata);
    figure,plot(data),hold on,plot(data,'.')
    figure,plot(difflogdata),hold on,plot(difflogdata,'.')
    figure,plot(data),hold on,plot(data,'.'),xlim([68000,68200])
    figure,plot(difflogdata),hold on,plot(difflogdata,'.'),xlim([68000,68200])
    k = 6;
    myData = difflogdata;
    myPoints = findPoints(myData,k);
    myData2 = -difflogdata;
    myPoints2 = findPoints(myData2,k);
    figure
    plotterFunction(difflogdata,myPoints>=k,'or')
    hold on
    plotterFunction(difflogdata,myPoints2>=k,'*r')
    figure
    plotterFunction(data,myPoints>=k,'or')
    hold on
    plotterFunction(data,myPoints2>=k,'*r')
end

function myPoints = findPoints(myData,k)
    iterationVector = k+1:length(myData);
    myPoints = zeros(size(myData));
    for i = iterationVector
        subVector = myData(i-k:i);
        meanSubVector = mean(subVector(subVector>min(subVector)));
        [maxSubVector, maxIndex] = max(subVector);
        if (sum(subVector>meanSubVector) == 1 && maxSubVector>2*meanSubVector)
            myPoints(i-k-1+maxIndex) = myPoints(i-k-1+maxIndex) +1;
        end
    end
end

function plotterFunction(allPoints,extremeIndices,markerType)
    extremePoints = NaN(size(allPoints));
    extremePoints(extremeIndices) = allPoints(extremeIndices);
    plot(extremePoints,markerType,'MarkerSize',15),
    hold on
    plot(allPoints,'.')
    plot(allPoints)
end
Edit - comments on recovering the original data
Here is a slightly zoomed out view of figure three above: (the disturbance is between 6.8 and 6.82)
When I examine the values, your theory about the data being mirrored to negative values does not seem to fit the pattern exactly. But in any case, my thought about just removing the differences is certainly not correct. Since the surrounding points do not seem to be altered by the disturbance, I would probably go back to the original idea of not trusting the points within the affected region and instead using some sort of interpolation using the surrounding data. It seems like a simple linear interpolation would be a quite good approximation in most cases.
To answer the question of why it happens -
A USB audio device and host are not clock synchronous - that is to say that the host cannot accurately recover the relationship between the host's local clock and the word-clock of the ADC/DAC on the audio interface. Various techniques do exist for clock-recovery with various degrees of effectiveness. To add to the problem, the bus clock is likely to be unrelated to either of the two audio clocks.
Whilst you might imagine this not to be too much of a concern for audio receive - audio capture callbacks could happen when there is data - audio interfaces are usually bi-directional and the host will be rendering audio at regular interval, which the other end is potentially consuming at a slightly different rate.
In-between are several sets of buffers, which can over- or under-run, which is what looks to be happening here; the interval between it happening certainly seems about right.
You might find that changing USB audio device to one built around a different chip-set (or, simply a different local oscillator) helps.
As an aside, both IEEE1394 audio and MPEG transport streams have the same clock recovery requirement. Both of them solve the problem by embedding a local clock reference packet into the serial bitstream in a very predictable way, which allows accurate clock recovery on the other end.
I think the following algorithm can be applied to samples in order to determine a potential false positive:
First, scan for a high amount of high frequency, either by FFT'ing the sound block by block (256 values maybe), or by counting the consecutive samples above and below zero. The latter should keep track of the maximum consecutive run above zero, the maximum consecutive run below zero, the amount of small transitions around zero, and the current volume of the block (0..1, as Audacity displays it). Then the block counts as high frequency if the maximum consecutive run is below 5 (sampling at 44100 Hz, and given that zeroes are consecutive while outstanding samples are single, 5 corresponds to a frequency of 4410 Hz, which is pretty high), or if the sum of the small transitions' lengths is above a certain value depending on the maximum consecutive run (I believe a first approximation would be 3*5*block size/distance between two maximums, which roughly equates to the period of the loudest FFT frequency). It should also be measured both above and below the threshold, as we can end up with an erroneous peak, which will likely be detected by the difference between the main tempo measured on below-zero and on above-zero maximums, and also by the std-dev of the peaks. If high frequency is dominant, this block is eligible only for zero-value testing, and a special means to repair the data will be needed. If high frequency is significant, that is, there is also a dominant low frequency detected, we can search for peaks bigger than 3.0*high frequency volume, as well as abnormal zeroes in this block.
Also, your gaps seem to be either highly extending or plain zero, with high extends to be single errors, and zero errors range from 1-20. So, if there is a zero range with values under 0.02 absolute value, which is directly surrounded by values of 0.15 (a variable to be finetuned) or higher absolute value AND of the same sign, count this point as an error. Single values that stand out can be detected if you calculate 2.0*(current sample)-(previous sample)-(next sample) and if it's above a certain threshold (0.1+high frequency volume, or 3.0*high frequency volume, whichever is bigger), count this as an error and average.
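The two tests in the paragraph above (near-zero runs enclosed by loud samples of the same sign, and single outstanding samples via 2*current - previous - next) could be sketched as follows. The default thresholds are the values quoted above; the function names are hypothetical, and taking the absolute value of the second difference is an assumption so that downward spikes are caught as well.

```python
def zero_gaps(samples, quiet=0.02, loud=0.15):
    """Find (first, last) index pairs of near-zero runs enclosed by loud same-sign samples."""
    gaps, i, n = [], 1, len(samples)
    while i < n - 1:
        if abs(samples[i]) < quiet:
            j = i
            while j + 1 < n and abs(samples[j + 1]) < quiet:
                j += 1                      # extend the run of near-zero samples
            if j + 1 < n:
                before, after = samples[i - 1], samples[j + 1]
                if abs(before) >= loud and abs(after) >= loud and before * after > 0:
                    gaps.append((i, j))
            i = j + 1
        else:
            i += 1
    return gaps

def single_outliers(samples, threshold):
    """Indices where |2*x[i] - x[i-1] - x[i+1]| exceeds the threshold."""
    return [i for i in range(1, len(samples) - 1)
            if abs(2.0 * samples[i] - samples[i - 1] - samples[i + 1]) > threshold]

print(zero_gaps([0.30, 0.31, 0.0, 0.01, 0.0, 0.29, 0.30]))      # [(2, 4)]
print(single_outliers([0.0, 0.1, 0.2, 0.9, 0.4, 0.5], 1.0))     # [3]
```

With the absolute value, the neighbours of a spike respond with roughly half the magnitude of the spike itself, so the threshold has to sit above that level to avoid flagging them too.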
What to do with zero gaps found - we can copy values from 1 period backwards and 1 period forwards (averaging), where "period" is of the most significant frequency of the FFT of the block. If the "period" is smaller than the gap (say we've detected a gap of zeroes in a high-pitched part of the sound), use two or more periods, so the source data will all be valid (in this case, no averaging can be done, as it's possible that the signal 2 periods forward from the gap and 2 periods back will be in counterphase). If there are more than one frequency of about equal amplitude, we can plain sample these with correct phases, cutting the rest of less significant frequencies altogether.
The outstanding sample should IMO just be averaged by 2-4 surrounding samples, as there seems to be only a single sample ever encountered in your sound files.
The discrete wavelet transform (DWT) may be the solution to your problem.
An FFT calculation is not very useful in your case since it's an average representation of the relative frequency content over the entire duration of the signal, which makes it impossible to detect momentary changes. The discrete short-time Fourier transform (STFT) tries to tackle this by computing the DFT for short consecutive time-blocks of the signal, the length of which is determined by the length (and shape) of a window. But since the resolution of the DFT depends on the data/block length, there is a trade-off between resolution in frequency and resolution in time, and finding this magical fixed window size can be tricky!
What you want is a time-frequency analysis method with good time resolution for high-frequency events, and good frequency resolution for low-frequency events... Enter the discrete wavelet transform!
There are numerous wavelet transforms for different applications and, as you might expect, it's computationally heavy. The DWT may not be a practical solution to your problem, but it's worth considering. Good luck with your problem. Some Friday-evening reading:
http://klapetek.cz/wdwt.html
http://etd.lib.fsu.edu/theses/available/etd-11242003-185039/unrestricted/09_ds_chapter2.pdf
http://en.wikipedia.org/wiki/Wavelet_transform
http://en.wikipedia.org/wiki/Discrete_wavelet_transform
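To make the idea a little more concrete, here is one level of the simplest wavelet transform, the Haar DWT, in plain Python. It is only a sketch of the transform itself: the choice of the Haar wavelet and the name `haar_step` are assumptions, and whether the detail coefficients separate crackles from legitimate high-frequency content would have to be tested on the real files.

```python
import math

def haar_step(samples):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    pairs = list(zip(samples[0::2], samples[1::2]))  # a trailing odd sample is dropped
    s = math.sqrt(2.0)
    approx = [(a + b) / s for a, b in pairs]   # local averages (low-pass half)
    detail = [(a - b) / s for a, b in pairs]   # local differences (high-pass half)
    return approx, detail

approx, detail = haar_step([1.0, 1.0, 2.0, 0.0])
```

Applying `haar_step` again to `approx` gives the next, coarser level; abrupt single-sample jumps show up as large values in the finest `detail` level at the position where they occur.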
You can try the following super-simple approach (maybe it's enough):
Take each point in your wave-form and subtract its predecessor (look at the changes from one point to the next).
Look at the distribution of these changes and find their standard deviation.
If any given difference is beyond X times this standard deviation (either above or below), flag it as a problem.
Determine the best value for X by playing with it and seeing how well it performs.
Most "problems" should come as a pair of two differences beyond your cutoff, one going up, and one going back down.
To stick with the super-simple approach, you can then fix the data by just interpolating linearly between the last good point before your problem-section and the first good point after. (Make sure you don't just delete the points as this will influence (raise) the pitch of your audio.)
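The steps above can be sketched in a few lines of Python. The function names and the toy signal are assumptions, and "beyond X standard deviations" is interpreted here as distance from the mean difference, which is close to zero for audio.

```python
from statistics import mean, pstdev

def flag_jumps(samples, x):
    """Indices i where samples[i] - samples[i-1] lies more than x std-devs from the mean step."""
    diffs = [b - a for a, b in zip(samples, samples[1:])]
    mu, limit = mean(diffs), x * pstdev(diffs)
    return [i + 1 for i, d in enumerate(diffs) if abs(d - mu) > limit]

def interpolate(samples, first_bad, last_bad):
    """Replace samples[first_bad..last_bad] by a straight line between the good neighbours."""
    fixed = list(samples)
    lo, hi = first_bad - 1, last_bad + 1       # last good point before, first good point after
    for i in range(first_bad, last_bad + 1):
        frac = (i - lo) / (hi - lo)
        fixed[i] = samples[lo] + frac * (samples[hi] - samples[lo])
    return fixed

# Toy signal (hypothetical): a ramp in which samples 4 and 5 were shifted upwards.
signal = [0.1 * i for i in range(10)]
signal[4] += 0.5
signal[5] += 0.5
jumps = flag_jumps(signal, 2.0)                    # one jump up, one jump back down
repaired = interpolate(signal, jumps[0], jumps[1] - 1)
```

The number of samples stays the same after the repair, so the pitch is not affected.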

Estimating number of results in Google App Engine Query

I'm attempting to estimate the total amount of results for app engine queries that will return large amounts of results.
In order to do this, I assigned a random floating point number between 0 and 1 to every entity. Then I executed the query for which I wanted to estimate the total results with the following 3 settings:
* I ordered by the random numbers that I had assigned in ascending order
* I set the offset to 1000
* I fetched only one entity
I then plugged the entity's random value that I had assigned for this purpose into the following equation to estimate the total results (since I used 1000 as the offset above, the value of OFFSET would be 1000 in this case):
1 / RANDOM * OFFSET
The idea is that since each entity has a random number assigned to it, and I am sorting by that random number, the entity's random number assignment should be proportionate to the beginning and end of the results with respect to its offset (in this case, 1000).
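A small offline simulation of this estimator, with an in-memory list standing in for the datastore, might look like the sketch below. The function name and the entity count of 60000 are assumptions chosen for illustration.

```python
import random

def estimate_total(random_values, offset):
    """Estimate how many values there are from the one found at `offset` in sorted order."""
    ordered = sorted(random_values)      # stands in for "order by random, ascending"
    value = ordered[offset]              # stands in for "offset=offset, fetch one entity"
    return offset / value                # same as 1 / RANDOM * OFFSET

rng = random.Random(0)                   # fixed seed so the run is repeatable
values = [rng.random() for _ in range(60000)]
for offset in (1000, 5000, 20000):
    print(offset, round(estimate_total(values, offset)))
```

With uniformly distributed random values the estimates should scatter on both sides of the true count, with the scatter shrinking as the offset grows.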
The problem I am having is that the results I am getting are giving me low estimates. And the estimates are lower, the lower the offset. I had anticipated that the lower the offset that I used, the less accurate the estimate should be, but I thought that the margin of error would be both above and below the actual number of results.
Below is a chart demonstrating what I am talking about. As you can see, the predictions get more consistent (accurate) as the offset increases from 1000 to 5000. But then the predictions predictably follow a fourth-degree polynomial (y = -5E-15x^4 + 7E-10x^3 - 3E-05x^2 + 0.3781x + 51608).
Am I making a mistake here, or does the standard python random number generator not distribute numbers evenly enough for this purpose?
Thanks!
Edit:
It turns out that this problem is due to my mistake. In another part of the program, I was grabbing entities from the beginning of the series, doing an operation, then re-assigning the random number. This resulted in a denser distribution of random numbers towards the end.
I did a little more digging into this concept, fixed the problem, and tried it again on a different query (so the number of results is different from above). I found that this idea can be used to estimate the total results for a query. One thing of note is that the "error" is very similar for offsets that are close by. When I did a scatter chart in Excel, I expected the accuracy of the predictions at each offset to "cloud", meaning that offsets at the very beginning would produce a larger, less dense cloud that would converge to a very tiny, dense cloud around the actual value as the offsets got larger. This is not what happened, as you can see below in the chart of how far off the predictions were at each offset. Where I thought there would be a cloud of dots, there is a line instead.
This is a chart of the maximum after each offset. For example the maximum error for any offset after 10000 was less than 1%:
When using GAE it makes a lot more sense not to try to do large amounts of work on reads; it's built and optimized for very fast request turnarounds. In this case it's actually more efficient to maintain a count of your results as and when you create the entities.
If you have a standard query, this is fairly easy - just use a sharded counter when creating the entities. You can seed this using a map reduce job to get the initial count.
If you have queries that might be dynamic, this is more difficult. If you know the range of possible queries that you might perform, you'd want to create a counter for each query that might run.
If the range of possible queries is infinite, you might want to think of aggregating counters or using them in more creative ways.
If you tell us the query you're trying to run, there might be someone who has a better idea.
Some quick thoughts:
Have you tried the Datastore Statistics API? It may provide fast and accurate results if you don't update your entity set very frequently.
http://code.google.com/appengine/docs/python/datastore/stats.html
[EDIT1.]
I did some math, and I think the estimation method you proposed here could be rephrased as an "order statistic" problem.
http://en.wikipedia.org/wiki/Order_statistic#The_order_statistics_of_the_uniform_distribution
For example:
If the actual number of entities is 60000, the question becomes: "What is the probability that your 1000th [2000th, 3000th, ...] sample falls in the interval [l, u], so that the total number of entities estimated from this sample has an acceptable error relative to 60000?"
If the acceptable error is 5%, the interval [l, u] will be [0.015873015873015872, 0.017543859649122806]
I think the probability won't be very large.
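One way to put a number on this is a quick Monte Carlo check. It relies on the fact, from the order-statistics article linked above, that the k-th smallest of n independent uniform values follows a Beta(k, n - k + 1) distribution. The function names, the trial count and the seed are assumptions, and the result only holds if the assigned random numbers really are independent and uniform.

```python
import random

def interval_for_error(k, n, rel_error):
    """Range of the k-th smallest value for which k / value is within rel_error of n."""
    return k / (n * (1 + rel_error)), k / (n * (1 - rel_error))

def hit_probability(k, n, rel_error, trials=2000, seed=0):
    """Monte Carlo estimate of P(l <= U_(k) <= u), with U_(k) ~ Beta(k, n - k + 1)."""
    rng = random.Random(seed)
    low, high = interval_for_error(k, n, rel_error)
    hits = sum(low <= rng.betavariate(k, n - k + 1) <= high for _ in range(trials))
    return hits / trials

low, high = interval_for_error(1000, 60000, 0.05)   # the interval [l, u] quoted above
probability = hit_probability(1000, 60000, 0.05)
print(low, high, probability)
```

Running it for several values of k shows how quickly the probability of staying within the 5% band grows with the offset.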
This doesn't directly deal with the calculations aspect of your question, but would using the count attribute of a query object work for you? Or have you tried that out and it's not suitable? As per the docs, it's only slightly faster than retrieving all of the data, but on the plus side it would give you the actual number of results.
http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count

Resources