Finding peaks and troughs, Part II (with corresponding definition) - algorithm

This is an update to a previous question I had about locating peaks and troughs. The previous question was this:
peaks and troughs in MATLAB (but with corresponding definition of a peak and trough)
This time around, I implemented the suggested answer, but I think there is still something wrong with the final algorithm. Can you please tell me what I did wrong in my code? Thanks.
function [vectpeak, vecttrough]=peaktroughmodified(x,cutoff)
% This function is a modified version of the algorithm used to identify
% peaks and troughs in a series of prices. It will be used to identify
% the head-and-shoulders pattern. The function gives you two vectors:
% PEAKS - an indicator vector that identifies the peaks in the function,
% and TROUGHS - an indicator vector that identifies the troughs of the
% function. The inputs are the vector of exchange rate series and the cutoff
% used for refining possible peaks and troughs.

% Find all possible peaks and troughs of our vector.
[posspeak,possploc]=findpeaks(x);
[posstrough,posstloc]=findpeaks(-x);
posspeak=posspeak';
posstrough=posstrough';

% Initialize vectors of peaks and troughs.
numobs=length(x);
prelimpeaks=zeros(numobs,1);
prelimtroughs=zeros(numobs,1);
numpeaks=numel(possploc);
numtroughs=numel(posstloc);

% Indicator for possible peaks and troughs.
for i=1:numobs
    for j=1:numpeaks
        if i==possploc(j)
            prelimpeaks(i)=1;
        end
    end
end
for i=1:numobs
    for j=1:numtroughs
        if i==posstloc(j)
            prelimtroughs(i)=1;
        end
    end
end

% Vector that gives location.
location=1:1:numobs;
location=location';

% From the list of possible peaks and troughs, find the peaks and troughs
% that fit the Chang and Osler [1999] definition:
% "A peak is a local maximum at least x percent higher than the preceding
% trough, and a trough is a local minimum at least x percent lower than the
% preceding peak." [Chang and Osler, p.640]

% Cutoffs
peakcutoff=1.0+cutoff;   % cutoff for peaks
troughcutoff=1.0-cutoff; % cutoff for troughs

% First peak and first trough are initialized as previous peaks/troughs.
prevpeakloc=possploc(1);
prevtroughloc=posstloc(1);

% Initialize vectors of final peaks and troughs.
vectpeak=zeros(numobs,1);
vecttrough=zeros(numobs,1);

% We first check whether we can start looking for peaks and troughs.
for i=1:numobs
    if prelimpeaks(i)==1
        if i>prevtroughloc
            ratio=x(i)/x(prevtroughloc);
            if ratio>peakcutoff
                vectpeak(i)=1;
                prevpeakloc=location(i);
            else
                vectpeak(i)=0;
            end
        end
    elseif prelimtroughs(i)==1
        if i>prevpeakloc
            ratio=x(i)/x(prevpeakloc);
            if ratio<troughcutoff
                vecttrough(i)=1;
                prevtroughloc=location(i);
            else
                vecttrough(i)=0;
            end
        end
    else
        vectpeak(i)=0;
        vecttrough(i)=0;
    end
end
end

I just ran it, and it seems to work if you make this change:
peakcutoff= 1/cutoff; % cutoff for peaks
troughcutoff= cutoff; % cutoff for troughs
I tested it with the following code, with a cutoff of 0.1 (so peaks must be more than 10 times larger than the preceding trough), and it looks reasonable:
x = randn(1,100).^2;
[vectpeak,vecttrough] = peaktroughmodified(x,0.1);
peaks = find(vectpeak);
troughs = find(vecttrough);
plot(1:100,x,peaks,x(peaks),'o',troughs,x(troughs),'o')
I strongly urge you to read up on vectorization in MATLAB. There are many wasted lines in your program, which make it difficult to read and will also make it very slow with big datasets. For instance, prelimpeaks and prelimtroughs can be completely defined without loops, in a single line each:
prelimpeaks(possploc) = 1;
prelimtroughs(posstloc) = 1;
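For context, here is a minimal sketch of how those indicator vectors could be built without the nested loops, using the same variable names as the question's code (this replaces the two double-loop blocks entirely):

% Vectorized construction of the candidate indicators.
[posspeak,possploc]   = findpeaks(x);   % candidate peaks and their indices
[posstrough,posstloc] = findpeaks(-x);  % candidate troughs (peaks of -x)
numobs        = length(x);
prelimpeaks   = zeros(numobs,1);
prelimtroughs = zeros(numobs,1);
prelimpeaks(possploc)   = 1;            % mark candidate peak locations
prelimtroughs(posstloc) = 1;            % mark candidate trough locations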

I think there are better techniques for finding peaks and troughs than the percentage-threshold technique given above. One is to fit a least-squares parabola to the data set; a technique for doing this is in the 1946 Frank Peters paper, "Parabolic Correlation, a New Descriptive Statistic." The fitted parabola will have an index of curvature, as Peters defines it. Find peaks and troughs by testing which points, when eliminated, minimize the absolute value of the index of curvature of the parabola. Once these points are discovered, test which are peaks and which are troughs by considering how the index of curvature changes when each point is excluded, which will depend on whether the original parabola had a positive or negative index of curvature. If you are concerned about contiguous points whose elimination achieves the minimum absolute curvature, add a constraint setting a minimum distance the identified points must be from each other. Another constraint would have to be the number of points identified; without it, this algorithm would remove all but two points, leaving a straight line without curvature.
Sometimes there are steep changes between contiguous points, and both should be included among the extreme points. Perhaps a percentage-threshold test for contiguous points that overrides the minimum-distance constraint would be useful.
Another solution might be to compute the Fast Fourier Transform of the series and remove points that minimize the lower spectra. FFT functions are more readily available than code that finds a least-squares fit parabola. There is a matrix-manipulation technique for determining the least-squares fit parabola that is easier to manage than Peters' approach. I saw it documented on the web someplace, but lost the link. Advice from anybody able to arrive at a least-squares fit parabola using matrix-vector notation would be appreciated.
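On that last point, here is a minimal sketch of the standard matrix-vector formulation of a least-squares parabola fit in MATLAB (my addition; it is the usual design-matrix/backslash approach, not taken from the Peters paper):

% Least-squares fit of y = c(1)*t.^2 + c(2)*t + c(3) in matrix-vector form.
t = (1:numel(x))';              % sample positions
y = x(:);                       % the series to fit
A = [t.^2, t, ones(size(t))];   % design matrix
c = A \ y;                      % backslash solves the least-squares problem
yfit = A*c;                     % fitted parabola at the sample positions

The sign of c(1) tells you whether the fitted parabola opens upward or downward, which is loosely analogous to the sign of the index of curvature mentioned above.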

Numerical instability?

I am working on a program that concerns the optimization of an objective function obj over the scalar beta. The true global minimum beta0 is set at beta0=1.
In the MWE below you can see that obj is constructed as the sum of the 100-R (here I use R=3) smallest eigenvalues of the 100x100 symmetric matrix u'*u. While obj "looks good" around the true global minimum, when I plot the objective function evaluated at much larger values of beta it becomes very unstable (running the MWE you can see that multiple local minima (and maxima) appear, associated with values of obj(beta) smaller than the true global minimum).
My guess is that there is some sort of "numerical instability" going on, but I am unable to find the source.
%Matrix dimensions
N=100;
T=100;
%Reproducibility
rng('default');
%True global minimum
beta0=1;
%Generating data
l=1+randn(N,2);
s=randn(T+1,2);
la=1+randn(N,2);
X(1,:,:)=1+(3*l+la)*(3*s(1:T,:)+s(2:T+1,:))';
s=s(1:T,:);
a=(randn(N,T));
Y=beta0*squeeze(X(1,:,:))+l*s'+a;
%Give "beta" a large value
beta=1e6;
%Compute objective function
u=Y-beta*squeeze(X(1,:,:));
ev=sort(eig(u'*u)); % sort eigenvalues
obj=sum(ev(1:100-3))/(N*T); % "obj" is sum of 97 smallest eigenvalues
This evaluates the objective function at obj(beta=1e6). I have noticed that some of the eigenvalues from eig(u'*u) are negative (see the object ev), even though by construction the matrix u'*u is positive semidefinite.
I am guessing this may have to do with floating-point arithmetic issues and may (partly) explain the instability of my function, but I am not sure.
Finally, this is what the objective function obj evaluated over a wide range of values for beta looks like:
% Now plot "obj" for a wide range of values of "beta"
clear obj
betaGrid=-5e5:100:5e5;
for i=1:length(betaGrid)
u=Y-betaGrid(i)*squeeze(X(1,:,:));
ev=sort(eig(u'*u));
obj(i)=sum(ev(1:100-3))/(N*T);
end
plot(betaGrid,obj,"*")
xlabel('\beta')
ylabel('obj')
This gives a figure which shows how unstable obj becomes for extreme values of beta.
The key here is noticing that computing eigenvalues can be a hard problem.
Actually, the condition number for this problem is K = norm(A) * norm(inv(A)) (don't compute it this way; use cond()). This means that a (relative) perturbation in the input (i.e. the matrix entries) gets amplified by the condition number when computing the output. I modified your code a little bit to compute and plot the condition number at each step. It turns out that for a large part of the range you are interested in it is greater than 10^17, which is abysmal. (Note that double floating-point numbers are accurate to not quite 16 significant (decimal) digits. This means that even the representation error of double floating-point numbers will here produce errors that make every digit "insignificant".) This already explains the bad behaviour. You should also note that we can usually compute the largest eigenvalues quite accurately, while the errors in the smaller (in magnitude) ones tend to increase.
If the condition number were better (closer to 1) I would have suggested computing the singular values, as they happen to be the eigenvalues here (due to the symmetry). The SVD is numerically more stable, but with such a bad condition number even this will not help. In the following modification of the final snippet I added a graph that plots the condition number.
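For reference, here is a minimal sketch of that singular-value route, using the variables from the question's snippet (my addition, not part of the original answer's code); since u'*u is symmetric positive semidefinite, its eigenvalues are the squared singular values of u:

% Eigenvalues of u'*u via the SVD of u: they equal the squared singular values.
% This avoids forming u'*u explicitly and never returns negative values.
sv = svd(u);                  % singular values of u, sorted descending
ev = sort(sv.^2);             % eigenvalues of u'*u, sorted ascending
obj = sum(ev(1:100-3))/(N*T);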
The only case where anything is salvageable is R=0: then we actually want to compute the sum of all eigenvalues, which happens to be the trace of our matrix, and that can easily be computed by just summing the diagonal entries.
To summarize: this problem seems to be inherently badly conditioned, so it doesn't really matter how you compute it. If you have a completely different formulation of the same problem, that might help.
% Now plot "obj" for a wide range of values of "beta"
clear obj
L = 5e5; % decrease to 5e-1 to see that the condition number is still >1e9 around the optimum
betaGrid=linspace(-L,L,1000);
condition = nan(size(betaGrid));
for i=1:length(betaGrid)
    disp(i/length(betaGrid))
    u=Y-betaGrid(i)*squeeze(X(1,:,:));
    A = u'*u;
    ev=sort(eig(A));
    condition(i) = cond(A);
    obj(i)=sum(ev(1:100-3))/(N*T); % for R=0 use trace(A)/(N*T);
end
subplot(1,2,1);
plot(betaGrid,obj,"*")
xlabel('\beta')
ylabel('obj')
subplot(1,2,2);
semilogy(betaGrid, condition);
title('condition number');

Parametric Scoring Function or Algorithm

I'm trying to come up with a way to arrive at a "score" based on an integer number of "points" that is adjustable using a small number (3-5?) of parameters. Preferably it would be simple enough to enter as a function/calculation in a spreadsheet, so the parameters can be tuned by the "designer" (not a programmer or mathematician). The first point has the most value, and eventually additional points have a fixed or nearly fixed value. The transition from the initial slope of the point value to the final slope should be smooth. See example shapes below.
Point values are always positive integers (0 pts = 0 score)
At some point the curve is linear (or nearly so), and all additional points have a fixed value
Preferably, parameters are understandable to a lay person, e.g.: "smoothness of the curve", "value of first point", "place where the additional value of points is fixed", etc
For parameters, an example of something ideal would be:
Value of first point: 10
Value of point #3 is: 5
Minimum value of additional points: 0.75
The exact shape of the curve is not too important, as long as the corner can be made more smooth or more sharp.
This is not for a game but more of a rating system in which multiple components (several of which might use this kind of scale) will be combined.
This seems like a non-traditional kind of question for SO/SE. I've done mostly financial software in my career, so I'm hoping there is some domain wisdom for this kind of thing I can tap into.
Implementation of Prune's Solution:
Google Sheet
Parameters:
Initial value (a)
Second value (b)
Minimum value (z)
Your decay ratio is b/a. It's simple from here: iterate through your values, applying the decay at each step, until you "peg" at the minimum:
x[n] = max( z, a * (b/a)^n )
// Take the larger of the computed "decayed" value,
// and the specified minimum.
The sequence x is your values list.
You can also truncate intermediate results if you want integers up to a certain point. Just apply the floor function to each computed value, but still allow z to override that if it gets too small.
Is that good enough? I know there's a discontinuity in the derivative function, which will be noticeable if the minimum and decay aren't pleasantly aligned. You can adjust this with a relative decay, translating the exponential decay curve from y = 0 to z.
base = z
diff = a-z
ratio = (b-z) / diff
x[n] = z + diff * ratio^n
In this case, you don't need the max function, since the decay term has a natural asymptote of 0, so x[n] approaches z on its own.
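As a quick illustration, here is a minimal MATLAB sketch of both variants (my own illustration of the formulas above; a = 10, b = 5, z = 0.75 are purely hypothetical parameter values, and the running sum is used as the total score for a given number of points, as the question describes):

% Hypothetical parameters: first point value, second point value, minimum value.
a = 10; b = 5; z = 0.75;
n = (0:9)';                        % point indices 0,1,2,...

% Variant 1: plain geometric decay, pegged at the minimum z.
x1 = max(z, a * (b/a).^n);

% Variant 2: decay translated so it asymptotes to z (no max needed).
d  = a - z;
r  = (b - z) / d;
x2 = z + d * r.^n;

% Total score after k points is the running sum of the point values.
score1 = cumsum(x1);
score2 = cumsum(x2);
disp([n, x1, x2, score1, score2])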

Matlab nearest neighbor / track points

I have a set of n complex numbers that move through the complex plane from time step 1 to nsampl. I want to plot those numbers and their trace over time (y-axis shows the imaginary part, x-axis the real part). The numbers are stored in an n x nsampl matrix. However, in each time step the order of the n points is random. So, in each time step, I pick a point from the last time step, find its nearest neighbour in the current time step, and put it at the same position in the vector as the point had in the last time step. Then I repeat that for all other n-1 points and go on to the next time step. This way every point in the previous step is associated with exactly one point in the new step (a 1:1 relation). My current implementation and an example are given below. However, my implementation is terribly slow (it takes about 10 s for 10 x 4000 complex numbers). As I want to increase both the set size n and the number of time frames nsampl, this is really important to me. Is there a smarter way to implement this to gain some performance?
Example with n=3 and nsampl=2:
% manually create a test vector X
X=zeros(3,2); % zeros(n,nsampl)
X(:,1)=[1+1i; 2+2i; 3+3i];
X(:,2)=[2.1+2i; 5+5i; 1.1+1.1i]; % <-- this is my vector with complex numbers

% vector sort algorithm
for k=2:nsampl
    Xlast=[real(X(:,k-1)) imag(X(:,k-1))]; % create vector with x/y-coords from last time step
    Xcur=[real(X(:,k)) imag(X(:,k))];      % create vector with x/y-coords from current time step
    for i=1:size(X,1) % loop over all n points
        idx = knnsearch(Xcur(i:end,:),Xlast(i,:)); % find nearest neighbor to Xlast(i,:), but only use the points not already associated, thus Xcur(i:end,:)
        idx = idx + i - 1;
        Xcur([i idx],:) = Xcur([idx i],:); % sort nearest neighbor to the same position in the vector as it was in the last time step
    end
    X(:,k) = Xcur(:,1)+1i*Xcur(:,2); % revert x/y coordinates to a complex number
end
Result:
X(:,2)=[1.1+1.1i; 2.1+2i; 5+5i];
Can anyone help me to speed up this code?
The problem you are trying to solve is a combinatorial optimization (assignment) problem, which is solved by the Hungarian algorithm (a.k.a. Munkres). Luckily there is an implementation for MATLAB available for download. Download the file and put it either on your search path or next to your function. The code to use it is:
for k=2:size(X,2)
    % build up a cost matrix; here the cost is the distance between two points
    pairwise_distances=abs(bsxfun(@minus,X(:,k-1),X(:,k).'));
    % let the algorithm find the optimal pairing
    permutation=munkres(pairwise_distances);
    % apply it
    X(:,k)=X(permutation,k);
end
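As a side note (my addition; it assumes a MATLAB release with implicit expansion, R2016b or later), the same cost matrix can be built without bsxfun:

% Same pairwise distance matrix via implicit expansion (R2016b+).
pairwise_distances = abs(X(:,k-1) - X(:,k).');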

GPS Data time to distance base transformation

I am developing an application that logs a GPS trace over time.
After the trace is complete, I need to convert the time-based data to distance-based data; that is to say, where the original trace had a lon/lat record every second, I need to convert that into a lon/lat record every 20 meters.
Smoothing the original data seems to be a well-understood problem, and I suppose I need something like a smoothing algorithm, but I'm struggling to see how to convert from a time-based data set to a distance-based data set.
This is an excellent question, and what makes it so interesting is that the data points should be assumed random. This means you cannot expect a beginning-to-end data graph that represents a well-behaved polynomial (like a sine or cosine wave). So you will have to work in small increments such that values on your x-axis (so to speak) do not oscillate, meaning Xn cannot be less than Xn-1. The next consideration would be the case of overlap or near-overlap of data points. Imagine I'm recording my GPS coordinates, we have stopped to chat or rest, and I walk randomly within a twenty-five-foot circle for the next five minutes. The question would be how to ignore this type of "data noise".
For simplicity, let's consider linear calculations where there is no approximation between two points; it's a straight line. This will probably be more than sufficient for your calculations. Now, given the comment above regarding random data points, you will want to traverse your data from the start point to the end point sequentially. Sequential termination occurs when you exceed the last data point or you have exceeded the overall distance for which to produce coordinates (like a subset). Let's assume your plot precision is X. This would be your 20 meters. As you traverse, there will be three conditions:
1. The distance between the two points is greater than your precision. Therefore save the start point plus the precision X. This also becomes your new start point.
2. The distance between the two points is equal to your precision. Therefore save the start point plus the precision X (or save the end point). This also becomes your new start point.
3. The distance between the two points is less than your precision. Therefore the precision is reduced by the distance between the points (the remainder is carried over), and the end point becomes your new start point.
Here is pseudo-code that might help get you started. Note: point y minus point x = the distance between them, and point x plus a value = a new point on the line between point x and point y, at that distance from point x.
recordedPoints = received from trace;
newPlotPoints = empty list of coordinates;
plotPrecision = 20;
immedPrecision = plotPrecision;
startPoint = recordedPoints[0];

for (int i = 1; i < recordedPoints.Length; i++)
{
    Delta = recordedPoints[i] - startPoint;
    if (immedPrecision < Delta)
    {
        newPlotPoints.Add(startPoint + immedPrecision);
        startPoint = startPoint + immedPrecision;
        immedPrecision = plotPrecision;
        i--;
    }
    else if (immedPrecision == Delta)
    {
        newPlotPoints.Add(startPoint + immedPrecision);
        startPoint = startPoint + immedPrecision;
        immedPrecision = plotPrecision;
    }
    else if (immedPrecision > Delta)
    {
        // Store the last data point regardless
        if (i == recordedPoints.Length - 1)
        {
            newPlotPoints.Add(startPoint + Delta);
        }
        startPoint = recordedPoints[i];
        immedPrecision = immedPrecision - Delta;
    }
}
Previously I mentioned "data noise". You can wrap the "if"/"else if" blocks in another "if" that scrubs this factor. The easiest way is to ignore a data point if it has not moved a given distance. Keep in mind this magic number must be small enough that sequentially recorded data points which are ignored don't sum to something large and meaningful, so putting a limit on the number of ignored data points might be a benefit.
With all this said, there are many ways to accurately perform this operation. One suggestion to take this subject to the next level is interpolation. For .NET there is an open-source library at http://www.mathdotnet.com. You can use their Numerics library, which contains interpolation routines, at http://numerics.mathdotnet.com/interpolation/. If you choose such a route, your next major hurdle will be deciding on the appropriate interpolation technique. If you are not a math guru, here is a bit of information to get you started: http://en.wikipedia.org/wiki/Interpolation. Frankly, polynomial interpolation using two adjacent points would be more than sufficient for your approximations, provided you keep in mind that Xn is not < Xn-1; otherwise your approximation will be skewed.
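To make the resampling idea concrete, here is a minimal sketch of the same operation done by linear interpolation over cumulative distance. It is written in MATLAB (like the other snippets in this collection) rather than .NET; the lat/lon/step variable names are made up for illustration, the distance formula is a rough flat-earth approximation, and it assumes the trace keeps moving so the cumulative distance is strictly increasing:

% lat, lon: column vectors recorded once per second; step: desired spacing in meters.
step = 20;
R    = 6371000;                                % mean Earth radius in meters
dlat = diff(lat) * pi/180;
dlon = diff(lon) * pi/180 .* cosd(lat(1:end-1));
seg  = R * hypot(dlat, dlon);                  % approximate length of each segment
cumd = [0; cumsum(seg)];                       % cumulative distance along the trace

d_new   = (0:step:cumd(end))';                 % one sample every 'step' meters
lat_new = interp1(cumd, lat, d_new);           % linear interpolation in distance
lon_new = interp1(cumd, lon, d_new);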
The last item to note: these calculations are two-dimensional and do not consider altitude (elevation) or the curvature of the earth. Here is some additional information in that regard: Calculate distance between two latitude-longitude points? (Haversine formula).
Nevertheless, hopefully this will point you in the correct direction. Without doubt this is not a trivial problem, so keeping the data-point range as small as possible while still being accurate will be to your benefit.
One other consideration might be to approximate using actual data points, using the precision to disregard excessive data. That way you are not essentially saving two lists of coordinates.
Cheers,
Jeff

Generate random sequence of integers differing by 1 bit without repeats

I need to generate a (pseudo) random sequence of N bit integers, where successive integers differ from the previous by only 1 bit, and the sequence never repeats. I know a Gray code will generate non-repeating sequences with only 1 bit difference, and an LFSR will generate non-repeating random-like sequences, but I'm not sure how to combine these ideas to produce what I want.
Practically, N will be very large, say 1000. I want to randomly sample this large space of 2^1000 integers, but I need to generate something like a random walk because the application in mind can only hop from one number to the next by flipping one bit.
Use any random number generator algorithm to generate an integer between 1 and N (or 0 to N-1 depending on the language). Use the result to determine the index of the bit to flip.
In order to satisfy randomness you will need to store previously generated numbers (thanks ShreevatsaR). Additionally, you may run into a scenario where no non-repeating answers are possible, so this will require a backtracking algorithm as well.
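Here is a minimal MATLAB sketch of that idea (my own illustration: it stores visited patterns in a containers.Map, uses a small N for the demo, and simply stops at a dead end rather than implementing the full backtracking step):

N     = 16;                          % number of bits (kept small for the demo)
steps = 50;                          % how many values to generate
state = rand(1, N) > 0.5;            % random starting bit pattern
visited = containers.Map('KeyType','char','ValueType','logical');
visited(char(state + '0')) = true;

walk = false(steps, N);
for s = 1:steps
    moved = false;
    for b = randperm(N)              % try bit indices in random order
        cand = state;
        cand(b) = ~cand(b);          % flip exactly one bit
        key = char(cand + '0');
        if ~isKey(visited, key)      % only accept patterns not seen before
            state = cand;
            visited(key) = true;
            moved = true;
            break
        end
    end
    if ~moved
        error('Dead end: all neighbours visited; backtracking would be needed here.');
    end
    walk(s, :) = state;
end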
This makes me think of fractals - following a boundary in a Julia set or something along those lines.
If N is 1000, use a 2^500 x 2^500 fractal bitmap (obviously don't generate it in advance - you can derive each pixel on demand, and most won't be needed). Each pixel move is one pixel up, down, left or right following the boundary line between pixels, like a simple bitmap tracing algorithm. So long as you start at the edge of the bitmap, you should return to the edge of the bitmap sooner or later - following a specific "colour" boundary should always give a closed curve with no self-crossings, if you look at the unbounded version of that fractal.
The x and y axes of the bitmap will need "Gray coded" co-ordinates, of course - a bit like oversized Karnaugh maps. Each step in the tracing (one pixel up, down, left or right) equates to a single-bit change in one bitmap co-ordinate, and therefore in one bit of the resulting values in the random walk.
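For reference, here is a minimal sketch of the standard binary-to-Gray mapping such co-ordinates would use (my addition, not code from this answer):

% Binary-to-Gray code: consecutive indices map to codes differing in exactly one bit.
idx  = uint32(0:7);
gray = bitxor(idx, bitshift(idx, -1));
disp([idx; gray])    % e.g. idx 3 -> 2 (binary 010) and idx 4 -> 6 (binary 110)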
EDIT
I just realised there's a problem. The more wrinkly the boundary, the more likely you are in the tracing to hit a point where you have a choice of directions, such as...
* | .
---+---
. | *
Whichever direction you enter this point from, you have a choice of three ways out. Choose the wrong one of the other two and you may return to this point later, so this is a possible self-crossing point and a possible repeat. You can eliminate the continue-in-the-same-direction choice - whichever way you turn should keep the same boundary colours to the left and right of your boundary path as you trace - but this still leaves a choice of two directions.
I think the problem can be eliminated by having at least three colours in the fractal, and by always keeping the same colour to one particular side (relative to the trace direction) of the boundary. There may be an "as long as the fractal isn't too wrinkly" proviso, though.
The last resort fix is to keep a record of points where this choice was available. If you return to the same point, backtrack and take the other alternative.
While an algorithm like this:
seed()
i = random(0, n)
repeat:
    i ^= 1 << (i % bitlen)
    yield i
…would return a random sequence of integers differing each by 1 bit, it would require a huge array for backtracing to ensure uniqueness of numbers.
Furthermore, your running time would increase exponentially(?) with increasing density of your backtrace, as the chance of hitting a new, non-repeating number decreases with every number in the sequence.
To reduce time and space one could try to incorporate one of these:
Bloom Filter
Use a Bloom Filter to drastically reduce the space (and time) needed for uniqueness-backtracing.
As Bloom Filters come with the drawback of occasionally producing false positives, a certain rate of falsely detected repeats (sic!) (which would thus be skipped) would occur in your sequence.
While the use of a Bloom Filter would reduce the space and time needed, your running time would still increase exponentially(?)…
Hilbert Curve
A Hilbert Curve represents a non-repeating (kind of pseudo-random) walk on a quadratic plane (or in a cube) with each step being of length 1.
Using a Hilbert Curve (on an appropriate distribution of values) one might be able to get rid of the need for a backtrace entirely.
To enable your sequence to get a seed you'd generate n (n being the dimension of your plane/cube/hypercube) random numbers between 0 and s (s being the length of your plane's/cube's/hypercube's sides).
Not only would a Hilbert Curve remove the need for a backtrace, it would also make the sequencer run in O(1) per number (in contrast to the use of a backtrace, which would make your running time increase exponentially(?) over time…)
To seed your sequence you'd wrap-shift your n-dimensional distribution by random displacements in each of its n dimensions.
PS: You might get better answers at CSTheory @ StackExchange (or not, see comments).
