Matlab nearest neighbor / track points - performance

I have a set of n complex numbers that move through the complex plane from time step 1 to nsampl. I want to plot those numbers and their traces over time (the y-axis shows the imaginary part, the x-axis the real part). The numbers are stored in an n x nsampl matrix. However, in each time step the order of the n points is random. Thus, in each time step I pick a point from the last time step, find its nearest neighbor in the current time step, and put that neighbor at the same position in the vector as the point had in the last step. Then I repeat that for the other n-1 points and go on to the next time step. This way every point in the previous step is associated with exactly one point in the new step (a 1:1 relation).

My current implementation and an example are given below. However, my implementation is terribly slow (it takes about 10 s for 10 x 4000 complex numbers). As I want to increase both the set size n and the number of time frames nsampl, this is really important to me. Is there a smarter way to implement this to gain some performance?
Example with n=3 and nsampl=2:
%manually create a test vector X
X=zeros(3,2); % zeros(n,nsampl)
X(:,1)=[1+1i; 2+2i; 3+3i];
X(:,2)=[2.1+2i; 5+5i; 1.1+1.1i]; % <-- this is my vector with complex numbers
%vector sort algorithm
for k=2:nsampl
    Xlast=[real(X(:,k-1)) imag(X(:,k-1))]; % x/y coords from the last time step
    Xcur=[real(X(:,k)) imag(X(:,k))]; % x/y coords from the current time step
    for i=1:size(X,1) % loop over all n points
        % find the nearest neighbor of Xlast(i,:) among the points not yet
        % associated, i.e. Xcur(i:end,:)
        idx = knnsearch(Xcur(i:end,:),Xlast(i,:));
        idx = idx + i - 1;
        % swap the nearest neighbor into the same row it occupied last step
        Xcur([i idx],:) = Xcur([idx i],:);
    end
    X(:,k) = Xcur(:,1)+1i*Xcur(:,2); % convert x/y coordinates back to a complex number
end
Result:
X(:,2)=[1.1+1.1i; 2.1+2i; 5+5i];
Can anyone help me to speed up this code?

The problem you are trying to solve is a combinatorial optimization problem (the assignment problem), which is solved by the Hungarian algorithm (a.k.a. Munkres). Luckily there is an implementation for MATLAB available for download. Download the file and put it either on your search path or next to your function. The code to use it is:
for k=2:size(X,2)
    % build up a cost matrix; here the cost is the distance between two points
    pairwise_distances=abs(bsxfun(@minus,X(:,k-1),X(:,k).'));
    % let the algorithm find the optimal pairing
    permutation=munkres(pairwise_distances);
    % apply it
    X(:,k)=X(permutation,k);
end
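If you have MATLAB R2019a or newer, the built-in matchpairs solves the same linear assignment problem, so no download is needed. A minimal sketch (same cost matrix as above, written with implicit expansion instead of bsxfun):

for k = 2:size(X,2)
    cost = abs(X(:,k-1) - X(:,k).'); % pairwise distances (implicit expansion, R2016b+)
    M = matchpairs(cost, max(cost(:)) + 1); % rows of M are matched [old new] index pairs
    perm = zeros(size(X,1), 1);
    perm(M(:,1)) = M(:,2); % perm(i) = current point assigned to old point i
    X(:,k) = X(perm, k);
end

The second argument of matchpairs is the cost of leaving a point unmatched; choosing it larger than every pairwise distance forces a complete matching.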

Related

Is it better to reduce the space complexity or the time complexity for a given program?

Grid Illumination: Given an NxN grid with an array of lamp coordinates, where each lamp illuminates every square in its row, every square in its column, and every square on its diagonals (think of a queen in chess). Given an array of query coordinates, determine whether each point is illuminated or not. The catch is that when a query is checked, all lamps adjacent to, or on, that query square get turned off. The ranges for the variables/arrays were about: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9
It seems like I can get one but not both. I tried to get the time down to logarithmic but can't seem to find a solution. I can reduce the space complexity, but then it isn't fast; exponential, in fact. Should I focus on speed or space? Also, if you have any input as to how you would solve this problem, please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note you can number all the diagonals that the inputs lie on by using the first point as the "origin" for both the nw-se and ne-sw directions. The two diagonals through this point are both numbered zero. The nw-se diagonal numbers increase per square going, say, northeast and decrease (going negative) to the southwest. Similarly, the ne-sw diagonals are numbered increasing toward, say, the northwest and decreasing (negative) to the southeast.
Given the origin, it's easy to write constant-time functions that go from (x,y) coordinates to the respective diagonal numbers.
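For instance, in MATLAB (my own sketch; lamps as an L-by-2 coordinate array is an assumption, and the helper names follow the maps described below):

x0 = lamps(1,1); y0 = lamps(1,2); % first input point as the "origin"
% Cells on the same nw-se diagonal share x - y, and cells on the same
% sw-ne diagonal share x + y, so both helpers are constant time.
nwSeDiagonalNumber = @(x, y) (x - y) - (x0 - y0);
swNeDiagonalNumber = @(x, y) (x + y) - (x0 + y0);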
Now each set of lamp coordinates is naturally associated with 4 numbers: (x, y, nw-se diag #, sw-ne diag #). You don't need to store these explicitly. Rather, you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal, and similarly for the other maps.
Given a query point, look up its corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list is longer than 3, removing adjacent squares can't make it empty, so the query point is lit. If a list has 3 or fewer entries, it's a constant-time operation to see whether they're all adjacent.
This solution requires the input points to be represented in 4 lists. Since they must be represented in at least one list anyway, you can argue that this algorithm requires only a constant factor of extra space with respect to the input (i.e. the same sort of cost as mergesort).
Run time is expected constant per query point: 4 hash table lookups.
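A sketch of the map building and the query test in MATLAB (containers.Map stands in for the hash tables; lamps as an L-by-2 array is again my assumption):

xMap = containers.Map('KeyType','double','ValueType','any');
for t = 1:size(lamps,1)
    key = lamps(t,1); % the same pattern builds yMap, nwSeMap and swNeMap
    if isKey(xMap, key)
        xMap(key) = [xMap(key); lamps(t,:)];
    else
        xMap(key) = lamps(t,:);
    end
end
% query (qx,qy): a list longer than 3 means the square stays lit no matter
% which lamps get switched off; otherwise check the few listed lamps for
% adjacency to the query directly
lit = isKey(xMap, qx) && size(xMap(qx),1) > 3; % likewise for the other 3 maps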
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient, and easiest, to run it on one big machine. With a billion lampposts and careful data structure choices, it wouldn't be hard to implement with 24 bytes per lamppost in an unboxed-structures language like C. So a ~32 GB RAM machine ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10-core machine ought to do a billion queries in well under a minute.
There is a very easy answer which works:
Create a grid of NxN.
Now for each lamp, increment the count of all the cells that the lamp illuminates.
For each query, check whether the cell at that query has a value > 0.
For each lamp adjacent to the query, find all the cells it illuminates and reduce their count by 1.
This worked fine but failed the size limit when trying a 10000 x 10000 grid.
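For reference, a sketch of that counting grid in MATLAB (my assumptions: lamps is an L-by-2 array of 1-based coordinates, and the per-query decrement for switched-off lamps is omitted):

grid = zeros(N); % this N-by-N allocation is the size limit mentioned above
for t = 1:size(lamps,1)
    lx = lamps(t,1); ly = lamps(t,2);
    grid(lx,:) = grid(lx,:) + 1; % the lamp's row
    grid(:,ly) = grid(:,ly) + 1; % the lamp's column
    for i = 1:N % both diagonals through the lamp
        j1 = ly + (i - lx);
        if j1 >= 1 && j1 <= N, grid(i,j1) = grid(i,j1) + 1; end
        j2 = ly - (i - lx);
        if j2 >= 1 && j2 <= N, grid(i,j2) = grid(i,j2) + 1; end
    end
end
% a query (qx,qy) is illuminated iff grid(qx,qy) > 0; the lamp's own cell
% is counted four times, which doesn't affect the > 0 test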

Algorithm to find positions in a game board i can move to

The problem I have goes as follows (simplified):
I have a board, represented as a matrix of n x m squares (n might equal m)
In it, there are p game pieces
Each game piece has a pre-defined speed, which is how many steps it can take in its turn
Pieces can't overlap
There are three types of cells: those which require no extra movement to be crossed (you lose 0 extra speed when going through), those which require 1 extra movement to be crossed, and some which you simply can't get through (like a wall)
So, given a game piece in a certain [i,j] position on my game board, I want to find out:
a) All the places it can move to, with its speed
b) The path to a certain [k,l] position on the board
Having a) solved, b) is almost trivial.
Currently the algorithm I'm using goes as follows, assuming a language where arrays of size n go from 0 to n-1:
Create a square matrix of size speed*2+1 which represents the cost of moving as if all cells had no extra crossing cost (the piece is at position [speed, speed])
Create another square matrix of size speed*2+1 which holds the extra cost of each cell (cells which can't be crossed, because they're either a wall or occupied by another piece, have a value of infinity) (the piece is at position [speed, speed])
Create another square matrix of size speed*2+1 which is the sum of the former two (the piece is at position [speed, speed])
Correct the latter matrix, making sure the value of each cell is: the minimal cost of all the adjacent cells + 1 + the extra cost of the cell. If it isn't, I correct it and start over with the whole matrix.
An example:
P are pieces, W are walls, E are empty cells which require no extra movement, X are cells which require 1 extra movement to be crossed.
X,E,X,X,X
X,X,X,X,X
W,E,E,E,W
W,E,X,E,W
E,P,P,P,P
The first matrix:
2,2,2,2,2
2,1,1,1,2
2,1,0,1,2
2,1,1,1,2
2,2,2,2,2
The second matrix:
1,0,1,inf,1
1,1,1,1,1
inf,0,0,0,inf
inf,0,1,0,inf
0,inf,inf,inf,inf
The sum:
3,2,3,3,3
3,2,2,2,3
inf,1,0,1,inf
inf,1,2,1,inf
inf,inf,inf,inf,inf
Since [0,0] is not 2+1+1, I correct it:
The sum:
4,2,3,3,3
3,2,2,2,3
inf,1,0,1,inf
inf,1,2,1,inf
inf,inf,inf,inf,inf
Since [0,1] is not 2+1+0, I correct it:
The sum:
4,3,3,3,3
3,2,2,2,3
inf,1,0,1,inf
inf,1,2,1,inf
inf,inf,inf,inf,inf
Since [0,2] is not 2+1+1, I correct it:
The sum:
4,2,4,3,3
3,2,2,2,3
inf,1,0,1,inf
inf,1,2,1,inf
inf,inf,inf,inf,inf
Which one is the correct answer?
What I want to know is whether this problem has a name I can search for (I couldn't find anything), or whether anybody can tell me how to solve point a).
Note that I want the optimal solution, so I went with a dynamic programming algorithm. Might random walkers be better? AFAIK, this solution is not failing (yet), but I have no proof of its correctness, and I want to be sure it works.
A* is a standard algorithm for finding the shortest path on a 2D board given obstacles and a cost per square of moving. You can also use it to test whether a specific move is valid, but to actually generate all valid moves I would simply start at the start position, move one square in each direction, mark which squares are valid, and then repeat from each of the new places, making sure not to visit the same square again. It is a recursive algorithm, calling itself at most 4 times on each call, and it will generate all valid moves efficiently. If there are constraints such as how many squares you can move at once, with different costs, just pass along the running total of how far you've come for each square.
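For the cost model in this question (crossing a cell costs 1 plus the cell's extra cost), the repeated expansion described above amounts to a relaxation loop. A minimal MATLAB sketch (my assumptions: extra is the n-by-m matrix of per-cell extra costs with inf for walls and occupied cells, start = [i j] in 1-based indexing, and speed is the movement budget):

function dist = reachable(extra, start, speed)
% dist(i,j) = cheapest cost to reach cell (i,j); the cells with
% dist <= speed are all the squares the piece can move to (point a).
[n, m] = size(extra);
dist = inf(n, m);
dist(start(1), start(2)) = 0;
frontier = start; % rows are cells waiting to be expanded
while ~isempty(frontier)
    cur = frontier(1,:);
    frontier(1,:) = [];
    for d = [0 1; 0 -1; 1 0; -1 0]' % the four neighbors
        nx = cur(1) + d(1);
        ny = cur(2) + d(2);
        if nx < 1 || nx > n || ny < 1 || ny > m
            continue
        end
        c = dist(cur(1), cur(2)) + 1 + extra(nx, ny);
        if c < dist(nx, ny) && c <= speed
            dist(nx, ny) = c; % cheaper path found, so re-expand from here
            frontier(end+1,:) = [nx ny]; %#ok<AGROW>
        end
    end
end
end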

Algorithm for finding path combinations?

Imagine you have a dancing robot in n-dimensional euclidean space starting at origin P_0 = (0,0,...,0).
The robot can make m types of dance moves D_1, D_2, ..., D_m
D_i is an n-vector of integers (D_i_1, D_i_2, ..., D_i_n)
If the robot makes dance move i then its position changes by D_i:
P_{t+1} = P_t + D_i
The robot can make any of the dance moves as many times as it wants and in any order.
Let a k-dance be defined as a sequence of k dance moves.
Clearly there are m^k possible k-dances.
We are interested to know the set of possible end positions of a k-dance, and for each end position, how many k-dances end at that location.
One way to do this is as follows:
P0 = (0, 0, ..., 0)
S[0][P0] = 1
for I in 1 to k
    for J in 1 to m
        for P in S[I-1]
            S[I][P + D_J] += S[I-1][P]
Now S[k][Q] will tell you how many k-dances end at position Q
Assume that n, m, |D_i| are small (less than 5) and k is less than 40.
Is there a faster way? Can we calculate S[k][Q] "directly" somehow with some sort of linear algebra related trick? or some other approach?
You could create an adjacency matrix that contains the dance-move transitions in your space (the part of it that's reachable in k moves, otherwise it would be infinite). Then the P_0 row of the k-th power of this matrix contains the S[k] values.
The matrix in question quickly gets enormous, with something like (k*(max(D_i_j)-min(D_i_j)))^n positions on a side (every dimension can be halved if Q is close to the origin), but that's true for your S matrix as well.
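A 1-D sketch of the matrix-power idea in MATLAB (toy values; an n-D version only needs a linear indexing of the reachable box). Intermediate positions of a k-dance never leave [k*min(D,0), k*max(D,0)], so clipping at the window border loses nothing:

D = [-1 2]; k = 5; % toy 1-D move set and dance length
lo = k*min([D 0]); hi = k*max([D 0]); % window of reachable positions
N = hi - lo + 1;
A = sparse(N, N); % A(p,q) = number of single moves taking p -> q
for p = 1:N
    for d = D
        q = p + d;
        if q >= 1 && q <= N
            A(p, q) = A(p, q) + 1;
        end
    end
end
origin = 1 - lo; % matrix index of position 0
Sk = full(A^k); % the k-th power counts k-step walks
counts = Sk(origin, :); % counts(j) = number of k-dances ending at position lo+j-1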
Since dance moves are interchangeable, you can assume that for i < j the robot first makes all the D_i moves before the D_j moves, thus reducing the number of combinations to actually calculate.
If you keep track of the number of times each dance move was made, calculating the total number of combinations is easy: it's a multinomial coefficient.
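A sketch of that counting in MATLAB (my assumptions: D is an m-by-n matrix of moves and k the dance length; note the multinomial weights exceed exact double precision well before k = 40, so treat large counts as approximate or switch to big integers):

function S = kDanceCounts(D, k)
% Enumerate how often each move is used (c(1)+...+c(m) = k); each choice
% ends at position c*D and contributes k!/(c(1)!...c(m)!) orderings.
S = containers.Map('KeyType','char','ValueType','double');
S = recurse(S, D, k, 1, zeros(1, size(D,1)));
end

function S = recurse(S, D, left, j, c)
if j == size(D,1)
    c(end) = left; % the last move type takes whatever is left
    key = mat2str(c*D); % end position, used as map key
    w = factorial(sum(c)) / prod(factorial(c)); % multinomial coefficient
    if isKey(S, key), S(key) = S(key) + w; else, S(key) = w; end
    return
end
for i = 0:left
    c(j) = i;
    S = recurse(S, D, left - i, j + 1, c);
end
end

For example, kDanceCounts([1; -1], 2) maps position 0 to 2 (the two orderings of one +1 and one -1 move) and positions 2 and -2 to 1 each.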
Since the 1-dimensional problem is closely related to the subset-sum problem, you could probably take a similar approach: find all of the combinations of dance vectors that add up to the correct first coordinate in exactly k moves; then take that subset of combinations and check which of those have the right sum for the second coordinate, take the subset that matches both and check it for the third, and so on.
This way you at least only have to perform a very simple addition for the extremely painful O(m^k) step. It will indeed find all of the vectors which hit a given value.

Partition a set into k groups with minimum number of moves

You have a set of n objects for which integer positions are given. A group of objects is a set of objects at the same position (not necessarily all the objects at that position: there might be multiple groups at a single position). The objects can be moved to the left or right, and the goal is to move these objects so as to form k groups, and to do so with the minimum distance moved.
For example:
With initial positions at [4,4,7], and k = 3: the minimum cost is 0.
[4,4,7] and k = 2: minimum cost is 0
[1,2,5,7] and k = 2: minimum cost is 1 + 2 = 3
I've been trying to use a greedy approach (by calculating which move would be shortest), but that doesn't work, because every move involves two elements, either of which could be moved either way. I haven't been able to formulate a dynamic programming approach yet, but I'm working on it.
This problem is a one-dimensional instance of the k-medians problem, which can be stated as follows. Given a set of points x_1...x_n, partition these points into k sets S_1...S_k and choose k locations y_1...y_k in a way that minimizes the sum over all x_i of |x_i - y_f(i)|, where y_f(i) is the location corresponding to the set to which x_i is assigned.
Due to the fact that the median is the population minimizer for absolute distance (i.e. L_1 norm), it follows that each location y_j will be the median of the elements x in the corresponding set S_j (hence the name k-medians). Since you are looking at integer values, there is the technicality that if S_j contains an even number of elements, the median might not be an integer, but in such cases choosing either the next integer above or below the median will give the same sum of absolute distances.
The standard heuristic for solving k-medians (and the related and more common k-means problem) is iterative, but this is not guaranteed to produce an optimal or even good solution. Solving the k-medians problem for general metric spaces is NP-hard, and finding efficient approximations for k-medians is an open research problem. Googling "k-medians approximation", for example, will lead to a bunch of papers giving approximation schemes.
http://www.cis.upenn.edu/~sudipto/mypapers/kmedian_jcss.pdf
http://graphics.stanford.edu/courses/cs468-06-winter/Papers/arr-clustering.pdf
In one dimension things become easier, and you can use a dynamic programming approach. A DP solution to the related one-dimensional k-means problem is described in this paper, and the source code in R is available here. See the paper for details, but the idea is essentially the same as what @SajalJain proposed, and can easily be adapted to solve the k-medians problem rather than k-means. For j<=k and m<=n let D(j,m) denote the cost of an optimal j-medians solution to x_1...x_m, where the x_i are assumed to be in sorted order. We have the recurrence
D(j,m) = min over q of ( D(j-1,q) + Cost(x_{q+1},...,x_m) )
where q ranges from j-1 to m-1 and Cost is equal to the sum of absolute distances from the median. With a naive O(n) implementation of Cost, this would yield an O(n^3 k) DP solution to the whole problem. However, this can be improved to O(n^2 k), because Cost can be updated in constant time rather than computed from scratch every time, using the fact that, for a sorted sequence:
Cost(x_1,...,x_h) = Cost(x_2,...,x_h) + median(x_1,...,x_h) - x_1 if h is odd
Cost(x_1,...,x_h) = Cost(x_2,...,x_h) + median(x_2,...,x_h) - x_1 if h is even
See the writeup for more details. Except for the fact that the update of the Cost function is different, the implementation will be the same for k-medians as for k-means.
http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf
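To make the recurrence concrete, here is a minimal MATLAB sketch of the DP (it precomputes all interval costs with the naive O(n) Cost, i.e. the O(n^3 k) variant, for clarity; x must be sorted):

function best = kMediansDP(x, k)
n = numel(x);
cost = zeros(n, n); % cost(a,b): L1 cost of grouping x(a..b) at its median
for a = 1:n
    for b = a:n
        med = x(floor((a + b)/2)); % the lower median also minimizes L1 cost
        cost(a, b) = sum(abs(x(a:b) - med));
    end
end
D = inf(k, n); % D(j,m): optimal j-medians cost for x(1..m)
D(1, :) = cost(1, :);
for j = 2:k
    for m = j:n
        for q = j-1:m-1 % the last group is x(q+1..m)
            D(j, m) = min(D(j, m), D(j-1, q) + cost(q+1, m));
        end
    end
end
best = D(k, n);
end

For the example in the question, kMediansDP([1 2 5 7], 2) returns 3.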
As I understand it, the problem is:
we have n points on a line.
we want to place k positions on the line. I call them destinations.
move each of the n points to one of the k destinations so that the sum of distances is minimum. I call this sum the total cost.
destinations can overlap.
An obvious fact is that for each point we should look for the nearest destination on its left and the nearest destination on its right and choose the nearer of the two.
Another important fact is that all destinations should be on the points, because we can slide a destination along the line to the right or to the left until it reaches a point without increasing the total cost.
By these facts, consider the following DP solution:
DP[i][j] means the minimum total cost needed for the first i points, when we can use only j destinations and have to put a destination on the i-th point.
To calculate DP[i][j], fix the last destination before the i-th point (we have i-1 choices; say it is the q-th point), calculate the distance needed for the points between the q-th point and the i-th point, add this to DP[q][j-1], and find the minimum over all q.
The calculation of the initial states (e.g. j = 1) and of the final answer is left as an exercise!
Task 0 - sort the positions of the objects in non-decreasing order.
Let us define the 'center' as the position an object is shifted to.
Now we have two observations:
For N positions, the 'center' is the position which is nearest to the mean of these N positions. For example, let 1,3,6,10 be the positions. Then the mean is 5, and the nearest position is 6, hence the center for these elements is 6. This gives us the position with minimum cost of moving when all elements need to be grouped into 1 group.
Let N positions be grouped into K groups "optimally". When the (N+1)-th object is added, it will disturb only the K-th group, i.e., the first K-1 groups will remain unchanged.
From these observations, we build a dynamic programming approach.
Let Cost[i][k] and Center[i][k] be two 2D arrays:
Cost[i][k] = minimum cost when the first 'i' objects are partitioned into 'k' groups
Center[i][k] stores the center of the 'i-th' object when Cost[i][k] is computed.
Let L be such that the elements i-L, i-L+1, ..., i-1 all have the same center
(Center[i-L][k] = Center[i-L+1][k] = ... = Center[i-1][k]). These are the only objects that need to be considered in the computation for the i-th element (from observation 2).
Now
Cost[i][k] will be
min(Cost[i-1][k-1], Cost[i-L-1][k-1] + computecost(i-L, i-L+1, ..., i))
Update Center[i-L ... i][k]
computecost() can be found trivially by finding the center (from observation 1).
Time complexity:
Sorting: O(N log N)
Filling the cost matrix: total elements * computecost = O(NK * N)
Total: O(N log N + N^2 K) = O(N^2 K)
Let's look at k=1.
For k=1 and n odd, all points should move to the center point. For k=1 and n even, all points should move to either of the two center points, or to any spot between them. By 'center' I mean in terms of the number of points on either side, i.e. the median.
You can see this because if you select a target spot x with more points to its right than to its left, then a new target one to the right of x results in a cost reduction (unless there is exactly one more point to the right than to the left and the target spot is itself a point, in which case n is even and the target is on/between the two center points).
If your points are already sorted, finding this spot is an O(1) operation. If not, I believe it's O(n) (via an order-statistic algorithm).
Once you've found the spot that all points are moving to, it's O(n) to find the cost.
Thus regardless of whether the points are sorted or not, this is O(n).
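In MATLAB, for instance, the k=1 cost is a one-liner (median averages the two middle points when n is even, and any spot between them attains the same L1 cost):

cost = sum(abs(x - median(x)));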

finding peaks and troughs, Part II (with corresponding definition)

This is an update to a previous question I had about locating peaks and troughs. The previous question was this:
peaks and troughs in MATLAB (but with corresponding definition of a peak and trough)
This time around, I implemented the suggested answer, but I think there is still something wrong with the final algorithm. Can you please tell me what I did wrong in my code? Thanks.
function [vectpeak, vecttrough]=peaktroughmodified(x,cutoff)
% This function is a modified version of the algorithm used to identify
% peaks and troughs in a series of prices. This will be used to identify
% the head and shoulders algorithm. The function gives you two vectors:
% PEAKS - an indicator vector that identifies the peaks in the function,
% and TROUGHS - an indicator vector that identifies the troughs of the
% function. The input is the vector of exchange rate series, and the cutoff
% used for refining possible peaks and troughs.

% Finding all possible peaks and troughs of our vector.
[posspeak,possploc]=findpeaks(x);
[posstrough,posstloc]=findpeaks(-x);
posspeak=posspeak';
posstrough=posstrough';

% Initialize vector of peaks and troughs.
numobs=length(x);
prelimpeaks=zeros(numobs,1);
prelimtroughs=zeros(numobs,1);
numpeaks=numel(possploc);
numtroughs=numel(posstloc);

% Indicator for possible peaks and troughs.
for i=1:numobs
    for j=1:numpeaks
        if i==possploc(j)
            prelimpeaks(i)=1;
        end
    end
end
for i=1:numobs
    for j=1:numtroughs
        if i==posstloc(j)
            prelimtroughs(i)=1;
        end
    end
end

% Vector that gives location.
location=1:1:numobs;
location=location';

% From the list of possible peaks and troughs, find the peaks and troughs
% that fit Chang and Osler [1999] definition.
% "A peak is a local minimum at least x percent higher than the preceding
% trough, and a trough is a local minimum at least x percent lower than the
% preceding peak." [Chang and Osler, p.640]
% cutoffs
peakcutoff=1.0+cutoff; % cutoff for peaks
troughcutoff=1.0-cutoff; % cutoff for troughs

% First peak and first trough are initialized as previous peaks/troughs.
prevpeakloc=possploc(1);
prevtroughloc=posstloc(1);

% Initialize vectors of final peaks and troughs.
vectpeak=zeros(numobs,1);
vecttrough=zeros(numobs,1);

% We first check whether we start looking for peaks and troughs.
for i=1:numobs
    if prelimpeaks(i)==1
        if i>prevtroughloc
            ratio=x(i)/x(prevtroughloc);
            if ratio>peakcutoff
                vectpeak(i)=1;
                prevpeakloc=location(i);
            else
                vectpeak(i)=0;
            end
        end
    elseif prelimtroughs(i)==1
        if i>prevpeakloc
            ratio=x(i)/x(prevpeakloc);
            if ratio<troughcutoff
                vecttrough(i)=1;
                prevtroughloc=location(i);
            else
                vecttrough(i)=0;
            end
        end
    else
        vectpeak(i)=0;
        vecttrough(i)=0;
    end
end
end
I just ran it, and it seems to work if you make this change:
peakcutoff= 1/cutoff; % cutoff for peaks
troughcutoff= cutoff; % cutoff for troughs
I tested it with the following code, with a cutoff of 0.1 (peaks must be 10 times larger than troughs), and it looks reasonable:
x = randn(1,100).^2;
[vectpeak,vecttrough] = peaktroughmodified(x,0.1);
peaks = find(vectpeak);
troughs = find(vecttrough);
plot(1:100,x,peaks,x(peaks),'o',troughs,x(troughs),'o')
I strongly urge you to read up on vectorization in MATLAB. There are many wasted lines in your program; they make it difficult to read and will also make it very slow with big datasets. For instance, prelimpeaks and prelimtroughs can be completely defined without loops, in a single line each:
prelimpeaks(possploc) = 1;
prelimtroughs(posstloc) = 1;
I think there are better techniques for finding peaks and troughs than the percentage-threshold technique given above. One is to fit a least-squares parabola to the data set; a technique for doing this is in the 1946 Frank Peters paper, "Parabolic Correlation, a New Descriptive Statistic." The fitted parabola will have an index of curvature, as Peters defines it. Find peaks and troughs by testing which points, when eliminated, minimize the absolute value of the parabola's index of curvature. Once these points are discovered, test which are peaks and which are troughs by considering how the index of curvature changes when each point is excluded, which will depend on whether the original parabola had a positive or negative index of curvature. If you become concerned about contiguous points whose elimination achieves the minimum absolute curvature, constrain the identified points to lie at least a minimum distance from each other. Another constraint would have to be the number of points identified; without it, this algorithm would remove all but two points, leaving a straight line without curvature.
Sometimes there are steep changes between contiguous points, and both should be included among the extreme points. Perhaps a percentage-threshold test for contiguous points that overrides the minimum-distance constraint would be useful.
Another solution might be to compute the Fast Fourier Transform of the series and remove points that minimize the lower spectra. FFT functions are more readily available than code that finds least-squares-fit parabolas. There is a matrix-manipulation technique for determining the least-squares-fit parabola that is easier to manage than Peters's approach. I saw it documented on the web someplace but lost the link. Advice from anybody able to arrive at a least-squares-fit parabola using matrix-vector notation would be appreciated.
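For the last point, a least-squares parabola in matrix-vector notation is a short MATLAB computation (a sketch; the quadratic coefficient coeff(1) plays the role of a curvature index, though it is not exactly Peters's definition):

t = (1:numel(x))'; % abscissae; x is the price series
V = [t.^2 t ones(size(t))]; % Vandermonde-style design matrix
coeff = V \ x(:); % least-squares solution of V*coeff = x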
