Searching the k nearest elements [duplicate] - algorithm

I have a bunch (more or less 3500) of vectors with 4096 components each, and I need a fast method to find, given another vector of the same length as input, which are the nearest N.
I would like to use some MATLAB function to do that. Is this suitable for what I need?
https://uk.mathworks.com/help/stats/classificationknn-class.html

What you link to is a k-nearest-neighbor classifier, which predicts class labels rather than returning the nearest vectors themselves. Not sure this is what you want. If you simply want the N vectors at minimum distance from your input, you can do it manually easily enough. Something like:
distances = matrixOfvectors - yourVector; % implicit expansion; use repmat(yourVector, ...) on older MATLAB.
[val, pos] = sort(sum(distances.^2, 2)); % squared Euclidean distances; sum along dim 1 instead of 2 if vectors are columns.
minVectors = pos(1:N); % indices of the N nearest vectors.
If N is small, say 3 or less, it would be slightly faster to avoid the full sort and keep a running list of the N smallest distances, comparing each new distance against the current worst of them. (If you have the Statistics and Machine Learning Toolbox, something like knnsearch(matrixOfvectors, yourVector, 'K', N), with vectors stored as rows, does all of this directly.)

Related

Is it better to reduce the space complexity or the time complexity for a given program?

Grid Illumination: Given an NxN grid and an array of lamp coordinates, each lamp illuminates every square in its row, every square in its column, and every square on its diagonals (think of a Queen in chess). Given an array of query coordinates, determine whether each point is illuminated or not. The catch is that when checking a query, all lamps adjacent to, or on, that square get turned off. The ranges for the variables/arrays were about: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9
It seems like I can get one but not both. I tried to get the time down to logarithmic but I can't seem to find a solution. I can reduce the space complexity, but then it's not fast; exponential, in fact. Where should I focus, speed or space? Also, if you have any input as to how you would solve this problem, please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note you can number all the diagonals that the inputs lie on by using the first point as the "origin" for both the nw-se and ne-sw directions. The two diagonals through this point are both numbered zero. The nw-se diagonal numbers increase per cell toward the northeast and decrease (going negative) toward the southwest. Similarly, the ne-sw diagonals are numbered increasing toward the northwest and decreasing (negative) toward the southeast.
Given the origin, it's easy to write constant-time functions that map (x,y) coordinates to the respective diagonal numbers.
Now each set of lamp coordinates is naturally associated with 4 numbers: (x, y, nw-se diag #, sw-ne diag #). You don't need to store these explicitly. Rather you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal, and similarly for the other maps.
Given a query point, look up its corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list is longer than 3, removing adjacent squares can't make it empty (at most 3 cells of any one line are adjacent to or on the query), so the query point is lit. If a list holds 3 or fewer lamps, it's a constant-time operation to check whether they're all adjacent.
This solution requires the input points to be represented in 4 lists. Since they already need to be represented in one list, you can argue that this algorithm requires only a constant factor of extra space with respect to the input. (I.e. the same sort of cost as mergesort.)
Run time is expected constant per query point for 4 hash table lookups.
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient, and easiest, to run it on one big machine. With a billion lampposts and careful data structure choices, it wouldn't be hard to implement with 24 bytes per lamppost in an unboxed-structures language like C. So a machine with ~32 GB of RAM ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10-core machine ought to do a billion queries in well less than a minute.
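Here is a minimal Python sketch of the four-map proposal. All names are mine, and it assumes, as the read-only-queries remark above does, that turning lamps off does not persist across queries:

from collections import defaultdict

def build_maps(lamps):
    # Index every lamp by its row, its column, and its two diagonal numbers.
    # x + y is constant along one diagonal and x - y along the other, so those
    # values serve directly as diagonal numbers (the choice of origin only
    # shifts them by a constant).
    x_map, y_map = defaultdict(list), defaultdict(list)
    nw_se_map, sw_ne_map = defaultdict(list), defaultdict(list)
    for (x, y) in lamps:
        x_map[x].append((x, y))
        y_map[y].append((x, y))
        nw_se_map[x + y].append((x, y))
        sw_ne_map[x - y].append((x, y))
    return x_map, y_map, nw_se_map, sw_ne_map

def is_lit(qx, qy, maps):
    x_map, y_map, nw_se_map, sw_ne_map = maps
    lines = (x_map[qx], y_map[qy], nw_se_map[qx + qy], sw_ne_map[qx - qy])
    for lamps_on_line in lines:
        if len(lamps_on_line) > 3:
            return True  # at most 3 cells of one line are adjacent to or on the query
        if any(abs(lx - qx) > 1 or abs(ly - qy) > 1 for (lx, ly) in lamps_on_line):
            return True  # a lamp on this line survives the adjacent turn-off
    return False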
There is a very easy answer which works:
Create a grid of NxN.
For each lamp, increment the count of every cell that lamp illuminates.
For each query, check whether the cell of that query has a value > 0.
Then, for each lamp adjacent to (or on) the query, find all the cells it illuminates and reduce their counts by 1.
This worked fine but failed the size limit when trying a 10000 x 10000 grid.

Compare two lists of distances (floats) in python

I'm looking to compare two lists of distances (floats) in python. The distances represent how far away my robot is from a wall at different angles. One array is my "best guess" distance array and the other is the array of actual distances. I need to return a number between [0, 1] that represents the similarity between these two lists of floats. The distances match up 1 to 1. That is, the distance at index 0 should be compared to the distance at index 0 in the other array. Right now, for each index, I am dividing the smaller number by the larger number to get a percentage difference. Then I am taking the average of these percentage differences (total percentage difference / number of entries in the array) to get a number between 0 and 1. However, my approach does not seem to be accurate enough. Is there a better algorithm for comparing two ordered lists of floats?
It looks like you need a normalized Euclidean distance between the two vectors. It is simple to calculate.
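As a minimal sketch (the function name and the 1/(1 + d) squashing are my choices; any monotone mapping of the distance into [0, 1] would do):

import math

def similarity(guess, actual):
    # Root-mean-square difference between paired distances.
    assert len(guess) == len(actual)
    rms = math.sqrt(sum((g - a) ** 2 for g, a in zip(guess, actual)) / len(guess))
    # Map distance 0 to similarity 1.0, and large distances toward 0.
    return 1.0 / (1.0 + rms)

print(similarity([1.0, 2.0, 3.0], [1.1, 2.0, 2.9]))  # close lists -> near 1.0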

Algorithm for finding all combinations of (x,y,z) that satisfy x+y = z+j, where x,y,z,j are integers between -N...N inclusive

I'm working on a problem that requires an array (dA[j], j=-N..N) to be calculated from the values of another array (A[i], i=-N..N) based on a conservation-of-momentum rule (x+y=z+j). This means that for a given index j, I calculate A[x]*A[y]*A[z] for all the valid combinations of (x,y,z); dA[j] is equal to the sum of these values.
I'm currently precomputing the valid indices for each dA[j] by looping over x=-N...+N and y=-N...+N, calculating z=x+y-j, and storing the indices if abs(z) <= N.
Is there a more efficient method of computing this?
The reason I ask is that in the future I'd also like to be able to efficiently find, for each dA[j], all the terms that contain a specific A[i]. Essentially, to be able to compute the Jacobian of dA[j] with respect to A[i].
Update
For the sake of completeness, I figured out a way of doing this without any if statements: if you parametrize the equation x+y=z+j, with j a constant, you get the equation of a plane. The constraint that x, y, z need to be integers between -N..N creates boundaries on this plane. The points that define this boundary are functions of N and j. So all you have to do is loop over your parametrized variables (s,t) within these boundaries and you'll generate all the valid points by using the vectors defined by the plane (s*u + t*v + j*w).
For example, if you choose u=[1,0,1], v=[0,1,1], and w=[0,0,-1], then (x,y,z) = (s,t,s+t-j), and for every value of j the valid solutions are bounded by a 6-sided polygon with vertices (-N,N), (j,N), (N,j), (N,-N), (j,-N), and (-N,j).
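A minimal Python sketch of this idea (names are mine; with this choice of u and v, s and t are just x and y, and bounding t directly replaces the abs(z) <= N test):

def dA(A, N, j):
    # A is a dict keyed by integers -N..N; returns the sum of A[x]*A[y]*A[z]
    # over all x + y = z + j with x, y, z in [-N, N].
    total = 0
    for s in range(-N, N + 1):          # s plays the role of x
        # z = s + t - j must lie in [-N, N], so t is bounded accordingly.
        lo = max(-N, j - s - N)
        hi = min(N, j - s + N)
        for t in range(lo, hi + 1):     # t plays the role of y
            total += A[s] * A[t] * A[s + t - j]
    return total

N = 2
A = {i: float(i + 3) for i in range(-N, N + 1)}  # toy data
print(dA(A, N, 0))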
So for each j, you go through all (2N+1)^2 combinations of x and y to find the ones such that x+y = z+j; the running time of your application (per j) is O(N^2). I don't think your current idea is bad (and after playing with some pseudocode for this, I couldn't improve it significantly). I would like to note that once you've picked a j and a z, there are at most 2N+1 choices of x (and y is then determined). So overall, the best algorithm would still complete in O(N^2).
But consider the following improvement by a factor of 2 (for the overall program, not per j): if z+j = x+y, then (-z)+(-j) = (-x)+(-y) also.

How to apply dynamic programming in finding the minimum cost to construct the tower in a field

You are given an N x M rectangular field with its bottom-left point at the origin. You have to construct a tower with a square base in the field. There are trees in the field, each with an associated cost to uproot it, so you have to choose where to place the tower so as to minimize the total cost of the trees uprooted.
Example Input:
N = 4
M = 3
Length of side of Tower = 1
Number of Trees in the field = 4
1 3 5
3 3 4
2 2 1
2 1 2
The 4 rows in the input are the coordinates of the trees, with the cost of uprooting as the third integer.
A tree coinciding with the edge of the tower is considered to be inside the tower and has to be uprooted as well.
I'm having trouble formulating the dynamic programming relation for this problem.
Thanks.
It sounds like your problem boils down to: find the KxK subblock of an MxN matrix with the smallest sum. You can solve this problem efficiently (proportional to the size of your input) by using an integral transform. Of course, this doesn't necessarily help you with your dynamic programming issue -- I'm not sure this solution is equivalent to any dynamic programming formulation....
At any rate, for each index pair (a,b) of your original matrix M, compute an "integral transform" matrix I[a,b] = sum[i<=a, j<=b](M[i,j]). This is computable by traversing the matrix in order, referring to the value computed from the previous row/column. (with a bit of thought, you can also do this efficiently with a sparse matrix)
Then, you can compute the sum of any subblock (a1..a2, b1..b2) in constant time as I[a2,b2] - I[a1-1,b2] - I[a2,b1-1] + I[a1-1,b1-1]. Iterating through all KxK subblocks to find the smallest sum will then take time proportional to the size of your original matrix also.
Since the original problem is phrased as a list of integral coordinates (and, presumably, expects the tower location to be output as an integral coordinate pair), you likely do need to represent your field as a sparse matrix for an efficient solution -- this involves sorting your trees' coordinates in lexicographic order (e.g. first by x-coordinate, then by y-coordinate). Note that this sorting step may take O(L log L) for input of size L, dominating the following steps, which take only O(L) in the size of the input.
Also note that, due to the problem specifying that "trees coinciding with the edge of the tower are uprooted...", a tower with edge length K actually corresponds to a (K+1)x(K+1) subblock.
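Here is a minimal dense-grid Python sketch of the integral transform (the sparse representation recommended above for integral coordinates is omitted for brevity; per the note above, pass K+1 for a tower of edge length K):

def min_subblock_sum(M, K):
    # M is a dense 2D list of uprooting costs; returns the minimum total
    # cost over all KxK subblocks.
    rows, cols = len(M), len(M[0])
    # I[a][b] = sum of M[i][j] for i < a and j < b; the extra row/column of
    # zeros avoids special-casing a-1 and b-1 at the borders.
    I = [[0] * (cols + 1) for _ in range(rows + 1)]
    for a in range(rows):
        for b in range(cols):
            I[a + 1][b + 1] = M[a][b] + I[a][b + 1] + I[a + 1][b] - I[a][b]
    best = None
    for a in range(rows - K + 1):
        for b in range(cols - K + 1):
            # Constant-time subblock sum from the integral image.
            s = I[a + K][b + K] - I[a][b + K] - I[a + K][b] + I[a][b]
            if best is None or s < best:
                best = s
    return best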

permuting the rows and columns of a matrix for clustering [closed]

I have a distance matrix that is 1000x1000 in dimension and symmetric, with 0s along the diagonal. I want to form groupings of distances (clusters) by simultaneously reordering the rows and columns of the matrix. This is like reordering a matrix before you visualize its clusters with a heatmap. I feel like this should be an easy problem, but I am not having much luck finding code online that does the permutations. Can anyone help?
Here is one approach that came to mind:
1. "Sparsify" the matrix so that only "sufficiently close" neighbors have a nonzero value in the matrix.
2. Use a Cuthill-McKee algorithm to compress the bandwidth of the sparse matrix.
3. Do a symmetric reordering of the original matrix using the results from Step 2.
Example
I will use Octave (everything I am doing should also work in Matlab) since it has a Reverse Cuthill-McKee (RCM) implementation built in.
First, we need to generate a distance matrix. This function creates a random set of points and their distance matrix:
function [x, y, A] = make_rand_dist_matrix(n)
    x = rand(n, 1);
    y = rand(n, 1);
    % Pairwise Euclidean distances between the n random points.
    A = sqrt((repmat(x, 1, n) - repmat(x', n, 1)).^2 + ...
             (repmat(y, 1, n) - repmat(y', n, 1)).^2);
end
Let's use that to generate and visualize a 100-point example.
[x, y, A] = make_rand_dist_matrix(100);
surf(A);
Viewing the surface plot from above shows the distances as an image (yours will be different, of course). Warm colors represent greater distances than cool colors. Row (or column, if you prefer) i in the matrix contains the distances between point i and all points. The distance between point i and point j is in entry A(i, j). Our goal is to reorder the matrix so that the row corresponding to point i is near the rows corresponding to points a short distance from i.
A simple way to sparsify A is to make all entries greater than some threshold zero, and that is what is done below, although more sophisticated approaches may prove more effective.
B = A < 0.2; % sparsify A -- only values less than 0.2 are nonzeros in B
p = symrcm(B); % compute reordering by Reverse Cuthill-McKee
surf(A(p, p)); % visualize reordered distance matrix
The matrix is now ordered in a way that brings nearby points closer together in the matrix. This result is not optimal, of course. Sparse matrix bandwidth compression is computed using heuristics, and RCM is a very simple approach. As I mentioned above, more sophisticated approaches for producing the sparse matrix may give better results, and different algorithms may also yield better results for the problem.
Just for Fun
Another way to look at what happened is to plot the points and connect each pair of points whose corresponding rows in the matrix are adjacent. The goal is for the connecting lines to join pairs of points that are near each other. For a more dramatic effect, we use a larger set of points than above.
[x, y, A] = make_rand_dist_matrix(2000);
plot(x, y); % plot the points in their initial, random order
Clearly, connections are all over the place and are occurring over a wide variety of distances.
B = A < 0.2; % sparsify A
p = symrcm(B);
plot(x(p), y(p)) % plot the reordered points
After reordering, the connections tend to be over much smaller distances and much more orderly.
Two Matlab functions do this: symrcm and symamd.
Note that there is no unique solution to this problem. Clustering is another approach.
