We are given n vectors of dimension m. For each vector, we may rearrange its entries: each dimension's value can be moved to another dimension, and each value is used exactly once. After rearranging all n vectors, we compute the Manhattan distance between each vector and its nearest vector. Over all rearrangement plans, we want the one that minimizes the sum, over all n vectors, of the distance to the nearest vector.
Is it NP-hard?
Unless I'm missing something, the optimal configuration will always be to rearrange each row so that its entries are in ascending order. So the optimal runtime should be O(m n log m), the amount of time it takes to sort n lists of length m.
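A minimal sketch of that claim, assuming numeric vectors (the O(n^2 m) loop at the end only verifies the objective; the rearrangement itself is just the row sort):

def rearrange_and_score(vectors):
    # Sort each vector's entries ascending -- the claimed optimal
    # rearrangement, O(n m log m) overall.
    rows = [sorted(v) for v in vectors]
    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    # Brute-force check of the objective: each vector's Manhattan distance
    # to its nearest other vector, summed over all vectors. O(n^2 m).
    total = sum(min(manhattan(r, s) for j, s in enumerate(rows) if j != i)
                for i, r in enumerate(rows))
    return rows, total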
Given a matrix with integer elements, the problem is to find the maximum-sum submatrix. The problem is stated and solved here using Kadane's algorithm for a 2D matrix.
Now I want to solve this problem in higher dimensions, i.e., given a matrix in d-dimensional space, design an algorithm that solves the same problem.
I wonder if you can do it in O(n^(2d-1)) time.
Any idea is appreciated.
You can compute the sum of a d-dimensional submatrix with 2^d lookups, 2^(d-1) subtractions and 2^(d-1) - 1 additions by using a multi-dimensional summed-area table.
The summed-area table is a matrix with the same dimensionality and size as the input matrix, where each element of the summed-area table is the sum of all elements of the input matrix with indexes equal to or lower than that element's in every dimension. It can be calculated with a single pass over the matrix.
You could then find the maximum-sum submatrix in O(n^(2d)) by iterating, in each dimension, over both the start index and the submatrix size, and computing the submatrix sum for those start indexes and sizes using the summed-area table. Basically you look up all the "corners" of your submatrix in the SAT and add or subtract each value to get the submatrix sum. When d is odd: for each corner, if the corner is the end index of the submatrix range in an odd number of dimensions, you add it; if in an even number, you subtract it. Vice versa when d is even. In 2D (the example from the SAT Wikipedia page), the sum over [x0,x1] × [y0,y1] is I(x1,y1) - I(x0-1,y1) - I(x1,y0-1) + I(x0-1,y0-1).
The submatrix with the highest total is the maximum sum submatrix.
Using Kadane's algorithm could collapse the inner two loops (the start index and submatrix size of one of the dimensions) into one, making it O(n^(2d-1)).
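For concreteness, a minimal 2D sketch of the summed-area table and the four-corner lookup (function names are mine; in d dimensions the four lookups become 2^d signed corner lookups):

def summed_area_table(matrix):
    # sat[i][j] = sum of matrix[0..i-1][0..j-1]; built in one pass.
    rows, cols = len(matrix), len(matrix[0])
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            sat[i + 1][j + 1] = (matrix[i][j] + sat[i][j + 1]
                                 + sat[i + 1][j] - sat[i][j])
    return sat

def box_sum(sat, r0, c0, r1, c1):
    # Sum of matrix[r0..r1][c0..c1] via inclusion-exclusion on 4 corners.
    return (sat[r1 + 1][c1 + 1] - sat[r0][c1 + 1]
            - sat[r1 + 1][c0] + sat[r0][c0])

A full maximum-sum search would then loop box_sum over all start indexes and sizes: O(n^4) in 2D, matching the O(n^(2d)) bound above.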
I have an array A of n points, and a candidate array S of size O(k) (larger than k). I want to find k points in S such that the sum of squared distances from the points of A to their closest point among the k chosen points is minimized. One way to do it would be to check the cost of every possible choice of k points from S and take the minimum, but that would take O(k^k * n) time. Is there a more efficient way to do it?
I need either an optimal solution or a constant approximation.
The reason I need this is that I'm trying to find a constant approximation for k-means as fast as possible and later use this for a coreset construction (coreset = data minimization while still keeping the cost of any query approximately the same). I was able to show that if we assume that in the optimal clustering each cluster has Ω(n/k) points, we can quite quickly create a list of O(k) candidates that contains a 3-approximation for the k-means, so I was wondering if we can find those k points, or a constant approximation of their cost, in time faster than exhaustive search.
Example for k=2
In this example S is the green dots and A is the red dots. The algorithm should return the 2 circled points from S since they minimize the sum of squared distances from the points of A to their closest point of the 2.
I have an array A of n points, and a candidate array S of size O(k) (larger than k). I want to find k points in S such that the sum of squared distances from the points of A to their closest point among the k chosen points is minimized.
It sounds like this could be solved simply by checking the N points against the K points to find the k points in N with the smallest squared distance.
Therefore, I'm now fairly sure this is actually finding the k-nearest neighbors (K-NN as a computational geometry problem, not the pattern recognition definition) in the N points for each point in the K points and not actually k-means.
For higher-dimensional data, it is often useful to also account for the dimensionality D in the algorithm.
The algorithm mentioned is indeed O(NDk^2) when considering K-NN instead. That can be improved to O(NDk) by using the Quickselect algorithm on the distances, which allows checking the list of N points against each of the K points in O(N) to find the nearest k points.
https://en.wikipedia.org/wiki/Quickselect
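For reference, here is a small sketch of the quickselect-based selection described above (the function name is mine; Hoare partitioning on squared distances, average O(N) per center):

import random

def nearest_k(points, center, k):
    # Pair each point with its squared distance to center, then quickselect
    # so the k smallest distances occupy the first k slots: average O(N).
    dists = [(sum((a - b) ** 2 for a, b in zip(p, center)), p) for p in points]
    lo, hi = 0, len(dists) - 1
    while lo < hi:
        pivot = dists[random.randint(lo, hi)][0]
        i, j = lo, hi
        while i <= j:
            while dists[i][0] < pivot:
                i += 1
            while dists[j][0] > pivot:
                j -= 1
            if i <= j:
                dists[i], dists[j] = dists[j], dists[i]
                i, j = i + 1, j - 1
        if k - 1 <= j:
            hi = j          # k-th smallest is in the left part
        elif k - 1 >= i:
            lo = i          # k-th smallest is in the right part
        else:
            break           # it lies in the middle band of pivot-equal values
    return [p for _, p in dists[:k]]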
Edit:
There seems to be some confusion about quickselect and whether it can be used. Here is an O(DkN log N) solution that uses a standard sort, O(N log N), instead of quickselect, O(N). The sort version might even be faster in practice, and as you can see it's pretty easy to implement in most languages.
def nearest_k_per_candidate(F, S, k):  # wrapper added so the final return is valid
    results = {}
    for y in F:
        def distanceSquared(x):
            # Squared Euclidean distance to y (stands in for the custom distance)
            return sum((xi - yi) ** 2 for xi, yi in zip(x, y))
        # First k points of S sorted by distanceSquared
        results[y] = sorted(S, key=distanceSquared)[:k]
    return results
Update for new visual
def top_k_by_distance_sum(F, S, k):
    # Build up distance sums, O(A*N*D)
    results = {}
    for y in F:
        def distanceSquared(x):
            # Squared Euclidean distance to y (stands in for the custom distance)
            return sum((xi - yi) ** 2 for xi, yi in zip(x, y))
        # Sum of squared distances from y to all points in S
        results[y] = sum(map(distanceSquared, S))
    # First k candidates sorted by their distance sums, O(D*A log A)
    return sorted(results, key=results.get)[:k]
You could approximate by only considering Z random points chosen from the S points. Alternatively, you could merge points in S that are close enough together. This can reduce S to a much smaller size; as long as S remains roughly F^2 or larger in size, it shouldn't affect which points in F are chosen too much. You would also need to adjust the weights of the points to handle this, i.e., the squared distance of a point that represents 10 merged points is multiplied by 10 to account for it acting as 10 points instead of just 1.
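A rough sketch of the merging idea, assuming 2D tuples (grid-snapping is one arbitrary choice of "close enough"; the returned count is the weight to multiply into each squared distance):

from collections import defaultdict

def merge_close_points(S, cell):
    # Snap each point to a grid cell of the given size; points sharing a
    # cell are merged into their centroid with weight = number of points.
    buckets = defaultdict(list)
    for p in S:
        buckets[tuple(int(c // cell) for c in p)].append(p)
    merged = []
    for pts in buckets.values():
        centroid = tuple(sum(cs) / len(pts) for cs in zip(*pts))
        merged.append((centroid, len(pts)))
    return merged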
My problem is: we have N points in a 2D space, each with a positive weight. Given a query consisting of two real numbers a, b and one integer k, find the position of an a x b rectangle, with edges parallel to the axes, such that the sum of weights of the top-k points (i.e., the k points with the highest weights) covered by the rectangle is maximized.
Any suggestion is appreciated.
P.S.:
There are two related problems, which are already well-studied:
Maximum region sum: find the rectangle with the highest total weight. Complexity: O(N log N).
Top-k query for orthogonal ranges: find the top-k points in a given rectangle. Complexity: O(log(N)^2 + k).
You can reduce this problem to choosing two points that touch the rectangle's boundary: the rightmost and the topmost. So effectively you can select every pair of points and calculate the top-k weight (which, according to you, takes O(log(N)^2 + k)). Complexity: O(N^2 * (log(N)^2 + k)).
Now, a given pair of points might not form a valid pair: they might be too far apart, or one point may be both to the right of and above the other. So, in practice, this will be much faster.
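A naive stand-in for this enumeration, assuming points are (x, y, weight) triples; heapq.nlargest plays the role of the O(log(N)^2 + k) top-k range query:

import heapq

def best_rectangle_score(points, a, b, k):
    # Try every candidate right edge rx and top edge ty taken from the
    # input coordinates: O(N^2) placements, each scanned naively in O(N).
    best = 0
    for rx, _, _ in points:
        for _, ty, _ in points:
            inside = [w for x, y, w in points
                      if rx - a <= x <= rx and ty - b <= y <= ty]
            best = max(best, sum(heapq.nlargest(k, inside)))
    return best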
My guess is the optimal solution will be a variation of the maximum region sum problem. Could you point to a link describing that algorithm?
A non-optimal answer is the following:
Generate all the possible k-plets of points (there are N × (N-1) × … × (N-k+1) of them, so this is O(N^k), and it can be done via recursion).
Filter this list down by eliminating all k-plets which are not enclosed in an a×b rectangle: this is O(k N^k) at worst.
Find the k-plet which has the maximum weight: this is O(k N^(k-1)) at worst.
Thus, this algorithm is O(k N^k).
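A sketch of steps 1-3 combined, assuming (x, y, weight) points; itertools.combinations replaces the ordered k-plet recursion, since the order of the k points doesn't matter:

from itertools import combinations

def best_kplet(points, a, b, k):
    # Enumerate all k-subsets, keep those fitting in an a x b box,
    # and return the heaviest one. O(k N^k) overall.
    best_w, best_set = 0, None
    for combo in combinations(points, k):
        xs = [x for x, _, _ in combo]
        ys = [y for _, y, _ in combo]
        if max(xs) - min(xs) <= a and max(ys) - min(ys) <= b:
            w = sum(w for _, _, w in combo)
            if w > best_w:
                best_w, best_set = w, combo
    return best_set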
Improving the algorithm
Step 2 can be integrated into step 1 by stopping the branch recursion when a set of points is already too spread out. This does not change the need to scan the elements at least once, but it can reduce the number significantly: think of cases where there are no solutions because all points are separated by more than the size of the rectangle; that can be detected in O(N^2).
Also, the permutation generator in step 1 can be made to return the points in order of x or y coordinate, by pre-sorting the point array correspondingly. This is useful because it lets us discard a bunch more possibilities up front. Suppose the array is sorted by y coordinate, so the k-plets returned will be ordered by y coordinate. Now, if we are discarding a branch because it contains a point whose y coordinate falls outside the maximum rectangle, we can also discard all the following sibling branches, because their y coordinates will be greater than or equal to the current one, which is already out of bounds.
This adds O(n log n) for the sort, but the improvement can be quite significant in many cases -- again, when there are many outliers. The coordinate to sort by should be the one minimizing the rectangle side divided by the corresponding side of the 2D field -- by which I mean the maximum coordinate minus the minimum coordinate over all points.
Finally, if all the points lie within an a×b rectangle, the algorithm still performs as O(k N^k). If this is a concrete possibility, it should be checked first -- an easy O(N) loop -- and if it holds, it's enough to return the k points with the top weights, which is also O(N).
If I have a set of k vectors of n dimensions, how can I sort these such that the distance between each consecutive pair of vectors is the minimal possible? The distance can be calculated using the Euclidean distance, but how is the "sorting" then implemented in an efficient manner?
I'm thinking one approach would be to select a vector at random, calculate the distances to all other vectors, pick the vector that minimizes the distance as the next vector, and repeat until all vectors have been "sorted". However, this greedy search would probably yield different results depending on which vector I start with.
Any ideas on how to do this?
If you really just want 'the distance between each consecutive pair of vectors to be minimal' without randomness, you can first find the 2 closest points (with an O(n log n) algorithm like this), say p and q; then find the closest point to p (say r) and to q (say s); then compare the distances (p,r) and (q,s). If the first is smaller, start with q, p, r and continue with your greedy algorithm (otherwise, obviously, start with p, q, s).
However, if your goal is actually to arrange the points so that the sum of all consecutive distances is smallest, you should use any approximate solution to the Travelling Salesman Problem. Note this trick in order to reduce your task to TSP.
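Here is a minimal sketch of that greedy ordering (the function name is mine; it seeds with the closest pair, found here by brute force rather than the O(n log n) algorithm, and always extends from the tail, omitting the (p,r)/(q,s) direction check described above):

import math

def greedy_order(vectors):
    remaining = [tuple(v) for v in vectors]
    # Closest pair by brute force, O(n^2).
    p, q = min(((u, v) for i, u in enumerate(remaining)
                for v in remaining[i + 1:]), key=lambda e: math.dist(*e))
    order = [p, q]
    remaining.remove(p)
    remaining.remove(q)
    # Repeatedly append the unused vector nearest to the current endpoint
    # (nearest-neighbour TSP heuristic; no optimality guarantee).
    while remaining:
        nxt = min(remaining, key=lambda v: math.dist(order[-1], v))
        order.append(nxt)
        remaining.remove(nxt)
    return order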
Let's say we have n empty boxes in a row. We are going to put m groups of coins into some consecutive boxes; the groups are known in advance. We put the 1st group of coins into boxes i_1 through j_1, the 2nd group into boxes i_2 through j_2, and so on.
Let c_i be the number of coins in box i after all the coins have been put into the boxes. We want to be able to quickly determine how many coins there are in the boxes with indexes i = s, s+1, ..., e-1, e, i.e., we want to compute the sum
c_s +c_(s+1) + ... + c_e
efficiently. This can be done using a Fenwick tree. Without any improvements, a Fenwick tree needs O(n) space for storing the c_i's (in a table; actually tree[i] != c_i, the values are stored in a smarter way) and O(log n) time for computing the above sum.
If we have the case where
n is too big for us to make a table of length n (let's say ~ 10 000 000 000)
m is sufficiently small (let's say ~ 500 000)
there is a way to somehow compress the coordinates (indexes) of the boxes, i.e., it suffices to store just the boxes with indexes i_1, i_2, ..., i_m. Since the value stored in tree[i] depends on the binary representation of i, my idea is to sort the indexes i_1, j_1, i_2, j_2, ..., i_m, j_m and build a tree of length O(m). Adding a new value to the tree would then be straightforward. Also, to compute the sum above, we only have to find the largest stored index that is not greater than e and the smallest that is not smaller than s. Both can be done with binary search. After that the sum can easily be computed.
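A minimal sketch of that compression, combined with the standard two-tree trick for range update + range sum (class and method names are mine; coords must contain every i_t and every j_t + 1 that will be updated):

from bisect import bisect_left, bisect_right

class CompressedFenwick:
    def __init__(self, coords):
        # Store only the coordinates that actually occur: O(m) memory.
        self.xs = sorted(set(coords))
        self.b1 = [0] * (len(self.xs) + 1)
        self.b2 = [0] * (len(self.xs) + 1)

    def _add(self, tree, x, v):
        i = bisect_left(self.xs, x) + 1   # x must be one of the known coords
        while i < len(tree):
            tree[i] += v
            i += i & -i

    def _pref(self, tree, x):
        i = bisect_right(self.xs, x)      # number of stored coords <= x
        s = 0
        while i > 0:
            s += tree[i]
            i -= i & -i
        return s

    def range_add(self, l, r, v):
        # Put v coins into every box in [l, r].
        self._add(self.b1, l, v)
        self._add(self.b1, r + 1, -v)
        self._add(self.b2, l, v * (l - 1))
        self._add(self.b2, r + 1, -v * r)

    def range_sum(self, s, e):
        # c_s + ... + c_e; works at arbitrary (even unstored) indexes.
        def pref(x):
            return x * self._pref(self.b1, x) - self._pref(self.b2, x)
        return pref(e) - pref(s - 1)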
The problem arises in the 2D case. Now we have an area of points (x,y) in the plane, 0 < x,y < n. There are m rectangles in that area. We know the coordinates of their bottom-left and top-right corners, and we want to compute how many rectangles contain a given point (a,b). The simplest (and my only) idea is to follow the approach from the 1D case: for each corner coordinate x_i, store all the corner coordinates y_i. The idea is not so clever, since it needs O(m^2) space -- too much. My question is:
How to store coordinates in the tree in a more efficient way?
Solutions of the problem that use Fenwick trees are preferred, but every solution is welcome!
The easiest approach is to use a map/unordered_map instead of a 2D array. In that case you don't even need coordinate compression. The map creates a key-value pair only when it is needed, so it creates O(log^2(n)) key-value pairs for each input point.
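A sketch of that idea: a dict-backed 2D Fenwick tree (names are mine) creating nodes on demand, so each point update touches O(log^2(n)) keys and no compression is needed. For the rectangle-stabbing question above, one would add +1 at (x1,y1) and (x2+1,y2+1) and -1 at (x1,y2+1) and (x2+1,y1) per rectangle; prefix(a,b) then counts the rectangles containing (a,b).

from collections import defaultdict

class SparseFenwick2D:
    def __init__(self, n):
        self.n = n
        self.tree = defaultdict(int)   # key-value pairs created lazily

    def add(self, x, y, v):
        # Point update at (x, y): touches O(log^2 n) nodes.
        i = x
        while i <= self.n:
            j = y
            while j <= self.n:
                self.tree[(i, j)] += v
                j += j & -j
            i += i & -i

    def prefix(self, x, y):
        # Sum of all updates at positions (<= x, <= y).
        s, i = 0, x
        while i > 0:
            j = y
            while j > 0:
                s += self.tree[(i, j)]
                j -= j & -j
            i -= i & -i
        return s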
Also, you could use a segment tree based on pointers (instead of arrays) with lazy initialisation (create a node only when it is needed).
Use a 2D segment tree. Notice that for each canonical segment in the y-coordinate you can build a (1D) segment tree over x-coordinates, covering only the points lying in the zone y_min <= y < y_max, where y_min and y_max are the bounds of that canonical y-segment. This implies that each input point lies in only log(n) of the x-coordinate segment trees, which makes O(n log n) memory in total.
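To illustrate the memory argument, here is a sketch that substitutes sorted x-lists for the inner x segment trees (a merge sort tree; names are mine): each point's x appears in one node per level, i.e. O(log n) times, giving O(n log n) memory. It counts the points inside a query rectangle.

from bisect import bisect_left, bisect_right

def build(points):
    # Leaves are the points sorted by y; each internal node stores the
    # sorted x's of its y-range.
    pts = sorted(points, key=lambda p: p[1])
    ys = [y for _, y in pts]
    size = 1
    while size < len(pts):
        size *= 2
    tree = [[] for _ in range(2 * size)]
    for i, (x, _) in enumerate(pts):
        tree[size + i] = [x]
    for node in range(size - 1, 0, -1):
        tree[node] = sorted(tree[2 * node] + tree[2 * node + 1])
    return tree, ys, size

def count_in_rect(tree, ys, size, x1, x2, y1, y2):
    # Standard iterative segment-tree walk over the y-range [y1, y2];
    # each visited node is binary-searched for x in [x1, x2].
    lo = bisect_left(ys, y1) + size
    hi = bisect_right(ys, y2) + size   # half-open [lo, hi)
    total = 0
    while lo < hi:
        if lo & 1:
            total += bisect_right(tree[lo], x2) - bisect_left(tree[lo], x1)
            lo += 1
        if hi & 1:
            hi -= 1
            total += bisect_right(tree[hi], x2) - bisect_left(tree[hi], x1)
        lo //= 2
        hi //= 2
    return total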