Minimum sum of squared Euclidean distance between two arrays - algorithm

Question:
Given two sorted sequences in increasing order, X and Y.
Y is of size k and X is of size m.
I would like to take a subset X' of X of size k and consider the following optimization problem:
d(Y, X') = sum over j in [1, k] of (Y[j] - X'[j])^2
where X' is a subset of X of size k,
and Y[j] and X'[j] are the j-th elements of Y and X'.
I want to find the subset X' of X that minimizes d(Y, X').
d(Y, X') is the sum of the squared distances between corresponding elements of Y and X'.
Note that a priori X' could be arranged in any of k! orders, so the pairing between Y and X' is not obvious.
I want to utilize DP approach to solve this problem.
My thought so far:
I plan to go over all the elements in Y and compute their squared distance to each element in X, but I'm totally at a loss as to what to do next.
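Since both sequences are sorted, an exchange argument shows that some optimal X' keeps its elements in increasing order, so you can match prefixes of X against prefixes of Y. A minimal DP sketch along those lines (the function name is mine):

```python
def min_sq_dist(X, Y):
    """Minimum of d(Y, X') over size-k subsets X' of X; both inputs sorted ascending.

    dp[i][j] = best cost of matching the first j elements of Y
               against a subset of the first i elements of X.
    Runs in O(m*k) time and space.
    """
    m, k = len(X), len(Y)
    INF = float('inf')
    dp = [[INF] * (k + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = 0.0                      # matching nothing costs nothing
    for i in range(1, m + 1):
        for j in range(1, min(i, k) + 1):
            skip = dp[i - 1][j]             # X[i-1] is not used in X'
            take = dp[i - 1][j - 1] + (Y[j - 1] - X[i - 1]) ** 2  # pair X[i-1] with Y[j-1]
            dp[i][j] = min(skip, take)
    return dp[m][k]
```

For example, min_sq_dist([1, 2, 3, 4], [2, 4]) can pick X' = [2, 4] for a cost of 0.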

Related

Find a subset of points such that the xs sum to at most A and the ys sum to at least B

I'm solving a problem and I reduced it to the following one:
Given n points (x1, y1), ..., (xn, yn) (0 <= xi, yi), and two integers A and B (0 <= A, B) I need to find a subset of those points such that:
1) The sum of the x values is at most A (SUM(x) <= A).
2) The sum of the y values is at least B (SUM(y) >= B).
I'm struggling with this problem and I cannot find a solution other than trying all combinations. I would appreciate some ideas.
UPD: xi and A are rationals (represented as floats/double). yi and B are integers.
Depending on the distribution and the space available, we could create a "merge sort" segment tree, where the segments divide the X range and the tree array-nodes are sorted by y values. Then test partitions of A such that we can choose multiple xs maxed by the right-bound of each part, choosing the top y values in O(log n) from the segment tree.
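Since the update fixes yi and B as integers, a knapsack-style DP over the y-sum is another common sketch: dp[b] holds the minimum x-sum over subsets whose (capped) y-sum is at least b, and a valid subset exists iff dp[B] <= A. This runs in O(n*B) time, so it is only attractive when B is moderate; the function name and the final feasibility check are my framing.

```python
def feasible(points, A, B):
    """points: list of (x, y) pairs with x float (rational), y non-negative int.

    dp[b] = minimum achievable sum of x over subsets whose y-sum,
            capped at B, is at least b.
    """
    INF = float('inf')
    dp = [INF] * (B + 1)
    dp[0] = 0.0
    for x, y in points:
        # Iterate b downwards so each point is used at most once:
        # updates only write to indices >= b that were already processed.
        for b in range(B, -1, -1):
            if dp[b] < INF:
                nb = min(B, b + y)
                dp[nb] = min(dp[nb], dp[b] + x)
    return dp[B] <= A   # cheapest x-sum reaching y-sum B fits under A
```

The cap min(B, b + y) keeps the table size at B+1 without changing feasibility, since any y-sum beyond B is no better than exactly B.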

Find largest square submatrix and largest rectangular submatrix in a sorted nxn matrix where the submatrix contains elements between x and y

You are given an nxn matrix of integers such that the entries of each row are sorted in increasing order from left to right, and the entries of each column are sorted in increasing order from top to bottom. (I have included the matrix as an image below.)
Now, given x and y with x < y, I need to compute the largest square submatrix containing only numbers greater than x and less than y, in O(n) complexity. For example, for x=7 and y=50, the following 3x3 matrix is the answer. (Included below as an image.)
Next, given x and y with x < y, I need to compute the largest rectangular submatrix containing only numbers greater than x and less than y, in O(nlogn) complexity. For example, for x=7 and y=52, the following 3x4 matrix is the answer. (Included below as an image.)
Please help me in solving these questions.

Data structure to hold and retrieve points in a plane

Definition 1: Point (x,y) is controlling point (x',y') if and only if x < x' and y < y'.
Definition 2: Point (x,y) is controlled by point (x',y') if and only if x' < x and y' < y.
I'm trying to come up with data structure to support the following operations:
Add(x,y) - Adds a point (x,y) to the system in O(logn) complexity, where n is the number of points in the system.
Remove(x,y) - Removes a point (x,y) from the system in O(logn) complexity, where n is the number of points in the system.
Score(x,y) - Returns the number of points (x,y) controls minus the number of points that (x,y) is controlled by. Worst case complexity O(logn).
I've tried to solve it using multiple AVL trees, but could not come up with an elegant enough solution.
Point (x,y) is controlling point (x',y') if and only if x < x' and y < y'.
Point (x,y) is controlled by point (x',y') if and only if x' < x and y' < y.
Let's assume that (x,y) is the middle of the square, with the four quadrants A, B, C, D around it.
(x,y) is controlling points in B square and is being controlled by points in C.
The output required is the number of points (x,y) controls minus the number of points (x,y) is controlled by, which is the number of points in B minus the number of points in C, i.e. B-C (referring to the number of points in A, B, C, D simply as A, B, C, D).
We can easily calculate the number of points in A+C, that's simply the number of points with x' < x.
Same goes for C+D (points with y' < y), A+B (points with y' > y), and B+D (points with x' > x).
We add up A+C to C+D which is A+2C+D.
Add up A+B to B+D which is A+2B+D.
Deduct the two: A+2B+D-(A+2C+D) = 2B-2C, divide by two: (2B-2C)/2 = B-C which is the output needed.
(I'm assuming handling the 1D case is simple enough and there is no need to explain.)
For the sake of future reference
Solution outline:
We will maintain two AVL trees.
Tree_X: will hold points sorted by their X coordinate.
Tree_Y: will hold points sorted by their Y coordinate.
Each node within both trees will hold the following additional data:
The number of points in its left sub-tree.
The number of points in its right sub-tree.
For a point (x,y) we will define regions A, B, C, D:
Point (x',y') is in A if x' < x and y' > y.
Point (x',y') is in B if x' > x and y' > y.
Point (x',y') is in C if x' < x and y' < y.
Point (x',y') is in D if x' > x and y' < y.
Now it is clear that Score(x,y) = |B|-|C|.
However |A|+|C|, |B|+|D|, |A|+|B|, |C|+|D| can easily be retrieved from our two AVL trees, as we will soon see.
And notice that [(|A| + |B| + |B| + |D|) - (|A| + |C| + |C| + |D|)]/2 = |B|-|C|.
Implementation of required operations:
Add(x,y) - We will add the point (x,y) to both of our AVL trees. Since the additional data we are storing is affected only along the insertion path, and since insertion costs O(logn), the total cost of Add(x,y) is O(logn).
Remove(x,y) - We will remove the point (x,y) from both of our AVL trees. Since the additional data we are storing is affected only along the removal path, and since removal costs O(logn), the total cost of Remove(x,y) is O(logn).
Score(x,y) - I will show how to calculate |B|+|D|; the other sums are computed in a similar way at the same complexity. It is clear that |B|+|D| is the number of points which satisfy x' > x. To calculate this number we will:
Find x in Tree_X. Complexity O(logn).
Walk upwards from x to the root of Tree_X; whenever the current node is a left child of its parent, add one (for the parent) plus the number of points in the parent's right sub-tree. Complexity O(logn).
Total cost of Score(x,y) is O(logn).
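The counting identity can be checked with a compact sketch: the class below keeps two sorted lists and answers Score with four rank queries, exactly as in the [(A+B) + (B+D) - (A+C) - (C+D)]/2 = B-C derivation. Class and method names are mine, and Python lists make add/remove O(n), so this is only a stand-in for the size-augmented AVL trees.

```python
import bisect

class PointSet:
    def __init__(self):
        self.xs, self.ys = [], []

    def add(self, x, y):
        bisect.insort(self.xs, x)
        bisect.insort(self.ys, y)

    def remove(self, x, y):
        self.xs.pop(bisect.bisect_left(self.xs, x))
        self.ys.pop(bisect.bisect_left(self.ys, y))

    def score(self, x, y):
        n = len(self.xs)
        a_plus_c = bisect.bisect_left(self.xs, x)        # points with x' < x
        c_plus_d = bisect.bisect_left(self.ys, y)        # points with y' < y
        a_plus_b = n - bisect.bisect_right(self.ys, y)   # points with y' > y
        b_plus_d = n - bisect.bisect_right(self.xs, x)   # points with x' > x
        # [(A+B) + (B+D) - (A+C) - (C+D)] / 2 = B - C
        return ((a_plus_b + b_plus_d) - (a_plus_c + c_plus_d)) // 2
```

For the points (1,1), (2,2), (3,3), Score(1,1) is 2 (it controls the other two and is controlled by none) and Score(3,3) is -2.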

Computational Geometry set of points algorithm

I have to design an algorithm with running time O(nlogn) for the following problem:
Given a set P of n points, determine a value A > 0 such that the shear transformation (x,y) -> (x+Ay,y) does not change the order (in x direction) of points with unequal x-coordinates.
I am having a lot of difficulty even figuring out where to begin.
Any help with this would be greatly appreciated!
Thank you!
I think y = 0.
When x = 0 and A > 0:
(x,y) -> (x+Ay,y) -> (0+(A*0),0) = (0,0)
When x = 1 and A > 0:
(x,y) -> (x+Ay,y) -> (1+(A*0),0) = (1,0)
and so on for points with unequal x-coordinates: (2,0), (3,0), (4,0)...
So I think that the beginning point may be (0,0), i.e. x=0.
Suppose all x,y coordinates are positive numbers. (This is without loss of generality, since one can add offsets.) In time O(n log n), sort a list L of the points, primarily in ascending order by x coordinate and secondarily in ascending order by y coordinate. In time O(n), process consecutive point pairs of L as follows. Let p, q be any two consecutive points in L, and let px, qx, py, qy denote their x and y coordinate values. There are three cases. If px = qx, do nothing. Else, if py <= qy, do nothing, since any A > 0 already gives px + A*py < qx + A*qy. Else (px < qx, py > qy) require that px + A*py < qx + A*qy, i.e. A < (qx-px)/(py-qy).
So: go through L in order and let A' be the minimum of (qx-px)/(py-qy) over all consecutive pairs with px < qx and py > qy; any A with 0 < A < A' satisfies every pair, so choose, for example, A = A'/2. (Or, if the object of the problem is to find the largest such A, just report the A' value.)
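The scan described above can be sketched as follows (the A'/2 choice is from the answer; the fallback value 1.0 when no pair constrains A is an arbitrary pick of mine, since any positive A then works):

```python
def find_shear(points):
    """Return A > 0 such that (x, y) -> (x + A*y, y) preserves the
    x-order of points with unequal x-coordinates."""
    pts = sorted(points)                    # ascending by x, then by y
    bound = float('inf')
    for (px, py), (qx, qy) in zip(pts, pts[1:]):
        if px < qx and py > qy:             # only such pairs constrain A
            bound = min(bound, (qx - px) / (py - qy))
    return 1.0 if bound == float('inf') else bound / 2
```

For example, for the points (0,1) and (1,0) the only constraint is A < 1, so the sketch returns 0.5; shearing by 0.5 maps them to (0.5, 1) and (1, 0), keeping their x-order.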
Ok, here's a rough stab at a method.
Sort the list of points by x order. (This gives the O(nlogn)--all the following steps are O(n).)
Generate a new list of dx_i = x_(i+1) - x_i, the differences between the x coordinates. As the x_i are ordered, all of these dx_i >= 0.
Now for some A, the transformed dx_i(A) will be x_(i+1) - x_i + A * (y_(i+1) - y_i). There will be an order change if this is negative or zero, i.e. if x_(i+1)(A) <= x_i(A).
So for each dx_i, find the value of A that would make dx_i(A) zero, namely
A_i = -(x_(i+1) - x_i)/(y_(i+1) - y_i). You now have a list of coefficients that would 'cause' an order swap between a consecutive (in x-order) pair of points. Watch for division by zero, but that is the case where two points have the same y; such points will never change order. Some of the A_i will be negative; discard these, as you want A > 0. (A negative A_i would also induce an order swap, so the A > 0 requirement is a little arbitrary.)
Find the smallest A_i > 0 in the list. Any A with 0 < A < A_i(min) is a shear that does not change the order of your points. Picking A = A_i(min) itself would bring two points to the same x, but not past each other.

How to quickly compute the distance between high-dimensional vectors

Assume there are three groups of high-dimensional vectors:
{a_1, a_2, ..., a_N},
{b_1, b_2, ..., b_N},
{c_1, c_2, ..., c_N}.
Each of my vectors can be represented as x = a_i + b_j + c_k, where 1 <= i, j, k <= N. The vector is then encoded as (i, j, k), which can later be decoded as x = a_i + b_j + c_k.
My question is: given two vectors x = (i_1, j_1, k_1) and y = (i_2, j_2, k_2), is there a method to compute the Euclidean distance between them without decoding x and y?
Square root of the sum of squares of the differences between components. There's no other way to do it.
You should scale the values to guard against overflow/underflow issues. Search for the max difference and divide all the components by it before squaring, summing, and taking the square root.
Let's assume you have only two groups. You are trying to compute the scalar product
(a_i1 + b_j1, a_i2 + b_j2)
= (a_i1,a_i2) + (b_j1,b_j2) + (a_i1,b_j2) + (a_i2,b_j1) # <- elementary scalar products
So if you know the necessary elementary scalar products between the elements of your vectors a_i, b_j, c_k, then, you do not need to "decode" x and y and can compute the scalar product directly.
Note that this is exactly what happens when you compute an ordinary Euclidean distance in a non-orthogonal basis.
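Concretely, with three groups you can precompute the six Gram matrices of elementary scalar products once (O(N^2 * D) total) and then recover ||x - y|| from nine table lookups per pair. A numpy sketch on made-up data (all names and sizes are illustrative):

```python
import numpy as np

# Made-up data: N vectors per group, dimension D.
rng = np.random.default_rng(0)
N, D = 4, 16
A = rng.normal(size=(N, D))
B = rng.normal(size=(N, D))
C = rng.normal(size=(N, D))

# Precompute the six Gram matrices of elementary scalar products once.
G_aa, G_ab, G_ac = A @ A.T, A @ B.T, A @ C.T
G_bb, G_bc, G_cc = B @ B.T, B @ C.T, C @ C.T

def dot_from_codes(cx, cy):
    # (x, y) for x = a_i1 + b_j1 + c_k1 and y = a_i2 + b_j2 + c_k2,
    # expanded into the nine elementary scalar products.
    i1, j1, k1 = cx
    i2, j2, k2 = cy
    return (G_aa[i1, i2] + G_ab[i1, j2] + G_ac[i1, k2]
            + G_ab[i2, j1] + G_bb[j1, j2] + G_bc[j1, k2]
            + G_ac[i2, k1] + G_bc[j2, k1] + G_cc[k1, k2])

def dist_from_codes(cx, cy):
    # ||x - y||^2 = (x,x) - 2(x,y) + (y,y); max() guards against a tiny
    # negative value caused by floating-point rounding.
    sq = dot_from_codes(cx, cx) - 2.0 * dot_from_codes(cx, cy) + dot_from_codes(cy, cy)
    return np.sqrt(max(0.0, sq))
```

Each distance query then costs O(1) lookups instead of O(D) arithmetic on the decoded vectors.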
If you are happy with an approximate result, you could project your high-dimensional vectors into a low-dimensional space using a random projection. The Johnson-Lindenstrauss lemma says that you can reduce the dimension to O(log N) while distances remain approximately the same with high probability.
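A minimal random-projection sketch of that idea (the dimensions, seed, and 1/sqrt(d) scaling are illustrative choices; the scaling makes the projected squared norm unbiased):

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 5000, 64                            # original and reduced dimension (d ~ O(log N))
P = rng.normal(size=(d, D)) / np.sqrt(d)   # Gaussian random projection

x = rng.normal(size=D)
y = rng.normal(size=D)
exact = np.linalg.norm(x - y)
approx = np.linalg.norm(P @ x - P @ y)     # distance measured in the reduced space
```

The relative error concentrates around roughly sqrt(2/d), so larger d buys tighter distance preservation at the cost of slower queries.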
