Minimize Sum of Absolute Difference of Two Arrays - algorithm

I have two arrays of integers A and B both of size n. The cost of a pair is |A(i) - B(i)|.
I want to pair the n elements of A and B such that the sum of all costs across all A(i)s and B(i)s are minimized.
I understand that I can get O(n log n) by sorting A, then sorting B, and then pairing them together from 1...n respectively, but after attempting for hours and hours, I can't figure out how to prove it. Can somebody help me out?
I've seen how to implement it, I just don't get how to prove it

I am following a slightly different approach here to prove this fact by making use of squares rather than absolute.
Consider 2 arrays, A = [a1, a2, ..., an] and B = [b1, b2, ..., bn].
Now, even if I use random pairing (form a pair using any index from A and B ),
Let's say, the sum of squares of difference (S) = a1^2 + b1^2 + a2^2 + b2^2 + ... + an^2 + bn^2 - 2 * (a1 * b3 + a2 * b4 + .... + an * b56 + bn * a34).
The above sum can be represented as S = sum(ai^2) + sum(bi^2) - 2 * sum(ai*bi), for i goes from 1 to n.
To minimise this sum, we need to maximise the part sum(ai*bi), for i goes from 1 to n.
The term sum(ai*bi) will be maximum when the 2 arrays will be sorted.
Thanks for pointing out #Abhinav Mathur: The statement The term sum(ai*bi) will be maximum when the 2 arrays will be sorted can be proved using rearrangement inequality.

Assume that according to the current sorted arrays, there is a pair |x-a|, and another pair |y-b|. Let's say that switching the elements would give a lesser sum i.e. a more optimal solution.
(Note: while switching around two pairs, the rest of array remains unaffected).
Current total sum of pairs = |x-a| + |y-b|
Modified sum after switching pairs = |x-b| + |y-a|
Difference in sums = diff = |x-b| - |x-a| + |y-a| + |y-b|
If diff is negative, it means we have found a better ordering. If not, it means our original solution was better.
Now, you can take cases and analyse this. (Since the arrays are sorted, let x<y (they're from the first array) and a<b (they're from second array).
Case 1: x>b or y<a:
In this case, both sums will be equal, which can be easily seen by expanding the modulus
Case 2: a<x<b:
If y>b, diff = 2*(b-x). Since we assumed b>x, diff is positive.
If y<b, diff = 2*(y-x). Since y>x as stated earlier, diff is again positive.
You can continue taking similar cases and prove that diff will always be positive, meaning that our original ordering will be the most efficient one.

Sorting and pairing creates a matching that we might call "monotonic", which ensures that if A[i] matches B[x] and A[j] matches B[y], then:
If A[i] < A[j] then B[x] <= B[y]; and
If B[x] < B[y] then A[i] <= A[j]
If you choose a matching that is not monotonic, then one of these rules will be violated for some pair of matchings.
If we pick any two elements from both arrays such that A[i] <= A[j] and B[x] <= B[y], then we can evaluate the cost of the monotonic pairing and the other pairing. Note that if A[j] = A[j] or B[i] = B[j] then both pairings have the same cost so it doesn't matter which one we call monotonic.
In order to compare the costs, we need to get rid of the absolute value operations. We can do that by separately considering all the possible orderings between the 4 values:
Case: A[i] <= A[j] <= B[x] <= B[y]:
Monotonic cost: B[x]-A[i] + B[y]-A[j]
Swapped cost: B[y]-A[i] + B[x]-A[j]
Difference: 0
cost is the same - doesn't matter which we choose
Case: A[i] <= B[x] <= A[j] <= B[y]
Monotonic cost: B[x]-A[i] + B[i]-A[j]
other cost: B[y]-A[i] + A[j]-B[x]
Difference: 2A[j] - 2B[x]
since A[j] >= B[x], monotonic is as good or better
... etc
If you go through all 6 possible orderings, in every case you find that the monotonic matching is as good or better. Given any matching, you can make every pair of element matchings monotonic, and the cost can only go down.
If you start with an optimal matching and make every pair of matchings monotonic then you end up with an optimal monotonic matching. (In fact the one you start with has to be monotonic if it's optimal, but we don't have to prove that) Since every monotonic matching has the same cost, and at least one of them is optimal, they must all be optimal.

Related

Adding two arrays into one

I have two arrays (a and b) of size n, (positive whole numbers)
a= [a1…..an] b= [b1….bn]
I want to store them in array c, also an array of size n
c=[c1…..cn]
where I add one element from a plus one element from b (each used once) into c, lets say the first element in c is combining a1+b3
Quick example:
n=4 a=[a1,a2,a3,a4] b=[b1,b2,b3,b4]
one way could be:
c=[a1+b2,b3+a4,a2+b1,a3+b4]
The problem is that I want to add them in a way so that the elements in c become as evenly distributed as possible,
One ideal case would be that c came out as:
c=[5,5,5,5]
but the numbers in a and b might not match up so they become even, so I want it to come as close to even as possible.
I an trying to find a way so that the difference between the biggest number in c minus the smallest number in c (after being combined as evenly as I can) to be as small as possible. In my optimal example above that would be 5-5=0 which is most optimal since 0 is the smallest minimum difference I want to achieve. Some other case with other numbers might come out as 6-5=1, which might be the smallest I could get in that situation
My way of going would be to sort array a in ascending order and my array b in descending order,and then combining them with the same element that they are in. Im not sure if this is the best way or the fastest to do this in, I want my code (doing it with python) to be fast. I cant come up with a better way where I could distribute them more evenly,any clue if there are better ways to solve this problem? I really appreciate all advice I could get! Thank you
When trying to solve it in a way where one of the arrays is ascending, and the other one being descending, there might already exist an algorithm that solves it better that I have not thought of. Thank you for reading!
Your algorithm is both correct and fast. It is just proving it that is optimal which is tricky.
We can do this by proving the following two results.
Any other matching of a and b will lead to a maximum at least as big as yours.
Any other matching of a and b will lead to a minimum at least as small as yours.
And the conclusion is that any other matching must have a maximum-minimum at least as big as yours. From which yours must be optimal.
Now let's look at part 1. Sort a ascending, and b descending. Find the i such that c[i] = a[i] + b[i] is a maximum. Suppose that m is any other matching where we're matching up a[j] + b[m[j]]. Note that m[1], ..., m[n] is a permutation of 1, ..., n.
If a[i] + b[m[i]] >= a[i] + b[i] then part 1 is true..
If a[i] + b[m[i]] < a[i] + b[i] then b[m[i]] < b[i] and so we must have i < m[i]. Now there are n-i numbers in the range i+1, ..., n. m maps something out of that range into that range. Because m is a permutation, by the pigeonhole principle, m must map something in that range, out of that range.
In other words there must be a j > i such that m[j] <= i. But now a[i] <= a[j] and b[i] <= b[m[j]] and therefore a[i] + b[i] <= a[j] + b[m[j]]. And so part 1 is true again.
That concludes the proof of part 1.
The proof of part 2 is similar. Except now a[i] + b[i] is at a minimum, m[i] < i, there is a j < i with i <= m[j], a[j] <= a[i], b[m[j]] <= b[i], and a[j] + b[m[j]] <= a[i] + b[i].
And as noted, part 1 and part 2 together implies that you've minimized the difference between the minimum and maximum.

Correctness of greedy algorithm

In non-decreasing sequence of (positive) integers two elements can be removed when . How many pairs can be removed at most from this sequence?
So I have thought of the following solution:
I take given sequence and divide into two parts (first and second).
Assign to each of them iterator - it_first := 0 and it_second := 0, respectively. count := 0
when it_second != second.length
if 2 * first[it_first] <= second[it_second]
count++, it_first++, it_second++
else
it_second++
count is the answer
Example:
count := 0
[1,5,8,10,12,13,15,24] --> first := [1,5,8,10], second := [12,13,15,24]
2 * 1 ?< 12 --> true, count++, it_first++ and it_second++
2 * 5 ?< 13 --> true, count++, it_first++ and it_second++
2 * 8 ?< 15 --> false, it_second++
8 ?<24 --> true, count ++it_second reach the last element - END.
count == 3
Linear complexity (the worst case when there are no such elements to be removed. n/2 elements compare with n/2 elements).
So my missing part is 'correctness' of algorithm - I've read about greedy algorithms proof - but mostly with trees and I cannot find analogy. Any help would be appreciated. Thanks!
EDIT:
By correctness I mean:
* It works
* It cannot be done faster(in logn or constant)
I would like to put some graphics but due to reputation points < 10 - I can't.
(I've meant one latex at the beginning ;))
Correctness:
Let's assume that the maximum number of pairs that can be removed is k. Claim: there is an optimal solution where the first elements of all pairs are k smallest elements of the array.
Proof: I will show that it is possible to transform any solution into the one that contains the first k elements as the first elements of all pairs.
Let's assume that we have two pairs (a, b), (c, d) such that a <= b <= c <= d, 2 * a <= b and 2 * c <= d. In this case, pairs (a, c) and (b, d) are valid, too. And now we have a <= c <= b <= d. Thus, we can always transform out pairs in such a way that the first element from any pair is not greater than the second element of any pair.
When we have this property, we can simply substitute the smallest element among all first all elements of all pairs with the smallest element in the array, the second smallest among all first elements - with the second smallest element in the array and so on without invalidating any pair.
Now we know that there is an optimal solution that contains k smallest elements. It is clear that we cannot make the answer worse by taking the smallest unused element(making it bigger can only reduce the answer for the next elements) which fits each of them. Thus, this solution is correct.
A note about the case when the length of the array is odd: it doesn't matter where the middle element goes: to the first or to the second half. In the first half it is useless(there are not enough elements in the second half). If we put it to the second half, it is useless two(let's assume that we took it. It means that there is "free space" somewhere in the second half. Thus, we can shift some elements by one and get rid of it).
Optimality in terms of time complexity: the time complexity of this solution is O(n). We cannot find the answer without reading the entire input in the worst case and reading is already O(n) time. Thus, this algorithm is optimal.
Presuming your method. Indices are 0-based.
Denote in general:
end_1 = floor(N/2) boundary (inclusive) of first part.
Denote while iterating:
i index in first part, j index in second part,
optimal solution until this point sol(i,j) (using algorithm from front),
pairs that remain to be paired-up optimally behind (i,j) point i.e. from
(i+1,j+1) onward rem(i,j) (can be calculated using algorithm from back),
final optimal solution can be expressed as the function of any point as sol(i,j) + rem(i,j).
Observation #1: when doing algorithm from front all points in [0, i] range are used, some points from [end_1+1, j] range are not used (we skip a(j) not large engough). When doing algorithm from back some [i+1, end_1] points are not used, and all [j+1, N] points are used (we skip a(i) not small enough).
Observation #2: rem(i,j) >= rem(i,j+1), because rem(i,j) = rem(i,j+1) + M, where M can be 0 or 1 depending on whether we can pair up a(j) with some unused element from [i+1, end_1] range.
Argument (by contradiction): let's assume 2*a(i) <= a(j) and that not pairing up a(i) and a(j) gives at least as good final solution. By the algorithm we would next try to pair up a(i) and a(j+1). Since:
rem(i,j) >= rem(i,j+1) (see above),
sol(i,j+1) = sol(i,j) (since we didn't pair up a(i) and a(j))
we get that sol(i,j) + rem(i,j) >= sol(i,j+1) + rem(i,j+1) which contradicts the assumption.

Find a permutation that minimizes the sum

I have an array of elements [(A1, B1), ..., (An, Bn)] (all are positive floats and Bi <= 1) and I need to find such permutation which mimimizes the sum A1 + B1 * A2 + B1 * B2 * A3 + ... + B1 * ... B(n-1) * An.
Definitely I can just try all of them and select the one which gives the smallest sum (this will give correct result in O(n!)).
I tried to change the sum to A1 + B1 * (A2 + B2 * (A3 + B3 * (... + B(n-1) * An)) and tried to use a greedy algorithm which grabs the biggest Ai element on each of the steps (this does not yield a correct result).
Now when I look at the latest equation, it looks to me that here I see optimal substructure A(n - 1) + B(n - 1) * An and therefore I have to use dynamic programming, but I can not figure out correct direction. Any thoughts?
I think this can be solved in O(N log(N)).
Any permutation can be obtained by swapping pairs of adjacent elements; this is why bubble sort works, for example. So let's take a look at the effect of swapping entries (A[i], B[i]) and (A[i+1], B[i+1]). We want to find out in which cases it's a good idea to make this swap. This has effect only on the ith and i+1th terms, all others stay the same. Also, both before and after the swap, both terms have a factor B[1]*B[2]*...*B[i-1], which we can call C for now. C is a positive number.
Before the swap, the two terms we're dealing with are C*A[i] + C*B[i]*A[i+1], and afterwards they are C*A[i+1] + C*B[i+1]*A[i]. This is an improvement if the difference between the two is positive:
C*(A[i] + B[i]*A[i+1] - A[i+1] - B[i+1]*A[i]) > 0
Since C is positive, we can ignore that factor and look just at the As and Bs. We get
A[i] - B[i+1]*A[i] > A[i+1] - B[i]*A[i+1]
or equivalently
(1 - B[i+1])*A[i] > (1 - B[i])*A[i+1]
Both of these expressions are nonnegative; if one of B[i] or B[i+1] is one, then the term containing 'one minus that variable' is zero (so we should swap if B[i] is one but not if B[i+1] is one); if both variables are one, then both terms are zero. Let's assume for now that neither is equal to one; then we can rewrite further to obtain
A[i]/(1 - B[i]) > A[i+1]/(1 - B[i+1])
So we should compute this expression D[i] := A[i]/(1 - B[i]) for both terms and swap them if the left one is greater than the right one. We can extend this to the case where one or both Bs are one by defining D[i] to be infinitely big in that case.
OK, let's recap - what have we found? If there is a pair i, i+1 where D[i] > D[i+1], we should swap those two entries. That means that the only case where we cannot improve the result by swapping, is when we have reordered the pairs so that the D[i] values are in increasing order -- that is, all the cases with B[i] = 1 come last (recall that that corresponds to D[i] being infinitely large) and otherwise in increasing order of D[i] value. We can achieve that by sorting with respect to the D[i] value. A quick examination of our steps above shows that the order of pairs with equal D[i] value does not impact the final value.
Computing all D[i] values can be done in a single, linear-time pass. Sorting can be done with an O(N log(N)) algorithm (we needed the swapping-of-neighbouring-elements stuff only as an argument/proof to show that this is the optimal solution, not as part of the implementation).

Minimum of sum of absolute values

Problem statement:
There are 3 arrays A,B,C all filled with positive integers, and all the three arrays are of the same size.
Find min(|a-b|+|b-c|+|c-a|) where a is in A, b is in B, c is in C.
I worked on the problem the whole weekend. A friend told me that it can be done in linear time. I don't see how that could be possible.
How would you do it ?
Well, I think I can do it in O(n log n). I can only do O(n) if the arrays are initially sorted.
First, observe that you can permute a,b,c however you like without changing the value of the expression. So let x be the smallest of a,b,c; let y be the middle of the three; and let z be the maximum. Then note that the expression just equals 2*(z-x). (Edit: This is easy to see... Once you have the three numbers in order, x < y < z, the sum is just (y-x) + (z-y) + (z-x) which equals 2*(z-x))
Thus, all we are really trying to do is find three numbers such that the outer two are as close together as possible, with the other number "sandwiched" between them.
So start by sorting all three arrays in O(n log n). Maintain an index into each array; call these i, j, and k. Initialize all three to zero. Whichever index points to the smallest value, increment that index. That is, if A[i] is smaller than B[j] and C[k], increment i; if B[j] is smallest, increment j; if C[k] is smallest, increment k. Repeat, keeping track of |A[i]-B[j]| + |B[j]-C[k]| + |C[k]-A[i]| the whole time. The smallest value you observe during this march is your answer. (When the smallest of the three is at the end of its array, stop because you are done.)
At each step, you add one to exactly one index; but you can only do this n times for each array before hitting the end. So this is at most 3*n steps, which is O(n), which is less than O(n log n), meaning the total time is O(n log n). (Or just O(n) if you can assume the arrays are sorted.)
Sketch of a proof that this works: Suppose A[I], B[J], C[K] are the a, b, c that form the actual answer; i.e., they have the minimum |a-b|+|b-c|+|c-a|. Suppose further that a > b > c; the proof for the other cases is symmetric.
Lemma: During our march, we do not increment j past J until after we increment k past K. Proof: We always increment the index of the smallest element, and when k <= K, B[J] > C[k]. So when j=J and k <= K, B[j] is not the smallest element, so we do not increment j.
Now suppose we increment k past K before i reaches I. What do things look like just before we perform that increment? Well, C[k] is the smallest of the three at that moment, because we are about to increment k. A[i] is less than or equal to A[I], because i < I and A is sorted. Finally, j <= J because k <= K (by our Lemma), so B[j] is also less than A[I]. Taken together, this means our sum-of-abs-diff at this moment is less than 2*(c-a), which is a contradiction.
Thus, we do not increment k past K until i reaches I. Therefore, at some point during our march i=I and k=K. By our Lemma, at this point j is less than or equal to J. So at this point, either B[j] is less than the other two and j will get incremented; or B[j] is between the other two and our sum is just 2*(A[i]-C[k]), which is the right answer.
This proof is sloppy; in particular, it fails to explicitly account for the case where one or more of a,b,c are equal. But I think that detail can be worked out pretty easily.
I would write a really simple program like this:
#!/usr/bin/python
import sys, os, random
A = random.sample(range(100), 10)
B = random.sample(range(100), 10)
C = random.sample(range(100), 10)
minsum = sys.maxint
for a in A:
for b in B:
for c in C:
print 'checking with a=%d b=%d c=%d' % (a, b, c)
abcsum = abs(a - b) + abs(b - c) + abs(c - a)
if abcsum < minsum:
print 'found new low sum %d with a=%d b=%d c=%d' % (abcsum, a, b, c)
minsum = abcsum
And test it over and over until I saw some pattern emerge. The pattern I found here is what would be expected: the numbers that are closest together in each set, regardless of whether the numbers are "high" or "low", are those that produce the smallest minimum sum. So it becomes a nearest-number problem. For whatever that's worth, probably not much.

Need an algorithm for this problem

There are two integer sequences A[] and B[] of length N,both unsorted.
Requirement: through the swapping of elements between A[] and B[]( can randomly exchange, not with same index), make the difference between {the sum of all elements in A[]} and {the sum of all elements in B[]} to be minimum.
PS: actually,it is an interview question I encountered.
Many thanks
This is going to be NP-hard! I believe you can do a reduction from Subset Sum to this.
As per BlueRaja/polygene's comments, I will try to provide a full reduction from Subset Sum.
Here is a reduction:
Subset Sum problem: Given integers x1, x2, ..., xn, is there some non-empty subset which sums to zero?
Our problem: Given two integer arrays of size k, find the minimum possible difference of the sum of the two arrays, assuming we can shuffle around the integers in the arrays, treating both arrays as one array.
Say we had a polynomial time algo for our problem.
Say now you are given integers T = {x1,x2, ...,xn} (multiset)
Let Si = x1 + x2 + ...+ xn + xi.
Let Ti = {x1, x2, ..., xi-1, xi+1, ..., xn } ( = T - xi)
Define
Ai = Array formed using Ti
Bi = [Si, 0, ..., 0] (i.e one element is Si and rest are zeroes).
Let mi = the min difference found by our problem for arrays Ai and Bi
(we run our problem n times).
Claim: Some non-empty subset of T sums to zero if and only if, there is some i, for which mi = 0.
Proof: (wlog) say x1 + x2 + .. + xk = 0
Then
A = [xk+1, ..., xn, 0, ...0]
B = [x2, x3, ..., xk, S1, 0, ..0]
gives the minimum difference m1 to be |x2 + .. + xk + (x1 + ... + xn) + x1 - (xk+1 + .. + xn)| = |2(x1+ x2 + .. xk)| = 0.
Similarly the if part can be proved.
In fact, this actually also follows (more easily) from Partition too: just create new array with all zeroes.
Hoepfully I haven't made any mistakes.
Take any instance of the NP-complete partition problem:
Partition a multiset A of positive integers into two multisets B and C with the same sum
like {a1,a2,...,an}. Add n zeroes {0,0,0...,0,a1,...,an} and ask if the set can be partitioned into two multisets A and B with the same sum and same number of elements. I claim these two conditions are equivalent:
If A and B are a solution to the problem, then you can strike out the zeroes and get a solution of partiton problem.
If there is a solution to the partition problem, for example ai1 + ai2 + ... aik = aj1 + ... +ajl where {ai1, ai2, aik, aj1, ..., ajl} = {a1, ... , an} then obviously k+l = n. Add l zeroes to the left side and k zeroes to the right side and you'll get 0 + ... + 0 + ai1 + ai2 + ... aik = 0 + ... + 0 + aj1 + ... +ajl, whichi is a solution of your problem.
So, this is a reduction (so the problem is NP-hard) and the problem is NP, so it is NP-complete.
"sequences A[] and B[] of length N" -> does this mean both A and B are each of length N?
(For the purpose of clarity I am using 1-based arrays below).
If so, how about this:
Assume A[1..N] and B[1..N]
Concatenate A and B into a new array C of length 2N: C[1..N] <- A[1..N]; C[N+1 .. 2N] <- B[1..N]
Sort C in ascending order.
Take the first pair of numbers from C; send the first element (C[1]) to A[1] and second element (C[2]) to B[1]
Take the second pair of numbers from C; this time send the second element (C[4]) to A[2] and the first element (C[3]) to B[2] (the order of elements in the pair sent to A and B is the opposite of 3)
... repeat 3 and 4 until C is exhausted
The observation here is that, in a sorted array, an adjacent pair of numbers will have the smallest difference (compared to a pair of numbers from non-adjacent positions). Step 3 ensures that A[1] and B[1] consists of a pair of numbers with the least possible difference. Step 4 ensures that (a) A[2] and B[2] consist of a pair of numbers with the least possible difference (from the available numbers) and also (b) that the difference is opposite in sign from step 3. By continuing like this, we are ensuring that A[i] and B[i] contain numbers with the least possible difference. Also, by flipping the order in which we send elements to A and B, we are ensuring that the difference changes sign for each successive i.
Try being greedy about it. Given such limited information, I'm not sure what else one could put out there.
I'm not sure that this will ensure the minimum possible distance, but the first thing that comes to mi mind is something like this:
int diff=0;
for (int i = 0; i<len; i++){
int x = a[i] - b[i];
if (abs(diff - x) > abs(diff + x)){
swap(a,b,i);
diff-=x;
}else{
diff+=x;
}
}
assuming that you have a swap function which takes the two arrays and exchanges the items at position i :)
computing and adding the difference between the two values at position i you get the incremental difference between the sums of the elements of the two arrays.
at each step you check if it's better to add (a[i]-b[i]) or (b[i]-a[i]). if the b[i]-a[i] it's the case, you swap the elements at position i in the arrays.
Maybe this will not be the best way, but it should be a start :)
The problem is NP-Complete.
We can reduce the partition problem to the decision version of this problem, i.e. given two arrays of ints of the same size, determine whether items can be swapped so that the sums are equal.
The input to the partition problem: a set S of integers, of size N
In order to transform this input into an input to our problem, we define A to be an array of all items in S, and B an array of the same size, with B[i]=0 for all i. This transformation is linear in the input size.
It is clear that our algorithm applied on A and B returns true if and only if there is a partition of S into 2 subsets such that the sums are equal.

Resources