How does one approach this problem? (DSA / algorithms)

Two arrays of length n are given, along with two numbers c and d.
Find the count of all pairs (i, j) with i < j that satisfy the condition a[i]-a[j]+c <= b[i]-b[j]+d.
I have been thinking about it for hours and I don't know how to approach the problem. My initial thought was to hash the array to save me from anything O(n^2).

I will describe an O(n * log n) time complexity solution.
Firstly, notice your condition is equivalent to a[i]-b[i] <= a[j]-b[j]+x, where x = d - c is a constant.
Consider the array f such that f[i] = a[i] - b[i].
The condition is equivalent to f[i] <= f[j] + x.
To count the number of pair satisfying this condition, you can use divide and conquer: break the array in two halves recursively. The total count is the number of pairs satisfying the condition in the first half (let it be q1) plus the number of pairs satisfying the condition in the second half (let it be q2) plus the number of pairs satisfying the condition with one element in the first half and the other one in the second half (let it be q12).
The trick is that q12 is the same whether or not each half is sorted (sorting within a half does not change which cross-half pairs exist). And if both halves are sorted, you can count q12 in O(n) (proof left as an exercise). So the divide and conquer recursion satisfies the inequality T(n) <= 2*T(n/2) + O(n). By the master theorem, you can conclude the time complexity is O(n * log n).
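For concreteness, here is a minimal Python sketch of this divide and conquer count, with x = d - c as derived above (the function name count_pairs is mine, not part of the original problem):
import heapq

def count_pairs(a, b, c, d):
    """Count pairs i < j with a[i] - a[j] + c <= b[i] - b[j] + d."""
    x = d - c
    f = [ai - bi for ai, bi in zip(a, b)]

    def solve(seg):
        # Return (number of valid pairs within seg, sorted copy of seg).
        if len(seg) <= 1:
            return 0, seg
        mid = len(seg) // 2
        q1, left = solve(seg[:mid])
        q2, right = solve(seg[mid:])
        # Cross pairs: i in the left half, j in the right half, f[i] <= f[j] + x.
        # Both halves are sorted, so a single two-pointer sweep counts them in O(n).
        q12, p = 0, 0
        for fj in right:
            while p < len(left) and left[p] <= fj + x:
                p += 1
            q12 += p
        return q1 + q2 + q12, list(heapq.merge(left, right))

    return solve(f)[0]

# Tiny check: f = [2, 0, 1], x = 0; only the pair (i=1, j=2) satisfies f[i] <= f[j].
print(count_pairs([3, 1, 2], [1, 1, 1], 0, 0))  # 1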

Related

Find a subset of k elements with sum at most S

I have to find an algorithm for the following problem:
The input is two natural numbers S and k, and an unsorted set of n pairwise different numbers.
Decide in O(n) whether there is a subset of k numbers that sums to <= S. Note: k should not appear in the time complexity.
algorithm({x_1, ..., x_n}, k, S):
    if there exists a subset {x_{i_1}, ..., x_{i_k}} of k elements with x_{i_1} + ... + x_{i_k} <= S: return true
I can't find a solution with time complexity O(n).
What I was able to get is O(kn), as we search for the minimum k times and sum it up:
algorithm(a = {x_1, ..., x_n}, k, S):
    sum = 0
    for i = 1, ..., k:
        min = a.popFirst()
        for j = 1, ..., len(a):
            if a[j] < min:
                t = a[j]
                a[j] = min
                min = t
        sum += min
    if sum <= S:
        return true
    else:
        return false
This is in O(kn) and returns the right result. How can I lose the k?
Thanks for helping me, I'm really struggling with this one!
Quickselect can be used to find the k smallest elements: https://en.wikipedia.org/wiki/Quickselect
It's basically quicksort, except that you only recurse on the interesting side of the pivot.
A simple implementation runs in O(N) expected time, but using median-of-medians to select a pivot, you can make that a real worst-case bound: https://en.wikipedia.org/wiki/Median_of_medians
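As an illustration (a sketch, not the poster's code), here is how the quickselect idea could look in Python: partition so that the k smallest elements occupy the front of the array, then sum them. This is expected O(n); swapping in a median-of-medians pivot would give the worst-case bound.
import random

def sum_of_k_smallest(xs, k):
    # Partially order the array so its first k entries are the k smallest.
    a = list(xs)
    lo, hi = 0, len(a) - 1
    while lo < hi:
        # Lomuto-style partition around a random pivot.
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot, store = a[hi], lo
        for i in range(lo, hi):
            if a[i] < pivot:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[hi] = a[hi], a[store]
        if store == k - 1:      # the pivot is exactly the k-th smallest
            break
        elif store > k - 1:     # the k smallest all lie left of the pivot
            hi = store - 1
        else:                   # still need elements from the right side
            lo = store + 1
    return sum(a[:k])

def exists_subset(xs, k, S):
    return sum_of_k_smallest(xs, k) <= S

print(exists_subset([7, 2, 9, 4, 1], 3, 8))  # 1 + 2 + 4 = 7 <= 8 -> True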
You could build a min-heap of size k from the set. Time complexity of building this is O(n) expected time and O(n log k) worst case.
The heap should contain first k minimum elements from the set.
Then it is straightforward to check whether the sum of the elements in the heap is <= S. You don't need to remove the elements from the heap to compute the sum; just traverse the heap. Removing all elements would cost O(k log k).
You don't even need to consider the next larger elements, because replacing any element of the heap with a larger one would only increase the sum.
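For comparison, a very short sketch of the heap idea using Python's heapq.nsmallest, which runs in O(n log k):
import heapq

def exists_subset(xs, k, S):
    # The k smallest elements give the minimum possible sum of any k-subset.
    return sum(heapq.nsmallest(k, xs)) <= S

print(exists_subset([7, 2, 9, 4, 1], 3, 8))  # True: 1 + 2 + 4 = 7 <= 8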

What is the numerical complexity of computing the empirical cdf of an array?

It's all in the title. Suppose $X$ is an array of n floats. The empirical CDF is the function (of t):
Fn(t) = (1/n) sum{1{Xi <= t} : i=1,...,n}
This has to be computed for t_1 < t_2 < ... < t_m (i.e. for m different, sorted values of t). My question is: what is the numerical complexity of computing this? I think it is O(n log(n)) + O(m log(n)) [sort the array, then perform m binary searches, one for each value of t]
but I may be naive. Can anyone confirm?
Edit:
Sorry for the mess. While writing the question, I realized that I was imposing some constraints that are not in the original problem. I respond to Yves's question below.
The Xi are not sorted.
The t_j are sorted and equi-spaced.
m is smaller than n, but not by orders of magnitudes: typically m~n/4.
The given expression, a sum of N 0/1 terms, is clearly O(N).
UPDATE:
If the Xi are presorted, the function is trivially CDFi = CDF(Xi) = i/N, and the computation is in a way O(0)!
If the Xi are unsorted, you'll need to sort first in O(N.Log(N)), unless the range of the variable allows a faster sorting such as Counting sort.
If you only need to evaluate the CDF at a small number of points, say K, then you can consider using the naïve summation, as K.N can beat N.Log(N).
UPDATE: (second change by the OP)
Else, sort the Xi if necessary and sort the tj if necessary. Then a single linear pass will suffice. Total complexity will be one of:
O(n.Log(n) + m.Log(m))
O(n.Log(n) + m)
O(n + m.Log(m))
O(n + m).
If m < Log(n) and the Xi are unsorted, use the naïve formula. Complexity O(m.n).
Possibly there could be better options when m>n.
UPDATE: final specs: Xi unsorted, Tj sorted, m < n.
The solution I would choose is as follows:
1) Sort the Xi.
2) "Merge" the sorted Xi and Tj. This means, progress simultaneously in the X and T lists, keeping two running indexes; make sure to always increment the index that causes the shortest move; use CDF(Tj)=i/n. This is a linear process. (Very close to a merge in mergesort.)
Global complexity is O(n.Log(n)), the merging term O(n) being absorbed in the former.
UPDATE: uniform sampling.
When the Tj values are equi-spaced, say Tj = T0 + D.j, you can use a histogram approach.
Allocate an array of m+1 counters, initially 0. For every Xi, compute a bin index as Floor((Xi - T0) / D). Clamp negative values to 0 and values larger than m to m. Increment that bin. In the end, every bin tells you how many X values are in the range [Tj, Tj+1[.
Compute the prefix sum of the counters. They will now tell you how many X values are smaller than Tj+1, and CDF(j) = Counter[j] / n.
[Caution, this is an unchecked sketch, can be wrong in details.]
Total computation will take n bin incrementations followed by a prefix sum on m elements, i.e. O(n) operations.
# Input data
X = [0.125, 6, 3.25, 9, 1.4375, 6, 3.125, 7]
n = len(X)
# Sampling points (1 to 6)
T0 = 1
DT = 1
m = 6
# Initialize the counters: O(m)
C = [0] * m
# Accumulate the histogram: O(n)
for x in X:
    i = max(0, int((x - T0) / DT))
    if i < m:
        C[i] += 1
# Compute the prefix sum: O(m)
for i in range(m - 1):
    C[i + 1] += C[i]
# Reduce to frequencies: O(m)
for i in range(m):
    C[i] /= float(n)
# Display
print("T=", C)
T= [0.25, 0.25, 0.5, 0.5, 0.5, 0.75]
A CDF Fn(t) is always a non-decreasing function in [0..1]. Therefore I assume your notation is saying to count the number of elements Xi <= t and return that count divided by n.
Thus if t is very large, you have n/n = 1. For very small, it's 0/n = 0 as we'd expect.
This is a poor definition of an empirical CDF. See, for example, Law, Averill M., Simulation & Modeling, 4th ed., p. 301 for some more advanced ideas.
The simplest efficient way to compute your function (given that m, the number of Fn(t) values you need, is unknown) is first to sort the inputs Xi. This requires O(n log n) time, but needs to be done only once no matter how many t values you're processing.
Let's call the sorted values Yi. To find the count of Yi values <= t is the same as finding i such that Yi <= t < Yi+1. This can be done by binary search in O(log n) time for a given value of t. Divide by n and you have the Fn(t) value required. Of course you can repeat this m times to get the job done in O(m log n) time.
However you say your special case is m presorted values of t_j. You can find all the i values with a single pass over the Yi and simultaneously over the t_j, in the fashion of the merge operation in mergesort. With this you find all the answers in O(m + n) time.
Putting this together with the sorting cost, you have O(m + n + n log n) = O(m + n log n).
Note this is always faster than using the binary search lookup m times, O(n log n + m log n) = O((m + n) log n).
The only case where you'd want to skip the presorting is when m is smaller than log n. This is because with no presorting, processing all the t_j needs O(mn) time: you must touch all n elements to count the number <= t_j. Consequently, if m < log n, then skipping the presort costs less than O(n log n), i.e. it is asymptotically faster than the presort method.
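To illustrate the merge idea, here is a small Python sketch (assuming the Xi are unsorted and the t_j are already sorted; the function name is mine):
def empirical_cdf(X, ts):
    # Return Fn(t) for each t in ts; ts must be sorted. O(n log n + m) overall.
    Y = sorted(X)                 # O(n log n)
    n = len(Y)
    out = []
    i = 0
    for t in ts:                  # single merge-like pass over Y and ts: O(n + m)
        while i < n and Y[i] <= t:
            i += 1
        out.append(i / n)         # i = number of samples <= t
    return out

print(empirical_cdf([0.5, 2.0, 1.5, 3.0], [1, 2, 3]))  # [0.25, 0.75, 1.0]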

Find pairs with given difference

Given n, k and n integers, how would you find the pairs of integers whose difference is k?
There is an O(n log n) solution, but I cannot figure it out.
You can do it like this:
Sort the array
For each item data[i], determine its two target pairs, i.e. data[i]+k and data[i]-k
Run a binary search on the sorted array for these two targets; if found, add both data[i] and data[targetPos] to the output.
Sorting is done in O(n*log n). Each of the n search steps takes 2 * log n time to look for the targets, for an overall time of O(n*log n).
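A Python sketch of this approach using the bisect module (for simplicity it searches only for data[i]+k, which is enough when k > 0, and it assumes distinct values):
import bisect

def pairs_with_difference(nums, k):
    data = sorted(nums)                       # O(n log n)
    pairs = []
    for x in data:                            # n binary searches: O(n log n)
        target = x + k
        j = bisect.bisect_left(data, target)
        if j < len(data) and data[j] == target and target != x:
            pairs.append((x, target))
    return pairs

print(pairs_with_difference([1, 7, 5, 9, 2, 12, 3], 2))
# [(1, 3), (3, 5), (5, 7), (7, 9)]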
For this problem a linear solution exists! Just ask yourself one question: if you have a value a, what number should also be in the array? Of course a+k or a-k (a special case: k = 0 requires an alternative solution). So, what now?
Create a hash set (for example unordered_set in C++11) with all values from the array. That is O(1) average per element, so O(n) in total.
Iterate through the array and check for each element x whether (x+k) or (x-k) is present in the set. Each lookup is O(1) and each element is checked once, so this step is also linear (O(n)).
If you found an x with a pair (x+k or x-k), that is what you are looking for.
So it's linear (O(n)). If you really want O(n lg n), use a tree-based set with O(lg n) lookups; then you have an O(n lg n) algorithm.
Addendum: there is no need to check both x+k and x-k; just x+k is sufficient, because if a and b are a good pair then:
if a < b then
a + k == b
else
b + k == a
Improvement: if you know the value range, you can guarantee linear worst-case complexity by using a boolean table (set_tab[i] == true when i is in the array).
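A Python sketch of the hash-set approach (using a built-in set in place of unordered_set; the names are mine):
def pairs_with_difference(nums, k):
    values = set(nums)                        # O(n) expected to build
    if k == 0:
        # Special case mentioned above: with k = 0, look for duplicates instead.
        seen, dups = set(), set()
        for x in nums:
            if x in seen:
                dups.add(x)
            seen.add(x)
        return [(x, x) for x in dups]
    # One O(1) average lookup per value, so O(n) overall.
    return [(x, x + k) for x in values if x + k in values]

print(sorted(pairs_with_difference([1, 7, 5, 9, 2, 12, 3], 2)))
# [(1, 3), (3, 5), (5, 7), (7, 9)]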
Solution similar to one above:
1) Sort the array.
2) Set variables i = 0 and j = 1.
3) Check the difference between array[i] and array[j].
4) If the difference is too small, increase j; if the difference is too big, increase i; if the difference is the one you're looking for, add the pair to the results and increase j.
5) Repeat steps 3 and 4 until the end of the array.
Sorting is O(n*lg n), the next step is, if I'm correct, O(n) (at most 2*n comparisons), so the whole algorithm is O(n*lg n)
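A Python sketch of this two-pointer scan, assuming k > 0:
def pairs_with_difference(nums, k):
    data = sorted(nums)                        # O(n log n)
    i, j, pairs = 0, 1, []
    while j < len(data):                       # O(n): i and j only move forward
        diff = data[j] - data[i]
        if diff == k:
            pairs.append((data[i], data[j]))   # found a pair, move j forward
            j += 1
        elif diff < k:                         # difference too small: increase j
            j += 1
        else:                                  # difference too big: increase i
            i += 1
    return pairs

print(pairs_with_difference([1, 7, 5, 9, 2, 12, 3], 2))
# [(1, 3), (3, 5), (5, 7), (7, 9)]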

Finding sub-array sum in an integer array

Given an array of n positive integers, there are n*(n+1)/2 sub-arrays, including single-element sub-arrays. Each sub-array has a sum S. Finding S for all sub-arrays is obviously O(n^2), as the number of sub-arrays is O(n^2). Many of the sums may also be repeated. Is there any way to find the count of all distinct sums (not the sums themselves, only their count) in O(n log n)?
I tried an approach but got stuck along the way. I iterate the array from index 1 to n.
Say a[i] is the given array. For each index i, a[i] adds to all the sums that end at a[i-1], and it also contributes on its own as a single element. But a duplicate emerges if, among the sums ending at a[i-1], two of them differ by a[i]: say sums Sp and Sq both end at a[i-1] and Sq - Sp = a[i]; then Sp + a[i] equals Sq, giving Sq as a duplicate.
Say C[i] is the count of distinct sums that end at a[i].
So C[i] = C[i-1] + 1 - (number of pairs of sums ending at a[i-1] whose difference is a[i]).
But the problem is computing that number of pairs in O(log n). Please give me a hint about this, or, if I am on the wrong track and a completely different approach is required, point that out.
When S is not too large, we can count the distinct sums with one (fast) polynomial multiplication. When S is larger, N is hopefully small enough to use a quadratic algorithm.
Let x_1, x_2, ..., x_n be the array elements. Let y_0 = 0 and y_i = x_1 + x_2 + ... + x_i. Let P(z) = z^{y_0} + z^{y_1} + ... + z^{y_n}. Compute the product of polynomials P(z) * P(z^{-1}); the coefficient of z^k with k > 0 is nonzero if and only if k is a sub-array sum, so we just have to read off the number of nonzero coefficients of positive powers. The powers of z, moreover, range from -S to S, so the multiplication takes time on the order of S log S.
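Here is a rough Python/numpy sketch of this idea, assuming the elements are positive integers so the prefix sums can index a coefficient array (the padding and rounding details are mine):
import numpy as np

def count_distinct_subarray_sums(xs):
    # Prefix sums y_0 = 0, y_i = x_1 + ... + x_i.
    y = np.concatenate(([0], np.cumsum(xs)))
    S = int(y[-1])
    # Coefficient vector of P(z): A[v] = 1 iff v is a prefix sum.
    A = np.zeros(S + 1)
    A[y.astype(int)] = 1
    # P(z) * P(1/z) is the correlation of A with itself: convolve A with its
    # reverse, so index S + k of the product holds the coefficient of z^k.
    size = 1
    while size < 2 * S + 1:
        size *= 2
    prod = np.fft.irfft(np.fft.rfft(A, size) * np.fft.rfft(A[::-1], size), size)
    # k ranges over 1..S; the sum k is attainable iff its coefficient is nonzero.
    coeffs = prod[S + 1:2 * S + 1]
    return int(np.count_nonzero(coeffs > 0.5))

print(count_distinct_subarray_sums([1, 2, 1]))  # distinct sums {1, 2, 3, 4} -> 4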
You can look at the sub-arrays as a kind of tree, in the sense that subarray [0,3] can be divided into [0,1] and [2,3].
So build up a tree, where nodes are defined by the length of the subarray and its starting offset in the original array, and whenever you compute a subarray, store the result in this tree.
When computing a sub-array, you can check this tree for existing pre-computed values.
Also, when dividing, parts of the array can be computed on different CPU cores, if that matters.
This solution assumes that you don't need all values at once, rather ad-hoc.
For the former, there could be some smarter solution.
Also, I assume that we're talking about element counts in the 10000s and more. Otherwise, such work is a nice exercise but has not much practical value.

Selection i'th smallest number algorithm

I'm reading Introduction to Algorithms book, second edition, the chapter about Medians and Order statistics. And I have a few questions about randomized and non-randomized selection algorithms.
The problem:
Given an unordered array of integers, find i'th smallest element in the array
a. The Randomized_Select algorithm is simple, but I cannot understand the math that explains its running time. Is it possible to explain it without deep math, in a more intuitive way? To me, it seems it should run in O(n log n) on average and O(n^2) in the worst case, just like quicksort. On average, randomizedPartition returns something near the middle of the array, the array is divided in two on each call, and the next recursive call processes only half of the array. RandomizedPartition costs (p-r+1) <= n, so we get O(n log n). In the worst case it would choose the maximum element every time and divide the array into parts of size (n-1) and 0 at each step; that's O(n^2).
The next one (the Select algorithm) is even harder to understand than the previous one:
b. What is the difference compared to the previous one? Is it faster on average?
c. The algorithm consists of five steps. In the first one we divide the array into n/5 parts, each with 5 elements (besides the last one). Then each part is sorted using insertion sort and we select the 3rd element (the median) of each. Because these elements are sorted, we can be sure that the first two are <= this pivot element and the last two are >= it. Then we need to select the median among these medians. The book states that we recursively call the Select algorithm on these medians. How can we do that? In the Select algorithm we use insertion sort, and if we swap two medians, do we need to swap all four (or even more, at deeper levels) elements that are "children" of each median? Or do we create a new array that contains only the previously selected medians and search for the median among them? If so, how do we place them back into the original array, given that we changed their order?
The other steps are pretty simple and look like in the randomized_partition algorithm.
The randomized select runs in O(n) expected time. Look at this analysis.
Algorithm:
Randomly choose an element xj
Split the set into a "lower than" set L and a "bigger than" set B
If the size of the "lower than" set is i-1, we found it
If its size is bigger, then look up in L
Otherwise look up in B
The total cost is the sum of:
The cost of splitting the array of size n
The cost of the lookup in L or the cost of the lookup in B
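A compact Python sketch of these steps (expected O(n); the helper name is mine):
import random

def randomized_select(a, i):
    # Return the i-th smallest element (1-indexed) of a.
    x = random.choice(a)                      # randomly choose an element
    L = [v for v in a if v < x]               # the "lower than" set
    B = [v for v in a if v > x]               # the "bigger than" set
    if i <= len(L):                           # the answer lies in L
        return randomized_select(L, i)
    if i > len(a) - len(B):                   # the answer lies in B: shift the rank
        return randomized_select(B, i - (len(a) - len(B)))
    return x                                  # otherwise x itself is the answer

print(randomized_select([7, 2, 9, 4, 1, 5], 3))  # -> 4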
You can notice that:
We always recurse into the set with the greater number of elements
The number of elements in that set is n - rank(xj)
1 <= rank(xj) <= n, so 0 <= n - rank(xj) <= n - 1
The randomness of the element xj directly affects the randomness of the number of elements which are greater than xj (and which are smaller than xj)
If xj is the element chosen, then you know that the cost is O(n) + cost(n - rank(xj)). Let's call rank(xj) = rj.
To get a good estimate we need to take the expected value of the total cost, which is
T(n) = E(cost) = sum over each possible xj of p(xj) * (O(n) + T(n - rank(xj)))
xj is chosen uniformly at random, so p(xj) = 1/n. After this it is pure math.
We obtain:
T(n) = 1/n * ( O(n) + sum {all possible values of rj when we continue} (O(n) + T(n - rj)) )
T(n) = 1/n * ( O(n) + sum {1 <= rj <= n, rj != i} (O(n) + T(n - rj)) )
Here you can change variable, vj = n - rj:
T(n) = 1/n * ( O(n) + sum {0 <= vj <= n-1, vj != n-i} (O(n) + T(vj)) )
Pull the O(n) terms out of the sum; the n-1 of them together contribute O(n^2):
T(n) = 1/n * ( O(n) + O(n^2) + sum {0 <= vj <= n-1, vj != n-i} T(vj) )
Divide the O(n) and O(n^2) terms by n:
T(n) = O(1) + O(n) + 1/n * ( sum {0 <= vj <= n-1, vj != n-i} T(vj) )
Check the link on how this is computed.
For the non-randomized version :
You say yourself:
In avg randomizedPartition returns near middle of the array.
That is exactly why the randomized algorithm works, and that is exactly what is used to construct the deterministic algorithm. Ideally you would want to pick the pivot deterministically so that it produces a good split, but the best value for a good split is already the solution! So at each step you want a value that is merely good enough: "at least 3/10 of the array below the pivot and at least 3/10 of the array above". To achieve this, the original array is split into groups of 5 at each step, and again this is a mathematical choice.
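To illustrate point c of the question: the medians are copied into a new, smaller array and Select is called recursively on that array just to obtain the pivot; nothing needs to be moved back into the original array. A rough Python sketch (not the book's pseudocode; the details are mine):
def select(a, i):
    # Return the i-th smallest element (1-indexed) of a in worst-case O(n).
    if len(a) <= 5:
        return sorted(a)[i - 1]
    # Split into groups of 5 and take the median of each sorted group.
    # The medians go into a NEW array; the original array is left untouched.
    medians = [sorted(a[j:j + 5])[len(a[j:j + 5]) // 2] for j in range(0, len(a), 5)]
    # Recursively select the median of the medians as the pivot.
    pivot = select(medians, (len(medians) + 1) // 2)
    # Partition around the pivot, as in randomized partition, and recurse once.
    L = [v for v in a if v < pivot]
    B = [v for v in a if v > pivot]
    if i <= len(L):
        return select(L, i)
    if i > len(a) - len(B):
        return select(B, i - (len(a) - len(B)))
    return pivot

print(select([9, 1, 8, 2, 7, 3, 6, 4, 5, 0, 11, 10], 4))  # -> 3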
I once created an explanation for this (with diagram) on the Wikipedia page for it... http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm

Resources