Fixing this faulty Bingo Sort implementation - sorting

While studying Selection Sort, I came across a variation known as Bingo Sort. According to this dictionary entry here, Bingo Sort is:
A variant of selection sort that orders items by first finding the least value, then repeatedly moving all items with that value to their final location and find the least value for the next pass.
Based on the definition above, I came up with the following implementation in Python:
def bingo_sort(array, ascending=True):
    from operator import lt, gt

    def comp(x, y, func):
        return func(x, y)

    i = 0
    while i < len(array):
        min_value = array[i]
        j = i + 1
        for k in range(i + 1, len(array), 1):
            if comp(array[k], min_value, (lt if ascending else gt)):
                min_value = array[k]
                array[i], array[k] = array[k], array[i]
            elif array[k] == min_value:
                array[j], array[k] = array[k], array[j]
                j += 1
        i = j
    return array
I know that this implementation is problematic. When I run the algorithm on an extremely small array, I get a correctly sorted array. However, running the algorithm with a larger array results in an array that is mostly sorted with incorrect placements here and there. To replicate the issue in Python, the algorithm can be run on the following input:
from random import randint, uniform

test_data = [[randint(0, 101) for i in range(0, 101)],
             [uniform(0, 101) for i in range(0, 101)],
             ["a", "aa", "aaaaaa", "aa", "aaa"],
             [5, 5.6],
             [3, 2, 4, 1, 5, 6, 7, 8, 9]]
for dataset in test_data:
    print(dataset)
    print(bingo_sort(dataset, ascending=True))
    print("\n")
I cannot for the life of me realize where the fault is at since I've been looking at this algorithm too long and I am not really proficient at these things. I could not find an implementation of Bingo Sort online except an undergraduate graduation project written in 2020. Any help that can point me in the right direction would be greatly appreciated.

I think your main problem is that you're trying to set min_value in your first conditional statement and then to swap based on that same min_value you've just set in your second conditional statement. These processes are supposed to be staggered: the way bingo sort should work is you find the min_value in one iteration, and in the next iteration you swap all instances of that min_value to the front while also finding the next min_value for the following iteration. In this way, min_value should only get changed at the end of every iteration, not during it. When you change the value you're swapping to the front over the course of a given iteration, you can end up unintentionally shuffling things a bit.
I have an implementation of this below if you want to refer to something, with a few notes: since you're allowing a custom comparator, I renamed min_value to swap_value as we're not always grabbing the min, and I modified how the comparator is defined/passed into the function to make the algorithm more flexible. Also, you don't really need three indexes (I think there were even a couple bugs here), so I collapsed i and j into swap_idx, and renamed k to cur_idx. Finally, because of how swapping a given swap_val and finding the next_swap_val is to be staggered, you need to find the initial swap_val up front. I'm using a reduce statement for that, but you could just use another loop over the whole array there; they're equivalent. Here's the code:
from operator import lt, gt
from functools import reduce

def bingo_sort(array, comp=lt):
    if len(array) <= 1:
        return array
    # get the initial swap value as determined by comp
    swap_val = reduce(lambda val, cur: cur if comp(cur, val) else val, array)
    swap_idx = 0  # set the initial swap_idx to 0
    while swap_idx < len(array):
        cur_idx = swap_idx
        next_swap_val = array[cur_idx]
        while cur_idx < len(array):
            if comp(array[cur_idx], next_swap_val):  # find next swap value
                next_swap_val = array[cur_idx]
            if array[cur_idx] == swap_val:  # swap swap_vals to front of the array
                array[swap_idx], array[cur_idx] = array[cur_idx], array[swap_idx]
                swap_idx += 1
            cur_idx += 1
        swap_val = next_swap_val
    return array
In general, the complexity of this algorithm depends on how many duplicate values get processed, and when they get processed. This is because every time k duplicate values get processed during a given iteration, the length of the inner loop is decreased by k for all subsequent iterations. Performance is therefore optimized when large clusters of duplicate values are processed early on (as when the smallest values of the array contain many duplicates). From this, there are basically two ways you could analyze the complexity of the algorithm: You could analyze it in terms of where the duplicate values tend to appear in the final sorted array (Type 1), or you could assume the clusters of duplicate values are randomly distributed through the sorted array and analyze complexity in terms of the average size of duplicate clusters (that is, in terms of the magnitude of m relative to n: Type 2).
The definition you linked uses the first type of analysis (based on where duplicates tend to appear) to derive best = Theta(n+m^2), average = Theta(nm), worst = Theta(nm). The second type of analysis produces best = Theta(n), average = Theta(nm), worst = Theta(n^2) as you vary m from Theta(1) to Theta(m) to Theta(n).
In the best Type 1 case, all duplicates will be among the smallest elements of the array, such that the run-time of the inner loop quickly decreases to O(m), and the final iterations of the algorithm proceed as an O(m^2) selection sort. However, there is still the up-front O(n) pass to select the initial swap value, so the overall complexity is O(n + m^2).
In the worst Type 1 case, all duplicates will be among the largest elements of the array. The length of the inner loop isn't substantially shortened until the last iterations of the algorithm, such that we achieve a run-time looking something like n + (n-1) + (n-2) + ... + (n-m). This is a sum of roughly m terms that are each O(n), giving us O(nm) total run-time.
In the average Type 1 case (and for all Type 2 cases), we don't assume that the clusters of duplicate values are biased towards the front or back of the sorted array. We take it that the m clusters of duplicate values are randomly distributed through the array in terms of their position and their size. Under this analysis, we expect that after the initial O(n) pass to find the first swap value, each of the m iterations of the outer loop reduce the length of the inner loop by approximately n/m. This leads to an expression of the overall run-time for unknown m and randomly distributed data as:
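As a sketch of the sum this description implies (a reconstruction under the stated assumption that each of the m outer iterations shortens the inner loop by about n/m; the symbol T(n, m) is introduced here only for illustration):

```latex
T(n, m) \;=\; \underbrace{n}_{\text{initial pass}} \;+\; \sum_{i=0}^{m-1}\left(n - i\,\frac{n}{m}\right)
\;=\; n + nm - \frac{n}{m}\cdot\frac{m(m-1)}{2}
\;=\; n + \frac{n(m+1)}{2}
\;=\; \Theta(nm)
```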
We can use this expression for the average case run-time with randomly distributed data and unknown m, Theta(nm), as the average Type 2 run-time, and it also directly gives us the best and worst case run-times based on how we might vary the magnitude of m.
In the best Type 2 case, m might just be some constant value independent of n. If we have m = Theta(1) randomly distributed duplicate clusters, the best case run-time is then Theta(n * Theta(1)) = Theta(n). For example, you would see O(2n) = O(n) performance from bingo sort with just one unique value (one pass to find the value, one pass to swap every single value to the front), and this O(n) asymptotic complexity still holds if m is bounded by any constant.
However in the worst Type 2 case we could have m=Theta(n), and bingo sort essentially devolves into O(n^2) selection sort. This is clearly the case for m = n, but if the amount the inner-loop's run-time is expected to decrease by with each iteration, n/m, is any constant value, which is the case for any m value in Theta(n), we still see O(n^2) complexity.

Related

Sample number with equal probability which is not part of a set

I have a number n and a set of numbers S ∈ [1..n]* with size s (which is substantially smaller than n). I want to sample a number k ∈ [1..n] with equal probability, but the number is not allowed to be in the set S.
I am trying to solve the problem in at worst O(log n + s). I am not sure whether it's possible.
A naive approach is creating an array of numbers from 1 to n excluding all numbers in S and then pick one array element. This will run in O(n) and is not an option.
Another approach may be just generating random numbers ∈[1..n] and rejecting them if they are contained in S. This has no theoretical bound as any number could be sampled multiple times even if it is in the set. But on average this might be a practical solution if s is substantially smaller than n.
Say s is sorted. Generate a random number between 1 and n-s, call it k. We've chosen the k'th element of {1,...,n} - s. Now we need to find it.
Use binary search on s to find the count of the elements of s <= k. This takes O(log |s|). Add this to k. In doing so, we may have passed or arrived at additional elements of s. We can adjust for this by incrementing our answer for each such element that we pass, which we find by checking the next larger element of s from the point we found in our binary search.
E.g., n = 100, s = {1,4,5,22}, and our random number is 3. So our approach should return the third element of [2,3,6,7,...,21,23,24,...,100], which is 6. Binary search finds that 1 element is at most 3, so we increment to 4. Now we compare to the next larger element of s, which is 4, so we increment to 5. Repeating this finds 5 in s, so we increment to 6. We check s once more, see that 6 isn't in it, so we stop.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100] which is 7. Binary search finds that 2 elements are at most 4, so we increment to 6. Now we compare to the next larger element of s which is 5 so increment to 7. We check s once more, see that the next number is > 7, so we stop.
If we assume that "s is substantially smaller than n" means |s| <= log(n), then we will increment at most log(n) times, and in any case at most s times.
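The sorted-set lookup described above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the names `kth_allowed` and `sample_excluding_sorted` are made up for illustration, and it assumes the forbidden values are distinct, sorted in ascending order, and all lie in 1..n.

```python
import bisect
import random

def kth_allowed(k, forbidden_sorted):
    """Return the k-th smallest positive integer (1-based) not in forbidden_sorted."""
    # Count the forbidden values <= k with one binary search.
    idx = bisect.bisect_right(forbidden_sorted, k)
    answer = k + idx
    # Each forbidden value we pass or land on pushes the answer one step further.
    while idx < len(forbidden_sorted) and forbidden_sorted[idx] <= answer:
        answer += 1
        idx += 1
    return answer

def sample_excluding_sorted(n, forbidden_sorted):
    """Uniformly sample from 1..n, skipping the sorted forbidden values."""
    k = random.randint(1, n - len(forbidden_sorted))
    return kth_allowed(k, forbidden_sorted)
```

With forbidden values [1, 4, 5, 22], `kth_allowed(3, ...)` gives 6 and `kth_allowed(4, ...)` gives 7, matching the two worked examples.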
If s is not sorted then we can do the following. Create an array of bits of size s. Generate k. Parse s and do two things: 1) count the number of elements < k, call this r; and 2) set the i'th bit to 1 if k+i is in s (0-indexed, so if k is in s then the first bit is set).
Now, increment k a number of times equal to r plus the number of set bits in the array with an index <= the number of times incremented.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100] which is 7. We parse s and 1) note that 1 element is below 4 (r=1), and 2) set our array to [1, 1, 0, 0]. We increment once for r=1 and an additional two times for the two set bits, ending up at 7.
This is O(s) time, O(s) space.
This is an O(1) solution with O(s) initial setup that works by mapping each non-allowed number > s to an allowed number <= s.
Let S be the set of non-allowed values, S(i), where i = [1 .. s] and s = |S|.
Here's a two part algorithm. The first part constructs a hash table based only on S in O(s) time, the second part finds the random value k ∈ {1..n}, k ∉ S in O(1) time, assuming we can generate a uniform random number in a contiguous range in constant time. The hash table can be reused for new random values and also for new n (assuming S ⊂ { 1 .. n } still holds of course).
To construct the hash H, first set j = 1. Then iterate over S(i), the elements of S. They do not need to be sorted. If S(i) > s, increment j until j ∉ S, add the key-value pair (S(i), j) to the hash table, and finally increment j.
To find a random value k, first generate a uniform random value k in the range s + 1 to n, inclusive. If k is a key in H, then set k = H(k). That is, we do at most one hash lookup to ensure k is not in S.
Python code to generate the hash:
def substitute(S):
    H = dict()
    j = 1
    for s in S:
        if s > len(S):
            while j in S: j += 1
            H[s] = j
            j += 1
    return H
For the actual implementation to be O(s), one might need to convert S into something like a frozenset to ensure the test for membership is O(1), and also move the loop-invariant len(S) out of the loop. Assuming the j in S test and the insertion into the hash (H[s] = j) are constant time, this should have complexity O(s).
The generation of a random value is simply:
import random

def myrand(n, s, H):
    k = random.randint(s + 1, n)
    return (H[k] if k in H else k)
If one is only interested in a single random value per S, then the algorithm can be optimized to improve the common case, while the worst case remains the same. This still requires S be in a hash table that allows for a constant time "element of" test.
def rand_not_in(n, S):
    k = random.randint(len(S) + 1, n)
    if k not in S: return k
    j = 1
    for s in S:
        if s > len(S):
            while j in S: j += 1
            if s == k: return j
            j += 1
Optimizations are: Only generate the mapping if the random value is in S. Don't save the mapping to a hash table. Short-circuit the mapping generation when the random value is found.
Actually, the rejection method seems like the practical approach.
Generate a number in 1...n and check whether it is forbidden; regenerate until the generated number is not forbidden.
The probability of a single rejection is p = s/n.
Thus the expected number of random number generations is 1 + p + p^2 + p^3 + ... which is 1/(1-p), which in turn is equal to n/(n-s).
Now, if s is much less than n, or even as large as s = n/2, this expected number is at most 2.
It would take s almost equal to n to make it infeasible in practice.
Multiply the expected time by log s if you use a tree-set to check whether the number is in the set, or by just 1 (expected value again) if it is a hash-set. So the average time is O(1) or O(log s) depending on the set implementation. There is also O(s) memory for storing the set, but unless the set is given in some special way, implicitly and concisely, I don't see how it can be avoided.
(Edit: As per comments, you do this only once for a given set.
If, additionally, we are out of luck, and the set is given as a plain array or list, not some fancier data structure, we get O(s) expected time with this approach, which still fits into the O(log n + s) requirement.)
If attacks against the unbounded algorithm are a concern (and only if they truly are), the method can include a fall-back algorithm for the cases when a certain fixed number of iterations didn't provide the answer.
Similarly to how IntroSort is QuickSort but falls back to HeapSort if the recursion depth gets too high (which is almost certainly a result of an attack resulting in quadratic QuickSort behavior).
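A minimal sketch of the plain rejection method, without the fall-back, might look like this. The function name `sample_by_rejection` is made up for illustration, and the sketch assumes the forbidden values all lie in 1..n.

```python
import random

def sample_by_rejection(n, forbidden):
    """Uniformly sample from 1..n, retrying whenever a forbidden value is drawn."""
    forbidden = set(forbidden)  # O(s) once; membership tests are O(1) on average
    if len(forbidden) >= n:
        # Assumes forbidden is a subset of 1..n, so nothing would be left to sample.
        raise ValueError("no allowed value remains in 1..n")
    while True:
        k = random.randint(1, n)
        if k not in forbidden:
            return k
```

Converting to a set up front is what keeps a plain list input within the O(s) expected time mentioned above.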
Find all numbers that are in the forbidden set and less than or equal to n-s. Call this array A.
Find all numbers that are not in the forbidden set and greater than n-s. Call this array B. This can be done in O(s) if the set is sorted.
Note that the lengths of A and B are equal, and create the mapping map[A[i]] = B[i].
Generate a number t in the range 1 to n-s. If map[t] exists, return it; otherwise return t.
This takes O(s) insertions into a map plus one lookup, which is either O(s) on average or O(s log s), depending on the map implementation.
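The A/B mapping could be sketched as below. The names `build_tail_map` and `sample_with_map` are hypothetical, and the sketch assumes the forbidden values are distinct and lie in 1..n; it sorts A for determinism, although any pairing of A with B would do.

```python
import random

def build_tail_map(n, forbidden):
    """Map each forbidden value <= n-s to a distinct allowed value > n-s."""
    forbidden = set(forbidden)
    s = len(forbidden)
    head = sorted(x for x in forbidden if x <= n - s)                   # array A
    tail = [x for x in range(n - s + 1, n + 1) if x not in forbidden]  # array B
    return dict(zip(head, tail))  # A and B have equal length

def sample_with_map(n, s, mapping):
    """Draw t from 1..n-s and redirect it if it is forbidden."""
    t = random.randint(1, n - s)
    return mapping.get(t, t)
```

The map only has to be built once per forbidden set; each sample afterwards costs one random draw and one lookup.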

Online algorithm for random permutation of N integers

Imagine a standard permute function that takes an integer and returns a vector of the first N natural numbers in a random permutation. If you only need k (<= N) of them, but don't know k beforehand, do you still have to perform a O(N) generation of the permutation? Is there a better algorithm than:
for x in permute(N):
    if f(x):
        break
I'm imagining an API such as:
p = permuter(N)
for x = p.next():
    if f(x):
        break
where the initialization is O(1) (including memory allocation).
This question is often viewed as a choice between two competing algorithms:
Strategy FY: A variation on the Fisher-Yates shuffle where one shuffle step is performed for each desired number, and
Strategy HT: Keep all generated numbers in a hash table. At each step, random numbers are produced until a number which is not in the hash table is found.
The choice is performed depending on the relationship between k and N: if k is sufficiently large, the strategy FY is used; otherwise, strategy HT. The argument is that if k is small relative to n, maintaining an array of size n is a waste of space, as well as producing a large initialization cost. On the other hand, as k approaches n more and more random numbers need to be discarded, and towards the end producing new values will be extremely slow.
Of course, you might not know in advance the number of samples which will be requested. In that case, you might pessimistically opt for FY, or optimistically opt for HT, and hope for the best.
In fact, there is no real need for trade-off, because the FY algorithm can be implemented efficiently with a hash table. There is no need to initialize an array of N integers. Instead, the hash-table is used to store only the elements of the array whose values do not correspond with their indices.
(The following description uses 1-based indexing; that seemed to be what the question was looking for. Hopefully it is not full of off-by-one errors. So it generates numbers in the range [1, N]. From here on, I use k for the number of samples which have been requested to date, rather than the number which will eventually be requested.)
At each point in the incremental FY algorithm a single index r is chosen at random from the range [k, N]. Then the values at indices k and r are swapped, after which k is incremented for the next iteration.
As an efficiency point, note that we don't really need to do the swap: we simply yield the value at r and then set the value at r to be the value at k. We'll never again look at the value at index k so there is no point updating it.
Initially, we simulate the array with a hash table. To look up the value at index i in the (virtual) array, we see if i is present in the hash table: if so, that's the value at index i. Otherwise the value at index i is i itself. We start with an empty hash table (which saves initialization costs), which represents an array whose value at every index is the index itself.
To do the FY iteration, for each sample index k we generate a random index r as above, yield the value at that index, and then set the value at index r to the value at index k. That's exactly the procedure described above for FY, except for the way we look up values.
This requires exactly two hash-table lookups, one insertion (at an already looked-up index, which in theory can be done more quickly), and one random number generation for each iteration. That's one more lookup than strategy HT's best case, but we have a bit of a saving because we never need to loop to produce a value. (There is another small potential saving when we rehash because we can drop any keys smaller than the current value of k.)
As the algorithm proceeds, the hash table will grow; a standard exponential rehashing strategy is used. At some point, the hash table will reach the size of a vector of N-k integers. (Because of hash table overhead, this point will be reached at a value of k much less than N, but even if there were no overhead this threshold would be reached at N/2.) At that point, instead of rehashing, the hash is used to create the tail of the now non-virtual array, a procedure which takes less time than a rehash and never needs to be repeated; remaining samples will be selected using the standard incremental FY algorithm.
This solution is slightly slower than FY if k eventually reaches the threshold point, and it is slightly slower than HT if k never gets big enough for random numbers to be rejected. But it is not much slower in either case, and it never suffers from pathological slowdown when k has an awkward value.
In case that was not clear, here is a rough Python implementation:
from random import randint

def sampler(N):
    k = 1
    # First phase: Use the hash
    diffs = {}
    # Only do this until the hash table is smallish (See note)
    while k < N // 4:
        r = randint(k, N)
        yield diffs[r] if r in diffs else r
        diffs[r] = diffs[k] if k in diffs else k
        k += 1
    # Second phase: Create the vector, ignoring keys less than k
    vbase = k
    v = list(range(vbase, N+1))
    for i, s in diffs.items():
        if i >= vbase:
            v[i - vbase] = s
    del diffs
    # Now we can generate samples until we hit N
    while k <= N:
        r = randint(k, N)
        rv = v[r - vbase]
        v[r - vbase] = v[k - vbase]
        yield rv
        k += 1
Note: N // 4 is probably pessimistic; computing the correct value would require knowing too much about hash-table implementation. If I really cared about speed, I'd write my own hash table implementation in a compiled language, and then I'd know :)

Find "important" entries in a sorted log

I have a log file consisting of several thousand integers, each separated onto a new line. I've parsed this into an array of such integers, also sorted. Now my issue becomes finding the "important" integers from this log--these are ones that show up some user-configurable portion of the time.
For example, given the log, the user can filter to only see entries that appear a certain scaled number of times.
Currently I'm scanning the whole array and keeping count of the number of times each entry appears. Surely there is a better method?
First, I need to note that the following is just a theoretical solution, and you should probably use what is proposed by @MBo.
Take every m-th element of the sorted array, where m = n / l. Only those elements can be important, as no run of m identical elements can fit strictly between positions i*m and (i+1)*m.
For each element x, find with binary search its lower bound and upper bound in the array. Subtracting indexes, you can know count, and decide to keep or discard x as unimportant.
Total complexity would be O((n/m) * log n) = O(l * log n). For large m it could be (asymptotically) better than O(n). To get an improvement in practice, however, you need very specific circumstances:
Array is given to you presorted (otherwise just use counting sort and you get an answer immediately)
You can access i-th element of the array in O(1) without reading the whole array. Otherwise, again, use counting sort with hash table.
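Let's assume you have a file "data.bin" consisting of sorted fixed-width integers (variable width is possible too, but requires some extra effort). Then, in pseudocode, the algorithm could look something like this (read_integer_at_offset and report are placeholders for reading the i-th integer from the file and emitting a result):
def find_all_important(l, n):
    m = n // l
    previous = None
    # sample every m-th element (0-based indices m-1, 2m-1, ...)
    for i in range(m - 1, n, m):
        x = read_integer_at_offset("data.bin", i)
        if x == previous:
            continue  # this value was already handled
        previous = x
        lower_bound = find_lower_bound(x, 0, i)
        upper_bound = find_upper_bound(x, i, n)
        if upper_bound - lower_bound >= m:
            report(x)
def find_lower_bound(x, begin, end):
    # first index in [begin, end] whose value is >= x
    if end - begin == 0:
        return begin
    mid = (end + begin) // 2
    value = read_integer_at_offset("data.bin", mid)
    if value < x:
        return find_lower_bound(x, mid + 1, end)
    else:
        return find_lower_bound(x, begin, mid)
def find_upper_bound(x, begin, end):
    # first index in [begin, end] whose value is > x
    if end - begin == 0:
        return begin
    mid = (end + begin) // 2
    value = read_integer_at_offset("data.bin", mid)
    if value <= x:
        return find_upper_bound(x, mid + 1, end)
    else:
        return find_upper_bound(x, begin, mid)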
As a guess, you will not gain any noticeable improvement compared to naive O(n) on modern hardware, unless your file is very large (hundreds of MBs). And of course it is viable if your data can't fit in RAM. But as always with optimization, it might be worth testing.
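For data that does fit in memory, the same sampling-plus-binary-search idea can be tried out with the standard `bisect` module. This is only a sketch: the function name `find_important` is made up, and it assumes "important" means appearing at least n // l times in a sorted list of n entries.

```python
from bisect import bisect_left, bisect_right

def find_important(data, l):
    """Return values occurring at least len(data) // l times in sorted data."""
    n = len(data)
    m = max(1, n // l)  # minimum run length that counts as important
    important = []
    # Any run of m equal values must cover one of the indices m-1, 2m-1, ...
    for i in range(m - 1, n, m):
        x = data[i]
        if important and important[-1] == x:
            continue  # already reported at an earlier sample point
        lo = bisect_left(data, x, 0, i + 1)   # first occurrence of x
        hi = bisect_right(data, x, i, n)      # one past the last occurrence
        if hi - lo >= m:
            important.append(x)
    return important
```

Only about l positions are probed, each with two binary searches, which is where the O(l * log n) bound comes from.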
Your sorting perhaps takes O(N log N) time. Do you need to make (n/l) queries many times for the same data set?
If yes, walk through the sorted array, make (Value; Count) pairs, and sort them by the Count field. Now you can easily separate out the pairs with high counts using binary search.
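One way this precomputation might look in Python is sketched below. The names `build_count_index` and `values_with_count_at_least` are hypothetical; the pairs are stored as (count, value) so that ordinary tuple ordering sorts them by count.

```python
from bisect import bisect_left
from itertools import groupby

def build_count_index(sorted_data):
    """Collapse sorted data into (count, value) pairs ordered by count."""
    return sorted((sum(1 for _ in group), value)
                  for value, group in groupby(sorted_data))

def values_with_count_at_least(pairs, threshold):
    """Binary-search the count-ordered pairs for counts >= threshold."""
    # The 1-tuple (threshold,) sorts before every (threshold, value) pair.
    start = bisect_left(pairs, (threshold,))
    return [value for _, value in pairs[start:]]
```

Building the index is a one-time cost; each later query with a different threshold is a single binary search plus the size of the output.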

How to find pair with kth largest sum?

Given two sorted arrays of numbers, we want to find the pair with the kth largest possible sum. (A pair is one element from the first array and one element from the second array). For example, with arrays
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
The pairs with largest sums are
13 + 16 = 29
13 + 12 = 25
8 + 16 = 24
13 + 8 = 21
5 + 16 = 21
So the pair with the 4th largest sum is (13, 8). How to find the pair with the kth largest possible sum?
Also, what is the fastest algorithm? The arrays are already sorted and sizes M and N.
I am already aware of the O(K log K) solution using a max-heap, given here.
It is also a favorite Google interview question, and they demand an O(k) solution.
I've also read somewhere that there exists an O(k) solution, which I am unable to figure out.
Can someone explain the correct solution with pseudocode?
P.S.
Please DON'T post this link as an answer/comment. It DOESN'T contain the answer.
I start with a simple but not quite linear-time algorithm. We choose some value between array1[0]+array2[0] and array1[N-1]+array2[N-1]. Then we determine how many pair sums are greater than this value and how many of them are less. This may be done by iterating the arrays with two pointers: pointer to the first array incremented when sum is too large and pointer to the second array decremented when sum is too small. Repeating this procedure for different values and using binary search (or one-sided binary search) we could find Kth largest sum in O(N log R) time, where N is size of the largest array and R is number of possible values between array1[N-1]+array2[N-1] and array1[0]+array2[0]. This algorithm has linear time complexity only when the array elements are integers bounded by small constant.
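A rough sketch of this first, not-quite-linear algorithm follows. The function names are made up for illustration, and the sketch assumes both arrays are non-empty, sorted in ascending order, hold integers, and that 1 <= k <= M*N.

```python
def count_pairs_greater(a, b, value):
    """Count pairs (x, y), x in a and y in b, with x + y > value, in O(M + N)."""
    count = 0
    j = len(b)  # b[j:] are the elements that pair with the current x to exceed value
    for x in a:  # ascending, so j only ever moves left
        while j > 0 and x + b[j - 1] > value:
            j -= 1
        count += len(b) - j
    return count

def kth_largest_sum(a, b, k):
    """Binary-search the value range for the k-th largest pair sum."""
    lo, hi = a[0] + b[0], a[-1] + b[-1]
    # Find the smallest value v such that fewer than k sums exceed v.
    while lo < hi:
        mid = (lo + hi) // 2
        if count_pairs_greater(a, b, mid) < k:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For the arrays in the question, `kth_largest_sum([2, 3, 5, 8, 13], [4, 8, 12, 16], 4)` returns 21. Each binary-search step costs one two-pointer sweep, which gives the O(N log R) bound.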
Previous algorithm may be improved if we stop binary search as soon as number of pair sums in binary search range decreases from O(N^2) to O(N). Then we fill auxiliary array with these pair sums (this may be done with slightly modified two-pointers algorithm). And then we use quickselect algorithm to find Kth largest sum in this auxiliary array. All this does not improve worst-case complexity because we still need O(log R) binary search steps. What if we keep the quickselect part of this algorithm but (to get proper value range) we use something better than binary search?
We could estimate value range with the following trick: get every second element from each array and try to find the pair sum with rank k/4 for these half-arrays (using the same algorithm recursively). Obviously this should give some approximation for needed value range. And in fact slightly improved variant of this trick gives range containing only O(N) elements. This is proven in following paper: "Selection in X + Y and matrices with sorted rows and columns" by A. Mirzaian and E. Arjomandi. This paper contains detailed explanation of the algorithm, proof, complexity analysis, and pseudo-code for all parts of the algorithm except Quickselect. If linear worst-case complexity is required, Quickselect may be augmented with Median of medians algorithm.
This algorithm has complexity O(N). If one of the arrays is shorter than other array (M < N) we could assume that this shorter array is extended to size N with some very small elements so that all calculations in the algorithm use size of the largest array. We don't actually need to extract pairs with these "added" elements and feed them to quickselect, which makes algorithm a little bit faster but does not improve asymptotic complexity.
If k < N we could ignore all the array elements with index greater than k. In this case complexity is equal to O(k). If N < k < N(N-1) we just have better complexity than requested in OP. If k > N(N-1), we'd better solve the opposite problem: k'th smallest sum.
I uploaded simple C++11 implementation to ideone. Code is not optimized and not thoroughly tested. I tried to make it as close as possible to pseudo-code in linked paper. This implementation uses std::nth_element, which allows linear complexity only on average (not worst-case).
A completely different approach to find K'th sum in linear time is based on priority queue (PQ). One variation is to insert largest pair to PQ, then repeatedly remove top of PQ and instead insert up to two pairs (one with decremented index in one array, other with decremented index in other array). And take some measures to prevent inserting duplicate pairs. Other variation is to insert all possible pairs containing largest element of first array, then repeatedly remove top of PQ and instead insert pair with decremented index in first array and same index in second array. In this case there is no need to bother about duplicates.
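The second variation (seed the queue with every pair that uses the largest element of the first array) might be sketched as follows. Note the assumptions: this sketch uses Python's binary heap `heapq` with negated sums rather than an O(1) priority queue, so it is not linear-time; the function name `k_largest_pairs` is made up; and both arrays are assumed sorted in ascending order.

```python
import heapq

def k_largest_pairs(a, b, k):
    """Return up to k pairs (x, y) in order of decreasing x + y."""
    if not a or not b:
        return []
    last = len(a) - 1
    # One heap entry per element of b, each paired with the largest element of a.
    heap = [(-(a[last] + y), last, j) for j, y in enumerate(b)]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, i, j = heapq.heappop(heap)
        result.append((a[i], b[j]))
        if i > 0:
            # Same index in b, next smaller element of a: no duplicates possible.
            heapq.heappush(heap, (-(a[i - 1] + b[j]), i - 1, j))
    return result
```

Because each index of the second array forms its own chain of decreasing sums, no pair is ever inserted twice, which is why this variation needs no duplicate check.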
OP mentions O(K log K) solution where PQ is implemented as max-heap. But in some cases (when array elements are evenly distributed integers with limited range and linear complexity is needed only on average, not worst-case) we could use O(1) time priority queue, for example, as described in this paper: "A Complexity O(1) Priority Queue for Event Driven Molecular Dynamics Simulations" by Gerald Paul. This allows O(K) expected time complexity.
Advantage of this approach is a possibility to provide first K elements in sorted order. Disadvantages are limited choice of array element type, more complex and slower algorithm, worse asymptotic complexity: O(K) > O(N).
EDIT: This does not work. I leave the answer, since apparently I am not the only one who could have this kind of idea; see the discussion below.
A counter-example is x = (2, 3, 6), y = (1, 4, 5) and k=3, where the algorithm gives 7 (3+4) instead of 8 (3+5).
Let x and y be the two arrays, sorted in decreasing order; we want to construct the K-th largest sum.
The variables are: i the index in the first array (element x[i]), j the index in the second array (element y[j]), and k the "order" of the sum (k in 1..K), in the sense that S(k)=x[i]+y[j] will be the k-th greater sum satisfying your conditions (this is the loop invariant).
Start from (i, j) equal to (0, 0): clearly, S(1) = x[0]+y[0].
for k from 1 to K-1, do:
    if x[i+1] + y[j] > x[i] + y[j+1], then i := i+1 (and j does not change); else j := j+1
To see that it works, consider you have S(k) = x[i] + y[j]. Then, S(k+1) is the greatest sum which is lower (or equal) to S(k), and such as at least one element (i or j) changes. It is not difficult to see that exactly one of i or j should change.
If i changes, the greater sum you can construct which is lower than S(k) is by setting i=i+1, because x is decreasing and all the x[i'] + y[j] with i' < i are greater than S(k). The same holds for j, showing that S(k+1) is either x[i+1] + y[j] or x[i] + y[j+1].
Therefore, at the end of the loop you found the K-th greater sum.
tl;dr: If you look ahead and look behind at each iteration, you can start with the end (which is highest) and work back in O(K) time.
Although the insight underlying this approach is, I believe, sound, the code below is not quite correct at present (see comments).
Let's see: first of all, the arrays are sorted. So, if the arrays are a and b with lengths M and N, and as you have arranged them, the largest items are in slots M and N respectively, the largest pair will always be a[M]+b[N].
Now, what's the second largest pair? It's going to have perhaps one of {a[M],b[N]} (it can't have both, because that's just the largest pair again), and at least one of {a[M-1],b[N-1]}. BUT, we also know that if we choose a[M-1]+b[N-1], we can make one of the operands larger by choosing the higher number from the same list, so it will have exactly one number from the last column, and one from the penultimate column.
Consider the following two arrays: a = [1, 2, 53]; b = [66, 67, 68]. Our highest pair is 53+68. If we lose the smaller of those two, our pair is 68+2; if we lose the larger, it's 53+67. So, we have to look ahead to decide what our next pair will be. The simplest lookahead strategy is simply to calculate the sum of both possible pairs. That will always cost two additions and two comparisons for each transition (three, because we need to deal with the case where the sums are equal); let's call that cost Q.
At first, I was tempted to repeat that K-1 times. BUT there's a hitch: the next largest pair might actually be the other pair we can validly make from {{a[M],b[N]}, {a[M-1],b[N-1]}}. So, we also need to look behind.
So, let's code (python, should be 2/3 compatible):
def kth(a,b,k):
    M = len(a)
    N = len(b)
    if k > M*N:
        raise ValueError("There are only %s possible pairs; you asked for the %sth largest, which is impossible" % (M*N,k))
    (ia,ib) = M-1,N-1 #0 based arrays
    # we need this for lookback
    nottakenindices = (0,0) # could be any value
    nottakensum = float('-inf')
    for i in range(k-1):
        optionone = a[ia]+b[ib-1]
        optiontwo = a[ia-1]+b[ib]
        biggest = max((optionone,optiontwo))
        #first deal with look behind
        if nottakensum > biggest:
            if optionone == biggest:
                newnottakenindices = (ia,ib-1)
            else: newnottakenindices = (ia-1,ib)
            ia,ib = nottakenindices
            nottakensum = biggest
            nottakenindices = newnottakenindices
        #deal with case where indices hit 0
        elif ia <= 0 and ib <= 0:
            ia = ib = 0
        elif ia <= 0:
            ib-=1
            ia = 0
            nottakensum = float('-inf')
        elif ib <= 0:
            ia-=1
            ib = 0
            nottakensum = float('-inf')
        #lookahead cases
        elif optionone > optiontwo:
            #then choose the first option as our next pair
            nottakensum,nottakenindices = optiontwo,(ia-1,ib)
            ib-=1
        elif optionone < optiontwo: # choose the second
            nottakensum,nottakenindices = optionone,(ia,ib-1)
            ia-=1
        #next two cases apply if options are equal
        elif a[ia] > b[ib]:# drop the smallest
            nottakensum,nottakenindices = optiontwo,(ia-1,ib)
            ib-=1
        else: # might be equal or not - we can choose arbitrarily if equal
            nottakensum,nottakenindices = optionone,(ia,ib-1)
            ia-=1
        #+2 - one for zero-based, one for skipping the 1st largest
        data = (i+2,a[ia],b[ib],a[ia]+b[ib],ia,ib)
        narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
        print (narrative) #this will work in both versions of python
        if ia <= 0 and ib <= 0:
            raise ValueError("Both arrays exhausted before Kth (%sth) pair reached"%data[0])
    return data, narrative
For those without python, here's an ideone: http://ideone.com/tfm2MA
At worst, we have 5 comparisons in each iteration, and K-1 iterations, which means that this is an O(K) algorithm.
Now, it might be possible to exploit information about differences between values to optimise this a little bit, but this accomplishes the goal.
Here's a reference implementation (not O(K), but will always work, unless there's a corner case with cases where pairs have equal sums):
import itertools
def refkth(a,b,k):
(rightia,righta),(rightib,rightb) = sorted(itertools.product(enumerate(a),enumerate(b)), key=lambda pair: pair[0][1]+pair[1][1], reverse=True)[k-1] # largest sum first
data = k,righta,rightb,righta+rightb,rightia,rightib
narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
print (narrative) #this will work in both versions of python
return data, narrative
This calculates the cartesian product of the two arrays (i.e. all possible pairs), sorts them by sum, and takes the kth element. The enumerate function decorates each item with its index.
The max-heap algorithm in the other question is simple, fast and correct. Don't knock it. It's really well explained too. https://stackoverflow.com/a/5212618/284795
It might be that there isn't any O(k) algorithm. That's okay, O(k log k) is almost as fast.
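For reference, here is a minimal sketch of a max-heap approach in the O(k log k) spirit, assuming both arrays are sorted in ascending order. It is not copied from the linked answer; the function name `kth_largest_pair_sum` and the `seen` set are illustrative choices.

```python
import heapq


def kth_largest_pair_sum(a, b, k):
    """Return (sum, ia, ib) for the k-th largest a[ia] + b[ib].

    Assumes a and b are sorted ascending and 1 <= k <= len(a) * len(b).
    """
    if not 1 <= k <= len(a) * len(b):
        raise ValueError("k must be between 1 and len(a) * len(b)")
    ia, ib = len(a) - 1, len(b) - 1
    # heapq is a min-heap, so sums are stored negated to pop the largest first.
    heap = [(-(a[ia] + b[ib]), ia, ib)]
    seen = {(ia, ib)}
    for _ in range(k - 1):
        _, ia, ib = heapq.heappop(heap)
        # The only new candidates are one step down in either array.
        for na, nb in ((ia - 1, ib), (ia, ib - 1)):
            if na >= 0 and nb >= 0 and (na, nb) not in seen:
                seen.add((na, nb))
                heapq.heappush(heap, (-(a[na] + b[nb]), na, nb))
    negated_sum, ia, ib = heap[0]
    return -negated_sum, ia, ib
```

Each of the k-1 pops pushes at most two new candidates, so the heap never holds more than about 2k entries.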
If the last two solutions were at (a1, b1), (a2, b2), then it seems to me there are only four candidate solutions (a1-1, b1) (a1, b1-1) (a2-1, b2) (a2, b2-1). This intuition could be wrong. Surely there are at most four candidates for each coordinate, and the next highest is among the 16 pairs (a in {a1,a2,a1-1,a2-1}, b in {b1,b2,b1-1,b2-1}). That's O(k).
(No it's not, still not sure whether that's possible.)
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
Merge the 2 arrays and note down the indexes in the sorted array. Here is what the index array looks like (starting from 1, not 0):
[1, 2, 4, 6, 8]
[3, 5, 7, 9]
Now start from the end and make tuples. Sum the elements in each tuple and pick the kth largest sum.
public static List<List<Integer>> optimization(int[] nums1, int[] nums2, int k) {
// 2 * O(n log(n))
Arrays.sort(nums1);
Arrays.sort(nums2);
List<List<Integer>> results = new ArrayList<>(k);
int endIndex = 0;
// Find the number whose square is the first one bigger than k
for (int i = 1; i <= k; i++) {
if (i * i >= k) {
endIndex = i;
break;
}
}
// The following iteration produces at most endIndex^2 pairs, and both arrays are in ascending order,
// so the k smallest pairs are expected to be found in this iteration. To flatten the nested loop, see
// 'https://stackoverflow.com/questions/7457879/algorithm-to-optimize-nested-loops'
for (int i = 0; i < endIndex * endIndex; i++) {
int m = i / endIndex;
int n = i % endIndex;
List<Integer> item = new ArrayList<>(2);
item.add(nums1[m]);
item.add(nums2[n]);
results.add(item);
}
results.sort(Comparator.comparing(pair->pair.get(0) + pair.get(1)));
return results.stream().limit(k).collect(Collectors.toList());
}
Key to eliminating O(n^2):
- Avoid the cartesian product (or 'cross join'-like operation) of both arrays, which means flattening the nested loop.
- Downsize the iteration over the 2 arrays.
So:
- Sort both arrays (Arrays.sort offers O(n log(n)) performance according to the Java doc).
- Limit the iteration range to a size just big enough to support searching for the k smallest pairs.

Given a sorted array, find the maximum subarray of repeated values

Yet another interview question asked me to find the maximum possible subarray of repeated values given a sorted array in shortest computational time possible.
Let input array be A[1 ... n]
Find an array B of consecutive integers in A such that:
for x in range(len(B)-1):
B[x] == B[x+1]
I believe that the best algorithm is to divide the array in half, start at the middle, and work outwards, comparing neighbouring integers to find the longest run of equal integers around the middle. Then I would call the method recursively on the two halves of the array.
My interviewer said my algorithm is good but my analysis that the algorithm is O(logn) is incorrect but never got around to telling me what the correct answer is. My first question is what is the Big-O analysis of this algorithm? (Show as much work as possible please! Big-O is not my forte.) And my second question is purely for my curiosity whether there is an even more time efficient algorithm?
The best you can do for this problem is an O(n) solution, so your algorithm cannot possibly be both correct and O(lg n).
Consider for example, the case where the array contains no repeated elements. To determine this, one needs to examine every element, and examining every element is O(n).
This is a simple algorithm that will find the longest subsequence of a repeated element:
start = end = 0
maxLength = 0
i = 0
while i + maxLength < a.length:
    if a[i] == a[i + maxLength]:
        while i + maxLength < a.length and a[i] == a[i + maxLength]:
            maxLength += 1
        start = i
        end = i + maxLength
    i += maxLength
return a[start:end]
If you have reason to believe the subsequence will be long, you can set the initial value of maxLength to some heuristically selected value to speed things along, and then only look for shorter sequences if you don't find one (i.e. you end up with end == 0 after the first pass.)
I think we all agree that in the worst case scenario, where all of A is unique or where all of A is the same, you have to examine every element in the array to either determine there are no duplicates or determine all the array contains one number. Like the other posters have said, that's going to be O(N). I'm not sure divide & conquer helps you much with algorithmic complexity on this one, though you may be able to simplify the code a bit by using recursion. Divide & conquer really helps cut down on Big O when you can throw away large portions of the input (e.g. Binary Search), but in the case where you potentially have to examine all the input, it's not going to be much different.
I'm assuming the result here is you're just returning the size of the largest B you've found, though you could easily modify this to return B instead.
So on the algorithm front, given that A is sorted, I'm not sure there's going to be any faster/simpler answer than just walking through the array in order. It seems like the simplest answer is to have 2 pointers, one starting at index 0 and one starting at index 1. Compare them and then increment them both; each time they're the same you tick a counter upward to give you the current size of B and when they differ you reset that counter to zero. You also keep around a variable for the max size of a B you've found so far and update it every time you find a bigger B.
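A minimal sketch of that single pass, written so that it returns B itself rather than just its size. The name `longest_run` is illustrative, and the input is assumed to be a sorted Python list, so equal values are always adjacent.

```python
def longest_run(a):
    """Return the longest run of equal adjacent values in a sorted list."""
    if not a:
        return []
    best_start, best_len = 0, 1
    run_start = 0
    # Go one step past the end so the final run is closed like any other.
    for i in range(1, len(a) + 1):
        if i == len(a) or a[i] != a[run_start]:
            run_len = i - run_start
            if run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = i
    return a[best_start:best_start + best_len]
```

Every element is compared a constant number of times, so the scan is O(n). On ties the earliest run wins.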
In this algorithm, n elements are visited with a constant number of calculations per each visited element, so the running time is O(n).
Given sorted array A[1..n]:
max_start = max_end = 1
max_length = 1
start = end = 1
while start <= n
    # check the bound first so A[end] is never read past index n
    while end <= n && A[start] == A[end]
        end++
    if end - start > max_length
        max_start = start
        max_end = end - 1
        max_length = end - start
    start = end
Assuming that the longest run of equal integers has length 1, you'll be scanning through the entire array A of n items. Thus, the complexity is not in terms of n, but in terms of len(B).
Not sure if the complexity is O(n/len(B)).
Checking the two edge cases:
- When n == len(B), you get an instant result (only checking A[0] and A[n-1]).
- When len(B) == 1, you get O(n), checking all elements.
- In the normal case, I'm too lazy to write the algo to analyze...
Edit
Given that len(B) is not known in advance, we must take the worst case, i.e. O(n)

Resources