Efficient algorithm to calculate the mode of a hidden array - algorithm

I'm trying to solve the extension to a problem I described in my question: Efficient divide-and-conquer algorithm
For this extension, there are known to be representatives for 3 parties at the event, and there are more members for 1 party attending than for any other. A formal description of the problem can be found below.
You are given an integer n. There is a hidden array A of size n, which contains elements that can take 1 of 3 values. There is a value, let this be m, that appears more often in the array than the other 2 values.
You are allowed queries of the form introduce(i, j), where i≠j, and 1 <= i, j <= n, and you will get a boolean value in return: You will get back 1, if A[i] = A[j], and 0 otherwise.
Output: B ⊆ [1, 2, ..., n] where the A-value of every element in B is m.
A brute-force solution could calculate B in O(n^2) by calling introduce(i, j) on all n(n-1) ordered pairs of elements and building 3 lists of A-indexes, one per group of elements for which introduce returned 1, then returning the largest list.
I understand the Boyer–Moore majority vote algorithm but can't find a way to modify it for this problem or find an efficient algorithm to solve it.

Scan for all A[i] = A[0], and make a list I[] of all i for which A[i] != A[0]. Then scan for all A[I[j]] = A[I[0]], and so on. This requires one O(n) scan for each possible value in A[].
[I assume if introduce(i, j) = 1 and introduce(j, k) = 1, then introduce(i, k) = 1 -- so you don't need to check all combinations of elements.]
Of course, this doesn't tell you what 'm' is; it just makes one list per distinct value, where each list holds all the 'i' for which A[i] is the same.
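The scan-and-split idea above can be sketched in Python as follows. This is a minimal sketch, not the answerer's code: the name `find_mode_indices` is hypothetical, indices are 0-based rather than the question's 1-based, `introduce` is passed in as a callable, and equality is assumed to be transitive as noted above. The longest group is returned as B.

```python
def find_mode_indices(n, introduce):
    """Group indices 0..n-1 by equal hidden value and return the largest group.

    `introduce(i, j)` is assumed to return a truthy value when A[i] == A[j].
    """
    remaining = list(range(n))
    groups = []
    while remaining:
        rep = remaining[0]            # representative of a new group
        same, rest = [rep], []
        for i in remaining[1:]:
            if introduce(rep, i):
                same.append(i)        # same hidden value as the representative
            else:
                rest.append(i)        # left for a later scan
        groups.append(same)
        remaining = rest
    return max(groups, key=len)


# Hypothetical hidden array used only to simulate the oracle.
hidden = [2, 0, 2, 1, 2, 0]
result = find_mode_indices(len(hidden), lambda i, j: hidden[i] == hidden[j])
```

With only 3 possible values there are at most 3 scans, so fewer than 3n queries are made in total.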

Related

Sample number with equal probability which is not part of a set

I have a number n and a set of numbers S ∈ [1..n]* with size s (which is substantially smaller than n). I want to sample a number k ∈ [1..n] with equal probability, but the number is not allowed to be in the set S.
I am trying to solve the problem in at worst O(log n + s). I am not sure whether it's possible.
A naive approach is creating an array of numbers from 1 to n excluding all numbers in S and then pick one array element. This will run in O(n) and is not an option.
Another approach may be just generating random numbers ∈[1..n] and rejecting them if they are contained in S. This has no theoretical bound as any number could be sampled multiple times even if it is in the set. But on average this might be a practical solution if s is substantially smaller than n.
Say s is sorted. Generate a random number between 1 and n-s, call it k. We've chosen the k'th element of {1,...,n} - s. Now we need to find it.
Use binary search on s to find the count of the elements of s <= k. This takes O(log |s|). Add this to k. In doing so, we may have passed or arrived at additional elements of s. We can adjust for this by incrementing our answer for each such element that we pass, which we find by checking the next larger element of s from the point we found in our binary search.
E.g., n = 100, s = {1,4,5,22}, and our random number is 3. So our approach should return the third element of [2,3,6,7,...,21,23,24,...,100] which is 6. Binary search finds that 1 element is at most 3, so we increment to 4. Now we compare to the next larger element of s which is 4 so increment to 5. Repeating this finds 5 in s so we increment to 6. We check s once more, see that 6 isn't in it, so we stop.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100] which is 7. Binary search finds that 2 elements are at most 4, so we increment to 6. Now we compare to the next larger element of s which is 5 so increment to 7. We check s once more, see that the next number is > 7, so we stop.
If we assume that "s is substantially smaller than n" means |s| <= log(n), then we will increment at most log(n) times, and in any case at most s times.
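A possible Python rendering of the sorted-set approach above, using the standard `bisect` module. The function names `kth_allowed` and `sample_excluding` are illustrative, not from the answer, and the forbidden values are assumed to be distinct, sorted, and within 1..n.

```python
import bisect
import random


def kth_allowed(k, forbidden_sorted):
    """Return the k-th (1-based) element of {1..n} minus the forbidden set."""
    # Count forbidden values <= k and skip past that many positions.
    idx = bisect.bisect_right(forbidden_sorted, k)
    value = k + idx
    # Each further forbidden value we reach or pass pushes the answer up by one.
    while idx < len(forbidden_sorted) and forbidden_sorted[idx] <= value:
        value += 1
        idx += 1
    return value


def sample_excluding(n, forbidden_sorted):
    """Draw uniformly from {1..n} minus the forbidden set."""
    k = random.randint(1, n - len(forbidden_sorted))
    return kth_allowed(k, forbidden_sorted)
```

For the two worked examples, `kth_allowed(3, [1, 4, 5, 22])` and `kth_allowed(4, [1, 4, 5, 22])` follow the same increments as described in the text.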
If s is not sorted then we can do the following. Create an array of bits of size s. Generate k. Parse s and do two things: 1) count the number of elements < k, call this r; 2) at the same time, set the i'th bit to 1 if k+i is in s (0-indexed, so if k is in s then the first bit is set).
Now, increment k a number of times equal to r plus the number of set bits in the array with an index <= the number of times incremented.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100] which is 7. We parse s and 1) note that 1 element is below 4 (r=1), and 2) set our array to [1, 1, 0, 0]. We increment once for r=1 and an additional two times for the two set bits, ending up at 7.
This is O(s) time, O(s) space.
This is an O(1) solution with O(s) initial setup that works by mapping each non-allowed number > s to an allowed number <= s.
Let S be the set of non-allowed values, S(i), where i = [1 .. s] and s = |S|.
Here's a two part algorithm. The first part constructs a hash table based only on S in O(s) time, the second part finds the random value k ∈ {1..n}, k ∉ S in O(1) time, assuming we can generate a uniform random number in a contiguous range in constant time. The hash table can be reused for new random values and also for new n (assuming S ⊂ { 1 .. n } still holds of course).
To construct the hash H, first set j = 1. Then iterate over S(i), the elements of S. They do not need to be sorted. If S(i) > s, add the key-value pair (S(i), j) to the hash table, unless j ∈ S, in which case increment j until it is not. Finally, increment j.
To find a random value k, first generate a uniform random value in the range s + 1 to n, inclusive. If k is a key in H, then k = H(k). I.e., we do at most one hash lookup to ensure k is not in S.
Python code to generate the hash:
def substitute(S):
    H = dict()
    j = 1  # next candidate replacement value in 1..len(S)
    for s in S:
        if s > len(S):
            # skip candidates that are themselves forbidden
            while j in S: j += 1
            H[s] = j
            j += 1
    return H
For the actual implementation to be O(s), one might need to convert S into something like a frozenset to ensure the test for membership is O(1), and also move the loop-invariant len(S) out of the loop. Assuming the j in S test and the insertion into the hash (H[s] = j) are constant time, this should have complexity O(s).
The generation of a random value is simply:
import random

def myrand(n, s, H):
    # s is the size of S; values s+1..n that are forbidden are remapped via H
    k = random.randint(s + 1, n)
    return (H[k] if k in H else k)
If one is only interested in a single random value per S, then the algorithm can be optimized to improve the common case, while the worst case remains the same. This still requires S be in a hash table that allows for a constant time "element of" test.
import random

def rand_not_in(n, S):
    k = random.randint(len(S) + 1, n)
    if k not in S: return k
    # k is forbidden: rebuild the mapping only until k's substitute is found
    j = 1
    for s in S:
        if s > len(S):
            while j in S: j += 1
            if s == k: return j
            j += 1
Optimizations are: Only generate the mapping if the random value is in S. Don't save the mapping to a hash table. Short-circuit the mapping generation when the random value is found.
Actually, the rejection method seems like the practical approach.
Generate a number in 1...n and check whether it is forbidden; regenerate until the generated number is not forbidden.
The probability of a single rejection is p = s/n.
Thus the expected number of random number generations is 1 + p + p^2 + p^3 + ... which is 1/(1-p), which in turn is equal to n/(n-s).
Now, if s is much less than n, or even as large as s = n/2, this expected number is at most 2.
It would take s almost equal to n to make it infeasible in practice.
Multiply the expected time by log s if you use a tree-set to check whether the number is in the set, or by just 1 (expected value again) if it is a hash-set. So the average time is O(1) or O(log s) depending on the set implementation. There is also O(s) memory for storing the set, but unless the set is given in some special way, implicitly and concisely, I don't see how it can be avoided.
(Edit: As per comments, you do this only once for a given set.
If, additionally, we are out of luck, and the set is given as a plain array or list, not some fancier data structure, we get O(s) expected time with this approach, which still fits into the O(log n + s) requirement.)
If attacks against the unbounded algorithm are a concern (and only if they truly are), the method can include a fall-back algorithm for the cases when a certain fixed number of iterations didn't provide the answer.
Similarly to how IntroSort is QuickSort but falls back to HeapSort if the recursion depth gets too high (which is almost certainly a result of an attack resulting in quadratic QuickSort behavior).
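The rejection method described in this answer could look like the following in Python. This is only a sketch: the name `sample_by_rejection` is made up for illustration, the forbidden values are assumed to lie in 1..n, and no fall-back for long rejection streaks is included.

```python
import random


def sample_by_rejection(n, forbidden):
    """Draw uniformly from {1..n} minus `forbidden` by redrawing on a hit."""
    forbidden = frozenset(forbidden)      # O(1) expected membership test
    if len(forbidden) >= n:
        raise ValueError("no allowed value left to sample")
    while True:
        k = random.randint(1, n)          # inclusive on both ends
        if k not in forbidden:
            return k
```

Each draw is rejected with probability s/n, so the loop runs n/(n-s) times on average, matching the estimate above.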
Find all numbers that are in the forbidden set and less than or equal to n-s. Call this array A.
Find all numbers that are not in the forbidden set and greater than n-s. Call this array B. This can be done in O(s) if the set is sorted.
Note that the lengths of A and B are equal, and create the mapping map[A[i]] = B[i].
Generate a number t in 1..n-s. If map[t] exists, return it; otherwise return t.
This takes O(s) insertions into a map plus 1 lookup, which is either O(s) on average or O(s log s), depending on the map implementation.
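A small Python sketch of this A/B mapping, under the assumption that the forbidden values are distinct and within 1..n. The names `build_mapping` and `sample_with_mapping` are hypothetical; a Python set is used for membership tests instead of requiring sorted input.

```python
import random


def build_mapping(n, forbidden):
    """Map each forbidden value <= n-s to an allowed value > n-s."""
    forbidden = set(forbidden)
    limit = n - len(forbidden)
    low_forbidden = [x for x in forbidden if x <= limit]            # array A
    high_allowed = [x for x in range(limit + 1, n + 1)
                    if x not in forbidden]                          # array B
    return dict(zip(low_forbidden, high_allowed))


def sample_with_mapping(n, s, mapping):
    """Draw t in 1..n-s and redirect it if it is forbidden."""
    t = random.randint(1, n - s)
    return mapping.get(t, t)


mapping = build_mapping(10, {1, 4, 5, 9})
value = sample_with_mapping(10, 4, mapping)
```

Only the s values above n-s are scanned to build B, so the setup stays proportional to s rather than n.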

Divide linked-list into 2 sublists with equal sum

I'm trying to divide a linked-list into 2 sublists with equal sum. These sublists do not need to consist of consecutive elements.
I have a linked list as
Eg.1
LinkedList={1,7,5,5,4}
should be divided into
LinkedList1={1,5,5}
LinkedList2={7,4}
Both have the same sum of elements as 11.
Eg.2
LinkedList={42,2,3,2,2,2,5,20,2,20}
This should be divided into two lists of equal sum, i.e. 50.
LinkedList1={42,3,5}
LinkedList2={2,2,2,2,20,2,20}
Can someone provide some pseudocode to solve this problem?
This is what I've thought so far:
Sum the elements of linked list and divide by 2.
While the sum of linkedlist1 is less than half the sum of the linked list, keep pushing elements into linkedlist1.
If the sums are not equal and linkedlist1 is still below half the total, move to the next element; the current element can be pushed to linkedlist2.
But this would only work if the elements are in a particular order.
This is known as the partition problem.
There are a few approaches to solving the problem, but I'll just mention the most common 2 below (see Wikipedia for more details on either approach or other approaches).
This can be solved with a dynamic programming approach, which basically comes down to, for each element and value, either including or excluding that element, and looking up whether there's a subset summing to the corresponding value. More specifically, we have the following recurrence relation:
p(i, j) is True if a subset of { x1, ..., xj } sums to i and False otherwise.
p(i, j) is True if either p(i, j − 1) is True or if p(i − xj, j − 1) is True
p(i, j) is False otherwise
Then p(N/2, n) tells us whether a subset exists.
The running time is O(Nn) where n is the number of elements in the input set and N is the sum of elements in the input set.
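As a rough illustration of the dynamic programming approach, the sketch below tracks which sums are reachable and remembers how each was first reached, so that one subset can be reconstructed. It assumes non-negative integers; the function name `equal_sum_partition` and the dictionary-based table are choices made here, not part of the answer.

```python
def equal_sum_partition(values):
    """Return two lists with equal sums, or None if no such split exists."""
    total = sum(values)
    if total % 2:
        return None
    target = total // 2
    # parent[t] = (previous sum, index of the element added to reach t)
    parent = {0: None}
    for idx, v in enumerate(values):
        for t in list(parent):            # snapshot: each element used once
            nt = t + v
            if nt <= target and nt not in parent:
                parent[nt] = (t, idx)
    if target not in parent:
        return None
    chosen = set()
    t = target
    while parent[t] is not None:          # walk back to sum 0
        t, idx = parent[t]
        chosen.add(idx)
    first = [values[i] for i in sorted(chosen)]
    second = [v for i, v in enumerate(values) if i not in chosen]
    return first, second


halves = equal_sum_partition([1, 7, 5, 5, 4])
```

The table never holds more than N/2 + 1 sums and is scanned once per element, which is in line with the O(Nn) bound.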
The "approximate" greedy approach (doesn't necessarily find an equal-sum partition) is pretty straight-forward - it just involves putting each element in the set with the smallest sum. Here's the pseudo-code:
INPUT: A list of integers S
OUTPUT: An attempt at a partition of S into two sets of equal sum
function find_partition( S ):
    A ← {}
    B ← {}
    sort S in descending order
    for i in S:
        if sum(A) <= sum(B)
            add element i to set A
        else
            add element i to set B
    return {A, B}
The running time is O(n log n).
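The greedy pseudo-code translates fairly directly to Python. In this sketch (the name `greedy_partition` is illustrative) the running sums are kept in variables instead of being recomputed, which is what keeps the cost dominated by the sort.

```python
def greedy_partition(values):
    """Put each element, largest first, into the list with the smaller sum."""
    a, b = [], []
    sum_a = sum_b = 0
    for v in sorted(values, reverse=True):
        if sum_a <= sum_b:
            a.append(v)
            sum_a += v
        else:
            b.append(v)
            sum_b += v
    return a, b


good = greedy_partition([1, 7, 5, 5, 4])   # balanced split is found
bad = greedy_partition([3, 3, 2, 2, 2])    # {3,3} / {2,2,2} exists but is missed
```

The second call shows why the approach is only approximate: the greedy choice ends with sums 7 and 5 even though an equal split exists.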

Minimal number of swaps?

There are N characters in a string of types A and B in the array (same amount of each type). What is the minimal number of swaps to make sure that no two adjacent chars are same if we can only swap two adjacent characters ?
For example, input is:
AAAABBBB
The minimal number of swaps is 6 to make the array ABABABAB. But how would you solve it for any kind of input ? I can only think of O(N^2) solution. Maybe some kind of sort ?
If we need just to count swaps, then we can do it with O(N).
Let's assume for simplicity that array X of N elements should become ABAB... .
GetCount()
    swaps = 0, i = -1, j = -1
    for(k = 0; k < N; k++)
        if(k % 2 == 0)
            i = FindIndexOf(A, max(k, i))
            X[k] <-> X[i]
            swaps += i - k
        else
            j = FindIndexOf(B, max(k, j))
            X[k] <-> X[j]
            swaps += j - k
    return swaps

FindIndexOf(element, index)
    while(index < N)
        if(X[index] == element) return index
        index++
    return -1; // should never happen if count of As == count of Bs
Basically, we run from left to right, and if a misplaced element is found, it gets exchanged with the correct element (e.g. abBbbbA** --> abAbbbB**) in O(1). At the same time swaps are counted as if the sequence of adjacent elements would be swapped instead. Variables i and j are used to cache indices of next A and B respectively, to make sure that all calls together of FindIndexOf are done in O(N).
If we need to sort by swaps then we cannot do better than O(N^2).
The rough idea is the following. Let's consider your sample: AAAABBBB. One of Bs needs O(N) swaps to get to the A B ... position, another B needs O(N) to get to A B A B ... position, etc. So we get O(N^2) at the end.
Observe that if any solution would swap two instances of the same letter, then we can find a better solution by dropping that swap, which necessarily has no effect. An optimal solution therefore only swaps differing letters.
Let's view the string of letters as an array of indices of one kind of letter (arbitrarily chosen, say A) into the string. So AAAABBBB would be represented as [0, 1, 2, 3] while ABABABAB would be [0, 2, 4, 6].
We know two instances of the same letter will never swap in an optimal solution. This lets us always safely identify the first (left-most) instance of A with the first element of our index array, the second instance with the second element, etc. It also tells us our array is always in sorted order at each step of an optimal solution.
Since each step of an optimal solution swaps differing letters, we know our index array evolves at each step only by incrementing or decrementing a single element at a time.
An initial string of length n = 2k will have an array representation A of length k. An optimal solution will transform this array to either
ODDS = [1, 3, 5, ... 2k - 1]
or
EVENS = [0, 2, 4, ... 2k - 2]
Since we know in an optimal solution instances of a letter do not pass each other, we can conclude an optimal solution must spend min(abs(ODDS[0] - A[0]), abs(EVENS[0] - A[0])) swaps to put the first instance in correct position.
By realizing the EVENS or ODDS choice is made only once (not once per letter instance), and summing across the array, we can count the minimum number of needed swaps as
define count_swaps(length, initial, goal)
    total = 0
    for i from 0 to length - 1
        total += abs(goal[i] - initial[i])
    end
    return total
end
define count_minimum_needed_swaps(k, A)
    return min(count_swaps(k, A, EVENS), count_swaps(k, A, ODDS))
end
Notice the number of loop iterations implied by count_minimum_needed_swaps is 2 * k = n; it runs in O(n) time.
By noting which term is smaller in count_minimum_needed_swaps, we can also tell which of the two goal states is optimal.
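The index-array argument above can be checked with a few lines of Python. This is a sketch under the question's assumptions (even length, equal numbers of the two letters); `min_adjacent_swaps` is a name chosen here.

```python
def min_adjacent_swaps(s, letter="A"):
    """Minimum adjacent swaps to make `s` alternate, via index distances."""
    positions = [i for i, c in enumerate(s) if c == letter]
    # Goal EVENS: the i-th instance of `letter` ends at index 2*i.
    to_evens = sum(abs(p - 2 * i) for i, p in enumerate(positions))
    # Goal ODDS: the i-th instance of `letter` ends at index 2*i + 1.
    to_odds = sum(abs(p - (2 * i + 1)) for i, p in enumerate(positions))
    return min(to_evens, to_odds)


swaps = min_adjacent_swaps("AAAABBBB")
```

For the sample input the EVENS goal costs 0 + 1 + 2 + 3 = 6 and the ODDS goal costs 1 + 2 + 3 + 4 = 10, so the function agrees with the 6 swaps stated in the question.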
Since you know N, you can simply write a loop that generates the values with no swaps needed.
#define N 4
char array[N + N];
for (size_t z = 0; z < N + N; z++)
{
array[z] = 'B' - ((z & 1) == 0);
}
return 0; // The number of swaps
@Nemo and @AlexD are right. The algorithm is order n^2. @Nemo misunderstood that we are looking for a reordering where no two adjacent characters are the same, so we cannot rely on the idea that an A placed after a B is out of order.
Let's look at the minimum number of swaps.
We don't care whether our first character is A or B, because we can apply the same algorithm with A and B exchanged everywhere. So let's assume that the word WORD_N has length 2N, with N As and N Bs, and starts with an A. (I am using length 2N to simplify the calculations.)
What we will do is try to move the next B right next to this A, without worrying about the positions of the other characters, because then we will have reduced the problem to reordering a new word WORD_{N-1}. Let's also assume that the next B is not directly after the A if the word has more than 2 characters, because then the first step is already done and we reduce the problem to the next set of characters, WORD_{N-1}.
In the worst case the next B is as far away as possible, so it is after half of the word, and we need N-1 swaps to put this B after the A (maybe fewer). Then our word can be reduced to WORD_N = [A B WORD_{N-1}].
We see that we have to perform this step at most N-1 times, because the last word (WORD_1) will already be ordered. Performing it N-1 times, we have to make at most (N-1) + (N-2) + ... + 1 swaps, that is,
N_swaps = (N-1)*N/2
where N is half of the length of the initial word.
Let's see why we can apply the same algorithm to WORD_{N-1}, again assuming that its first character is A. Here it matters that the first character is the same as in the pair that is already ordered. We can be sure that the first character of WORD_{N-1} is A because it was the character right next to the first character of our initial word. If that character was B, the first step performs at most one swap between these two characters, or none, and WORD_{N-1} already starts with the same character as WORD_N, while the first two characters of WORD_N are different, at the cost of at most 1 swap.
I think this answer is similar to the answer by phs, just in Haskell. The idea is that the resultant-indices for A's (or B's) are known so all we need to do is calculate how far each starting index has to move and sum the total.
Haskell code:
Prelude Data.List> let is = elemIndices 'B' "AAAABBBB"
in minimum
$ map (sum . zipWith ((abs .) . (-)) is) [[1,3..],[0,2..]]
6 --output

Given a set of n integers, list all possible subsets with sum>=k

Given an unsorted set of integers in the form of array, find all possible subsets whose sum is greater than or equal to a const integer k,
eg:- Our set is {1,2,3} and k=2
Possible subsets:-
{2},
{3},
{1,2},
{1,3},
{2,3},
{1,2,3}
I can only think of a naive algorithm which lists all the subsets of the set and checks whether the sum of each subset is >= k or not, but it's an exponential algorithm, and listing all subsets requires O(2^N). Can I use dynamic programming to solve it in polynomial time?
Listing all the subsets is going to be still O(2^N) because in the worst case you may still have to list all subsets apart from the empty one.
Dynamic programming can help you count the number of sets that have sum >= K
You go bottom-up keeping track of how many subsets summed to some value from range [1..K]. An approach like this will be O(N*K) which is going to be only feasible for small K.
The idea with the dynamic programming solution is best illustrated with an example. Consider this situation. Assume you know that out of all the sets composed of the first i elements you know that t1 sum to 2 and t2 sum to 3. Let's say that the next i+1 element is 4. Given all the existing sets we can build all the new sets by either appending the element i+1 or leaving it out. If we leave it out we get t1 subsets that sum to 2 and t2 subsets that sum to 3. If we append it then we obtain t1 subsets that sum to 6 (2 + 4) and t2 that sum to 7 (3 + 4) and one subset which contains just i+1 which sums to 4. That gives us the numbers of subsets that sum to (2,3,4,6,7) consisting of the first i+1 elements. We continue until N.
In pseudo-code this could look something like this:
int DP[N][K];
int set[N];
//go through all elements in the set by index
for i in range[0..N-1]
    //count the one element subset consisting only of set[i]
    //(only sums below K are tracked, so skip elements that are >= K)
    if set[i] < K then DP[i][set[i]] = 1
    if (i == 0) continue;
    //case 1. build and count all subsets that don't contain element set[i]
    for k in range[1..K-1]
        DP[i][k] += DP[i-1][k]
    //case 2. build and count subsets that contain element set[i]
    for k in range[0..K-1]
        if k + set[i] >= K then break inner loop
        DP[i][k+set[i]] += DP[i-1][k]
//result is the number of all subsets - number of subsets with sum < K
//the -1 is for the empty subset
return 2^N - sum(DP[N-1][1..K-1]) - 1
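For comparison, here is a compact Python sketch of the same counting idea using a single array indexed by sum. It is a variation rather than a transcription of the pseudo-code: the empty subset is kept in the table (at sum 0), elements are assumed to be non-negative integers, and the function name is hypothetical.

```python
def count_subsets_with_sum_at_least(values, k):
    """Count subsets of `values` (as a multiset of indices) with sum >= k."""
    n = len(values)
    if k <= 0:
        return 2 ** n                     # every subset qualifies, even the empty one
    # below[t] = number of subsets with sum exactly t, for 0 <= t < k
    below = [0] * k
    below[0] = 1                          # the empty subset
    for v in values:
        # Go downwards so each element is used at most once per subset.
        for t in range(k - 1, v - 1, -1):
            below[t] += below[t - v]
    return 2 ** n - sum(below)


count = count_subsets_with_sum_at_least([1, 2, 3], 2)
```

The work is proportional to N*K, as in the answer; the final subtraction removes every subset whose sum stays below k.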
Can I use dynamic programming to solve it in polynomial time?
No. The problem is even harder than @amit (in the comments) mentions. Finding whether there exists a subset that sums to a specific k is the subset-sum problem, which is NP-hard. Instead you are asking how many subsets sum to a specific k, which is in the much more difficult class #P. In addition, your exact problem is slightly more difficult, since you want not only to count but to enumerate all the possible subsets for k and for targets > k.
If k is 0 and every element of the set is positive, then you have no choice but to output every possible subset, so the lower bound for this problem is Ω(2^N) -- the time taken to produce the output.
Unless you know something more about the value k that you haven't told us, there's no faster general solution than to just check every subset.

Finding kth smallest element in union of 2 sorted array

I think this question has been asked many times, but there still isn't a clear solution!
Anyway, this is what I found as a good answer in O(k) (possibly O(log m + log n) too). But I don't understand the part where, if M_B > M_A (or the other way round), we should be throwing away the elements after M_B. Here it's the reverse: the elements before M_B are thrown away. Can anyone please explain why?
http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15451-s01/recitations/rec03/rec03.ps
My other question is about doing k/2 ... we should be doing it, but it isn't obvious to me why.
[EDIT 1]
Example
A = [2, 9, 15, 22, 24, 25, 26, 30]
B = [1, 4, 5, 7, 18, 22, 27, 33]
k= 6
Answer is 9 (A[1])
Here is what I think: if I want to solve it in O(log k), I need to throw away k/2 elements each time.
Base solution: if K < 2: return 2nd smallest element from - A[0], A[1], B[0], B[1]
else:
compare A[k/2] and B[k/2]: if A[k/2] < B[k/2], then the kth smallest element will be in A[1 ... n] and B[1 ... K/2] ... okay, here I threw away k/2 elements (I can do similarly for A[k/2] > B[k/2]). So now the question is: next time, is the index still K, or k/2?
Is what I'm doing right?
That algorithm isn't bad -- it's better than the one which is usually referenced here on SO, in my opinion, because it's a lot simpler -- but it has one huge flaw: it requires that both vectors have at least k elements. (The problem says that they both have the same number of elements, n, but never specifies that n ≥ k; the function doesn't even let you tell it how big the vectors are. However, that's easily solved. I'll leave it as an exercise for now. In general, we'd need an algorithm like this to work on differently-sized arrays, and it does; we just need to be clear on the preconditions.)
The use of floor and ceil is nice and specific, but maybe confusing. Let's just look at this in the most general way. Also, the solution quoted seems to assume that arrays are 1-indexed (i.e. A[1] is the first element, not A[0]). The description I'm about to write, however, uses a more C-like pseudocode, so it assumes that A[0] is the first element. Consequently, I'm going to write it to find element k in the combined set, which is the (k+1)th element. And finally, the solution I'm about to describe differs subtly from the solution presented, which will be apparent in the end condition. IMHO, it's slightly better.
OK, if x is element k in a sequence, there are exactly k elements in the sequence smaller than x. (We won't deal with the case where there are repeated elements, but it's not much different. See note 3.)
Suppose that we know that A and B each have an element k. (Remember, this means they each have at least k + 1 elements.) Select any non-negative integer less than k; we'll call it i. And let j be k - i - 1 (so that i + j == k - 1). [See note 1, below.] Now, look at elements A[i] and B[j]. Let's say A[i] is smaller, since we just have to change all the names in the other case. Remember that we're assuming all the elements are different. So here's what we know at this point:
1) There are i elements in A which are < A[i]
2) There are j elements in B which are < B[j]
3) A[i] < B[j]
4) From (2) and (3), we know that:
5) There are at most j elements in B which are < A[i]
6) From (1) and (5), we know that:
7) There are at most i + j elements in A and B together which are < A[i]
8) But i + j is k - 1, so actually we know:
9) Element k of the merged array must be greater than A[i] (because A[i] is at most element i + j).
Since we know that the answer must be greater than A[i], we can discard A[0] through A[i] (actually, we just increment an array pointer, but effectively we'll discard them). However, we've now discarded i + 1 elements from the original problem. So out of the new set of elements (in the shortened A and the original B), we need element k - (i + 1), instead of the element k.
Now, let's check the precondition. We said that both A and B had an element k to start with, so they both have at least k + 1 elements. In the new problem we want to know whether the shortened A and the original B each have at least k - i elements. Clearly B does, because k - i is no greater than k. Also, we removed i + 1 elements from A. Originally it had at least k + 1 elements, so now it has at least k - i elements. So we're OK there.
Finally, let's check the termination condition. At the beginning I said that we choose non-negative integers i and j so that i + j == k - 1. That's not possible if k == 0, but it can be done for k == 1. So we only need to do something special once k reaches 0, in which case what we need to do is return min(A[0], B[0]). [This is a much simpler termination condition than in the algorithm you looked at, see Note 2.]
So what's a good strategy for picking i? We'll end up removing either i + 1 or k - i elements from the problem, and we'd like that to be as close to half of the elements as possible. So we should choose i = floor((k - 1) / 2). Although it might not be immediately obvious, that will make j = floor(k / 2).
I'm leaving out the bit where I solve the case where A and B have fewer elements. It's not complicated; I'd encourage you to think about it yourself.
[1] The algorithm you were looking at selects i + j == k (if k is even), and drops either i or j elements. Mine selects i + j == k - 1 (always) which might make one of them smaller, but then it drops i + 1 or j + 1 elements. So it should converge slightly more rapidly.
[2] The difference between selecting i + j == k (theirs) and i + j == k - 1 (mine) is apparent in the end condition. In their formulation, both i and j must be positive, because if one of them were 0, there is a risk of dropping 0 elements, which would be an infinite recursive loop. So in their formulation, the minimum possible value of k is 2, not 1, and so their termination case has to handle k == 1, which involves comparing between four elements, rather than two. For what it's worth, I believe the best solution of "find the second smallest element out of two sorted vectors" is: min(max(A[0], B[0]), min(A[1], B[1])), which requires three comparisons. This doesn't make their algorithm slower; just more complicated.
[3] Suppose elements could repeat. Actually this doesn't change anything. The algorithm still works. Why? Well, we could pretend that every element in A was actually a pair with its actual value and its actual index, and similarly for every element in B, and that we use the index as a tie breaker when comparing values within a vector. Between vectors, we give preference to all the elements in A if A[i] ≤ B[j]; otherwise to all the elements in B. This doesn't actually change the actual code at all, because we never actually have to do any comparison differently, but it makes all the inequalities in the proof valid.
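The procedure described in this answer might be written iteratively in Python as below. It is a sketch under the answer's stated precondition (both lists contain an element k, with k counted from 0); the function name `kth_of_two_sorted` and the offset variables `lo_a` and `lo_b` are introduced here, and the case of shorter lists is left out, as in the answer.

```python
def kth_of_two_sorted(a, b, k):
    """Return element k (0-based) of the merge of sorted lists a and b."""
    if k < 0 or k >= len(a) or k >= len(b):
        raise ValueError("precondition: both lists need more than k elements")
    lo_a = lo_b = 0                       # number of elements discarded so far
    while k > 0:
        i = (k - 1) // 2
        j = k - i - 1                     # so that i + j == k - 1
        if a[lo_a + i] <= b[lo_b + j]:
            lo_a += i + 1                 # a[lo_a .. lo_a+i] cannot be the answer
            k -= i + 1
        else:
            lo_b += j + 1                 # symmetric case for b
            k -= j + 1
    return min(a[lo_a], b[lo_b])


A = [2, 9, 15, 22, 24, 25, 26, 30]
B = [1, 4, 5, 7, 18, 22, 27, 33]
sixth_smallest = kth_of_two_sorted(A, B, 5)   # k = 6 in the question's 1-based terms
```

Because roughly half of the remaining k is discarded per step, the loop runs about log2(k) times.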

Resources