Where can I use a technique from Majority Vote algorithm - algorithm

As seen in the answers to Linear time majority algorithm?, it is possible to compute the majority of an array of elements in linear time and log(n) space.
Everyone who sees this algorithm seems to think it is a cool technique. But does the idea generalize to new algorithms?
It seems the hidden power of this algorithm is in keeping a counter that plays a complex role -- such as "(count of majority element so far) - (count of second majority so far)". Are there other algorithms based on the same idea?

Let's first understand why the algorithm works, in order to "isolate" the ideas behind it.
The point of the algorithm is that if you have a majority element, then you can match each occurrence of any other element with one occurrence of it, and you still have some occurrences to "spare".
So we just keep a counter of the number of "spare" occurrences of our guessed answer.
If it reaches 0, then the guess is not a majority element of the subsequence that starts where we "elected" the current element as the guessed majority element and ends at the current position.
Also, since our guessed element is matched against every other element occurrence in that subsequence, there is no majority element in that subsequence at all.
Now, since:
our algorithm gives a correct answer only if there is a majority element, and
if there is a majority element, then it is still one if we ignore the "current" subsequence when the counter goes to zero
it is easy to see by contradiction that, if a majority element exists, then there is a suffix of the whole sequence over which the counter never gets to zero.
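The argument above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the names `majority_candidate` and `majority` are made up for this example, and the second pass is only needed when a majority element is not guaranteed to exist.

```python
def majority_candidate(seq):
    """Return the only element that can be a majority of seq (None if seq is empty)."""
    candidate, spare = None, 0
    for x in seq:
        if spare == 0:
            # Everything seen so far has been paired off: elect a new guess.
            candidate, spare = x, 1
        elif x == candidate:
            spare += 1   # one more unmatched occurrence of the guess
        else:
            spare -= 1   # pair x with one spare occurrence of the guess
    return candidate


def majority(seq):
    """Confirm the candidate with a second pass; None if there is no majority."""
    seq = list(seq)
    cand = majority_candidate(seq)
    if seq and seq.count(cand) * 2 > len(seq):
        return cand
    return None
```

For example, `majority([1, 2, 1, 3, 1])` yields 1, while `majority([1, 2, 3])` yields `None` because no element occurs more than half the time.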
Now: what's the idea that can be exploited in new O(1)-space, O(n)-time algorithms?
To me, you can apply this technique whenever you have to compute a property P on a sequence of elements which:
can be extended from seq[n, m] to seq[n, m+1] in O(1) time if Q(seq[n, m+1]) doesn't hold
P(seq[n, m]) can be computed in O(1) time and space from P(seq[n, j]) and P(seq[j, m]) if Q(seq[n, j]) holds
In our case, P is the number of "spare" occurrences of our "elected" majority element and Q is "P is zero".
If you see things in that way, longest common subsequence exploits the same idea (dunno about its "coolness factor" ;))

Jayadev Misra and David Gries have a paper called Finding Repeated Elements (ACM page) which generalizes it to an element repeating more than n/k times (k=2 is the majority problem).
Of course, this is probably very similar to the original problem, and you are probably looking for 'different' algorithms.
Here is an example which is possibly different.
Give an algorithm which will detect if a string of parentheses ( '(' and ')') is well formed.
I believe the standard solution is to maintain a counter.
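A minimal Python sketch of that counter-based check, assuming the input contains only '(' and ')' (other characters are simply ignored here; `is_well_formed` is a name chosen for this example):

```python
def is_well_formed(s):
    """Return True if the parentheses in s are balanced and properly nested."""
    depth = 0  # number of '(' still waiting for a matching ')'
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                # A ')' appeared with no '(' left to match it.
                return False
    return depth == 0
```

As in the majority vote, a single counter summarizes everything the scan needs to remember about the prefix seen so far.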
Side note:
As for answers which claim this cannot be done in constant space: ask them which model of computation they assume. In the word RAM model, for instance, you assume that integers, array indices, etc. take O(1) space.
A lot of folks incorrectly mix and match models. For instance, they will happily count the input array of n integers as O(n) and an array index as O(1) space, but consider a counter to be Omega(log n), which is nonsense. If they want to measure size in bits, then the input itself is Omega(n log n).

For people who want to understand what this algorithm does and why it works: look at my detailed answer.
Here I will describe a natural extension (or generalization) of this algorithm. In the standard majority voting algorithm you have to find an element which appears more than n/2 times in the stream, where n is the size of the stream. You can do this in O(n) time (with a tiny constant) and in O(log(n)) space in the worst case, which is highly unlikely.
The generalized algorithm allows you to find the k most frequent items, where each item appears more than n/(k+1) times in the original stream. Note that if k=1, you end up with your original problem.
The solution to this problem is really similar to the original one, except that instead of one counter and one possible element, you maintain k counters and k possible elements. The logic is similar: you iterate through the array, and if the element is among the possible elements, you increase its counter; if one of the counters is zero, you substitute the element of that counter with the new element. Otherwise you decrease all the counters.
As with the original majority voting algorithm, you need a guarantee that these k majority elements exist; otherwise you have to do another pass over the array to verify that the possible elements you found are correct. Here is my Python attempt (I have not tested it thoroughly).
    from collections import defaultdict

    def majority_element_general(arr, k=1):
        counter, i = defaultdict(int), 0
        # Fill up to k distinct candidates from the front of the array.
        while len(counter) < k and i < len(arr):
            counter[arr[i]] += 1
            i += 1
        for x in arr[i:]:
            if x in counter:
                counter[x] += 1
            elif len(counter) < k:
                counter[x] = 1
            else:
                # x matches no candidate and no slot is free:
                # decrement every counter and drop those that reach zero.
                fields_to_remove = []
                for el in counter:
                    if counter[el] > 1:
                        counter[el] -= 1
                    else:
                        fields_to_remove.append(el)
                for el in fields_to_remove:
                    del counter[el]
        potential_elements = list(counter.keys())
        # might want to check that they are really frequent.
        return potential_elements
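The verification pass mentioned in the comment could look roughly like this. It is only a sketch: `confirm_frequent` is a hypothetical helper, and it assumes "frequent" means strictly more than n/(k+1) occurrences.

```python
from collections import Counter

def confirm_frequent(arr, candidates, k=1):
    """Keep only the candidates that occur more than len(arr) / (k + 1) times."""
    wanted = set(candidates)
    # Second pass over the data: count only the surviving candidates.
    counts = Counter(x for x in arr if x in wanted)
    threshold = len(arr) / (k + 1)
    return [c for c in wanted if counts[c] > threshold]
```

The extra pass costs O(n) time and O(k) space, so it does not change the overall bounds.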

Related

constant memory reservoir sampling, O(k) possible?

I have an input stream, of size n, and I want to produce an output stream of size k that contains distinct random elements of the input stream, without requiring any additional memory for elements selected by the sample.
The algorithm I was going to use is basically as follows:
    for each element in input stream
        if random()<k/n
            decrement k
            output element
            if k = 0
                halt
            end if
        end if
        decrement n
    end for
The function random() generates a number from [0..1) on a random distribution, and I trust the algorithm's principle of operation is straightforward.
Although this algorithm can terminate early when it selects the last element, in general the algorithm is still approximately O(n). At first it seemed to work as intended (outputting roughly uniformly distributed but still random elements from the input stream), but I think there may be a non-uniform tendency to pick later elements when k is much less than n. I'm not sure about this, however... so I'd appreciate knowing for sure one way or the other. I'm also wondering if a faster algorithm exists. Obviously, since k elements must be generated, the algorithm cannot be any faster than O(k). For an O(k) solution, one could assume the existence of a function skip(x), which can skip over x elements in the input stream in O(1) time (but cannot skip backwards). I would still like to keep the requirement of not requiring any additional memory, however.
If it is a real stream, you need O(n) time to scan it.
Your existing algorithm is good. (I got that wrong before.) You can prove by induction that the probability that you have not picked the first element in i tries is 1 - i/n = (n-i)/n. First, that is true for i=0 by inspection. Now if you have not picked it in the first i tries, the odds that the next try picks it are 1/(n-i). So the odds of picking it on the (i+1)'th try are ((n-i)/n) * (1/(n-i)) = 1/n. Which means that the odds of not picking it in the first i+1 tries are 1 - i/n - 1/n = 1 - (i+1)/n. That completes the induction. And so the odds of picking the first element in the first k tries are the odds of not having not picked it, or 1 - (n-k)/n = k/n.
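If you would rather check uniformity empirically, here is a small Python sketch of the question's selection loop. It assumes n is the exact length of the stream and k <= n; the generator name `sample_in_order` and the `rng` parameter are inventions of this example.

```python
import random

def sample_in_order(items, n, k, rng=random):
    """Yield k of the n items, in stream order, using O(1) extra memory."""
    if k <= 0:
        return
    for item in items:
        # Select with probability (still needed) / (still remaining).
        if rng.random() < k / n:
            k -= 1
            yield item
            if k == 0:
                return  # halt early once the sample is complete
        n -= 1
```

Running it many times on a short stream and tallying how often each position is selected should give frequencies close to k/n for every position, including the early ones.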
But what if you have O(1) access to any element? Well note that choosing k to take is the same as choosing n-k to leave. So without loss of generality we can assume that k <= n/2. What that means is that we can use a randomized algorithm like this:
    chosen = set()
    count_chosen = 0
    while count_chosen < k:
        choice = random_element(stream)
        if choice not in chosen:
            chosen.add(choice)
            count_chosen = count_chosen + 1
The set will be O(k) space, and since the probability of each random choice being new to you is at least 0.5, the expected running time is no worse than 2k choices.

Minimal non-contiguous sequence of exactly k elements

The problem I'm having can be reduced to:
Given an array of N positive numbers, find the non-contiguous sequence of exactly K elements with the minimal sum.
Ok-ish: report the sum only. Bonus: the picked elements can be identified (at least one set of indices, if many can realize the same sum).
(in layman terms: pick any K non-neighbouring elements from N values so that their sum is minimal)
Of course, 2*K <= N+1 (otherwise no solution is possible), the problem is insensitive to positive/negative (just shift the array values with the MIN=min(A...) then add back K*MIN to the answer).
What I got so far (the naive approach):
select the K+2 indexes of the values closest to the minimum. I'm not sure about this: for K=2 this seems to be what is required to cover all the particular cases, but I don't know if it is required/sufficient for K>2**
brute force the minimal sum from the values at the indices resulting from the previous step, respecting the non-contiguity criterion. If I'm right and K+2 is enough, I can live with brute-forcing a (K+1)*(K+2) solution space but, as I said, I'm not sure K+2 is enough for K>2 (if in fact 2*K points are necessary, then brute-forcing goes out of the window, since the binomial coefficient C(2*K, K) grows prohibitively fast)
Any clever idea of how this can be done with minimal time/space complexity?
** for K=2, a non-trivial example where 4 values closest to the absolute minimum are necessary to select the objective sum [4,1,0,1,4,3,4] - one cannot use the 0 value for building the minimal sum, as it breaks the non-contiguity criterion.
PS - if you feel like showing code snippets, C/C++ and/or Java will be appreciated, but any language with decent syntax or pseudo-code will do (I reckon "decent syntax" excludes Perl, doesn't it?)
Let's assume input numbers are stored in array a[N]
Generic approach is DP: f(n, k) = min(f(n-1, k), f(n-2, k-1)+a[n])
It takes O(N*K) time and there are 2 options:
a lazy recursive (memoized) solution, using O(N*K) space
a forward loop, using O(K) space
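A minimal Python sketch of the forward-loop variant, keeping only the two previous rows of the DP table. The function name `min_nonadjacent_sum` and the use of `math.inf` to mark "impossible" are choices made for this illustration.

```python
import math

def min_nonadjacent_sum(a, k):
    """Minimal sum of exactly k pairwise non-adjacent elements of a.

    Returns math.inf when no such selection exists.
    """
    # prev2[j] = f(n-2, j) and prev1[j] = f(n-1, j); picking 0 elements costs 0.
    prev2 = [0] + [math.inf] * k
    prev1 = [0] + [math.inf] * k
    for x in a:
        cur = [0] * (k + 1)
        for j in range(1, k + 1):
            # Either skip x, or take x on top of a solution ending two steps back.
            cur[j] = min(prev1[j], prev2[j - 1] + x)
        prev2, prev1 = prev1, cur
    return prev1[k]
```

On the question's example `[4, 1, 0, 1, 4, 3, 4]` with K=2 this gives 2 (the two 1s), since the 0 cannot be combined with either neighbour.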
In special case of big K there is another possibility:
use recursive back-tracking
instead of helper array of N*K size use map(n, map(k, pair(answer, list(answer indexes))))
save answer and list of indexes for this answer
instantly return MAX_INT if k>N/2
This way you'll have lower time than O(NK) for K~=N/2, something like O(Nlog(N)). It will increase up to O(N*log(N)Klog(K)) for small K, so decision between general approach or special case algorithm is important.
There should be a dynamic programming approach to this.
Work along the array from left to right. At each point i, for each value of j from 1..k, find the value of the right answer for picking j non-contiguous elements from 1..i. You can work out the answers at i by looking at the answers at i-1, i-2, and the value of array[i]. The answer you want is the answer at n for an array of length n. After you have done this you should be able to work out what the elements are by back-tracking along the array to work out whether the best decision at each point involves selecting the array element at that point, and therefore whether it used array[i-1][k] or array[i-2][k-1].
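As a rough sketch of that idea in Python, the version below keeps the whole table so that one optimal set of indices can be recovered by back-tracking. The name `min_nonadjacent_pick` and the 0-based indices in the result are assumptions of this example.

```python
import math

def min_nonadjacent_pick(a, k):
    """Return (best_sum, indices) for exactly k non-adjacent picks, or None."""
    n = len(a)
    # best[i][j]: minimal sum using j non-adjacent elements among the first i.
    best = [[0] + [math.inf] * k for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            skip = best[i - 1][j]
            take = best[max(i - 2, 0)][j - 1] + a[i - 1]
            best[i][j] = min(skip, take)
    if best[n][k] == math.inf:
        return None
    # Walk back from (n, k) to recover one optimal set of indices.
    picked, i, j = [], n, k
    while j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1                    # element i-1 was not needed
        else:
            picked.append(i - 1)      # element i-1 was taken
            i, j = max(i - 2, 0), j - 1
    return best[n][k], picked[::-1]
```

This uses O(N*K) space; if only the sum is needed, two rows of the table are enough.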

searching through a vast collection of potential solutions

I have quite a difficult problem (perhaps even an NP-hard problem ^^) with looking for a solution in a massive collection of results. Perhaps there is an algorithm for it.
The exercise below is artificial, but it is a perfect example to illustrate my issue.
There is a big array of integers. Let's say it has 100.000 elements.
int numbers[] = {-123,32,4,-234564,23,5,....}
I want to check in a relatively quick way whether the sum of any 2 numbers from this array is equal to 0. In other words, if the array has "-123", I want to find out whether there is also a "123".
The easiest solution would be brute force: check everything against everything. That gives 100.000 x 100.000, a big number ;-) Obviously the brute force method can be optimised: order the numbers and check negatives against positives only. My question is: is there something better than optimised brute force to find a solution?
First, sort the array by magnitude of the value.
Then, if the data contains a pair which satisfies the conditions you're after, it contains such a pair adjacent in the array. So just sweep through looking for adjacent pairs whose sum is 0.
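A minimal Python sketch of this sort-by-magnitude sweep (the function name is made up for this example; two zeros count as a valid pair here, a single zero does not):

```python
def has_zero_sum_pair_sorted(nums):
    """Return True if two entries of nums sum to zero."""
    # Sorting by absolute value puts x and -x next to each other.
    ordered = sorted(nums, key=abs)
    return any(x + y == 0 for x, y in zip(ordered, ordered[1:]))
```

Within a run of equal magnitudes, any mix of signs must contain an adjacent pair with opposite signs, so checking neighbours is enough.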
Overall time complexity is O(n log n) for the sort, could be O(n) if you use "cheating" sorts not based solely on comparisons. Clearly it can't be done in less than linear time, because in the worst case you can't do it without looking at all the elements. I think n log n is probably optimal in the decision tree model of computing, but only because it "feels a bit like" the element uniqueness problem.
Alternative approach:
Add the elements one at a time to a hash-based or tree-based container. Before adding each element, check whether its negative is present. If so, stop.
This is likely to be faster in the case where there are lots of suitable pairs, because you save the cost of sorting the whole data. That said, you could write a modified sort that exits early by checking for adjacent pairs as soon as any subset of the data is in its final order, but that's effort.
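The hash-based alternative might look like this in Python, using the built-in `set` as the container (the function name is again only illustrative):

```python
def has_zero_sum_pair_hashed(nums):
    """Return True as soon as some element's negative has already been seen."""
    seen = set()
    for x in nums:
        if -x in seen:
            return True   # early exit: no need to look at the rest
        seen.add(x)
    return False
```

Because the check happens before the insertion, a lone 0 does not pair with itself; a second 0 does.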
Brute force would be an O(n^2) solution. You can certainly do better.
Off the top of my head, first sort it. Heap sort will have a complexity of O(nlogn).
Now, for each element, say a, you know you need to find an element b such that a+b = 0. This can be found using binary search (since your array is now sorted). Binary search has a complexity of O(logn).
This gives you an overall solution of O(nlogn) complexity.
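A sketch of the sort-then-binary-search approach in Python, using the standard `bisect` module and the built-in sort in place of heap sort. The function name is hypothetical, and the extra step for `j == i` is there so that a single 0 is not matched with itself.

```python
from bisect import bisect_left

def has_zero_sum_pair_bsearch(nums):
    """Return True if two entries of nums sum to zero."""
    ordered = sorted(nums)
    for i, a in enumerate(ordered):
        j = bisect_left(ordered, -a)   # leftmost position where -a could be
        if j == i:
            # The search landed on the element itself; look one step further.
            j += 1
        if j < len(ordered) and ordered[j] == -a:
            return True
    return False
```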
The example you provided can be brute-force solved in O(n^2) time.
You can start by sorting the numbers (O(n·logn)) from smallest to largest. If you place one pointer at the beginning (the "most negative" number) and another at the end (the "most positive"), you can check whether such a pair of numbers exists in an additional O(n) steps with the following procedure:
If the numbers at both pointers have the same absolute value, you have the solution
If not, move the pointer of the number with the bigger absolute value towards "zero" (that is, increase it if it is the pointer on the negative side, decrease it if it is the positive-side one)
Repeat until you find a solution or the pointers cross.
Total complexity is O(n·logn)+O(n) = O(n·logn).
Sort your array in ascending order, e.g. using Quicksort. Then use two indexes, let's call them negative and positive.
    negative <- 0
    positive <- size - 1
    while ((negative < positive) and (array[negative] < 0) and (array[positive] > 0)) do
        delta <- array[negative] + array[positive]
        if (delta = 0) then
            return true
        else if (delta < 0) then
            negative <- negative + 1
        else
            positive <- positive - 1
        end if
    end while
    return (array[negative] * array[positive] = 0)
You didn't say what the algorithm should do if 0 is part of the array; I've supposed that in this case true should be returned.

First pair of numbers adding to a specific value in a stream

There is a stream of integers coming through. The problem is to find the first pair of numbers from the stream that adds up to a specific value (say, k).
With static arrays, one can use either of the below approaches:
Approach (1): Sort the array, use two pointers to beginning and end of array and compare.
Approach (2): Use hashing, i.e. if A[i]+A[j]=k, then A[j]=k-A[i]. Search for A[j] in the hash table.
But neither of these approaches scales well for streams. Any thoughts on efficiently solving this?
I believe that there is no way to do this that doesn't use at least O(n) memory, where n is the number of elements that appear before the first pair that sums to k. I'm assuming that we are using a RAM machine, but not a machine that permits awful bitwise hackery (in other words, we can't do anything fancy with bit packing.)
The proof sketch is as follows. Suppose that we don't store all of the n elements that appear before the first pair that sums to k. Then when we see the nth element, which sums with some previous value to get k, there is a chance that we will have discarded the previous element that it pairs with and thus won't know that the sum of k has been reached. More formally, suppose that an adversary could watch what values we were storing in memory as we looked at the first n - 1 elements and noted that we didn't store some element x. Then the adversary could set the next element of the stream to be k - x and we would incorrectly report that the sum had not yet been reached, since we wouldn't remember seeing x.
Given that we need to store all the elements we've seen, without knowing more about the numbers in the stream, a very good approach would be to use a hash table that contains all of the elements we've seen so far. Given a good hash table, this would take expected O(n) memory and O(n) time to complete.
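A minimal Python sketch of the hash-table approach, using a `set` for the elements seen so far. The function name is made up, and "first pair" is taken here to mean the pair that is completed earliest in the stream.

```python
def first_pair_with_sum(stream, k):
    """Return the first (earlier, later) pair in stream that sums to k, or None."""
    seen = set()
    for x in stream:
        if k - x in seen:
            return (k - x, x)   # x completes a pair with an earlier element
        seen.add(x)
    return None
```

For instance, on the stream 1, 5, 3, 7, 9 with k = 10 the result is (3, 7), because 7 arrives before 9 does.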
I am not sure whether there is a more clever strategy for solving this problem if you make stronger assumptions about the sorts of numbers in the stream, but I am fairly confident that this is asymptotically ideal in terms of time and space.
Hope this helps!

Finding the repeated element

In an array of integers between 1 and 1,000,000 (or some much larger value), a single value occurs twice. How do you determine which one?
I think we can use a bitmap to mark the elements, and then traverse it all over again to find the repeated element. But I think that is a process with high complexity. Is there any better way?
This sounds like homework or an interview question ... so rather than giving away the answer, here's a hint.
What calculations can you do on a range of integers whose answer you can determine ahead of time?
Once you realize the answer to this, you should be able to figure it out .... if you still can't figure it out ... (and it's not homework) I'll post the solution :)
EDIT: Ok. So here's the elegant solution ... if the list contains ALL of the integers within the range.
We know that all of the values between 1 and N must exist in the list. Using Gauss's formula we can quickly compute the expected sum of a range of integers:
Sum(1..N) = 1/2 * (1 + N) * Count(1..N).
Since we know the expected sum, all we have to do is loop through all the values and sum them. The difference between this sum and the expected sum is the duplicate value.
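In Python this could be sketched as follows, assuming the list holds every integer from 1 to N exactly once plus one extra copy of one of them (so it has N+1 entries). The function name is chosen for this example.

```python
def find_duplicate_by_sum(values, n):
    """values contains 1..n once each, plus one repeated value; return that value."""
    expected = n * (n + 1) // 2     # Gauss: 1 + 2 + ... + n
    return sum(values) - expected
```

For example, `find_duplicate_by_sum([3, 1, 2, 4, 2], 4)` gives 2, using O(1) extra space and a single pass.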
EDIT: As others have commented, the question doesn't state that the range contains all of the integers. In that case, you have to decide whether you want to optimize for memory or time.
If you want to perform the operation using O(1) storage, you can perform an in-place sort of the list. As you're sorting you have to check adjacent elements. Once you see a duplicate, you know you can stop. Optimal sorting is an O(n log n) operation on average, which establishes an upper bound for finding the duplicate in this manner.
If you want to optimize for speed, you can use an additional O(n) storage. Using a HashSet (or similar structure), insert values from your list until you determine you are inserting a duplicate into the HashSet. Inserting n items into a HashSet is an O(n) operation on average, which establishes that as an upper bound for this method.
you may try to use bits as a hash map:
1 at position k means that number k occurred before
0 at position k means that number k did not occur before
pseudocode:
    0. assume that your array is A
    1. initialize a bitarray (there is a nice class in C# for this) of length 1000000, filled with zeros
    2. for each num in A:
           if bitarray[num]
               return num
           else
               bitarray[num] = 1
           end
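The same idea as a Python sketch, packing the bits into a `bytearray`. The function name and the `max_value` parameter are assumptions of this example, and values are assumed to be non-negative integers no larger than `max_value`.

```python
def first_repeat(values, max_value):
    """Return the first value seen twice, or None if all values are distinct."""
    seen = bytearray(max_value // 8 + 1)   # one bit per value in 0..max_value
    for v in values:
        byte, mask = v >> 3, 1 << (v & 7)
        if seen[byte] & mask:
            return v          # bit already set: v occurred before
        seen[byte] |= mask
    return None
```

For values up to 1,000,000 the bitmap takes about 125 KB, regardless of how long the input is.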
The time complexity of the bitmap solution is O(n) and it doesn't seem like you could do better than that. However it will take up a lot of memory for a generic list of numbers. Sorting the numbers is an obvious way to detect duplicates and doesn't require extra space if you don't mind the current order changing.
Assuming the array is of length n < N (i.e. not ALL integers are present -- in this case LBushkin's trick is the answer to this homework problem), there is no way to solve this problem using less than O(n) memory using an algorithm that just takes a single pass through the array. This is by reduction to the set disjointness problem.
Suppose I made the problem easier, and I promised you that the duplicate elements were in the array such that the first one was in the first n/2 elements, and the second one was in the last n/2 elements. Now we can think of playing a game in which two people each hold a string of n/2 elements, and want to know how many messages they have to send to be sure that none of their elements are the same. Since the first player could simulate the run of any algorithm that takes a pass through the array, and send the contents of its memory to the second player, a lower bound on the number of messages they need to send implies a lower bound on the memory requirements of any algorithm.
But it's easy to see in this simple game that they need to send n/2 messages to be sure that they don't hold any of the same elements, which yields the lower bound.
Edit: This generalizes to show that for algorithms that make k passes through the array and use memory m, that m*k = Omega(n). And it is easy to see that you can in fact trade off memory for time in this way.
Of course, if you are willing to use algorithms that don't simply take passes through the array, you can do better as suggested already: sort the array, then take 1 pass through. This takes time O(nlogn) and space O(1). But note curiously that this proves that any sorting algorithm that just makes passes through the array must take time Omega(n^2)! Sorting algorithms that break the n^2 bound must make random accesses.

Resources