Algorithm to find the three majority elements in an array

Let's say there are three elements in an unsorted array, each of which appears more than n/4 times, where n is the total number of elements.
What is the most efficient way to find these elements, for both the non-online and online versions of this question?
Thank you!
Edit
The non-online version I was referring to is: the array is specified in full. In the online version the array elements arrive one at a time.
I require the space bound, in addition to the time complexity, to be tight.
Disclaimer: THIS IS NOT HOMEWORK! I consider it a research-level question.

Remember up to three elements, together with counters:
1. Remember the first element, set count1 = 1.
2. Scan until you find the first different element, incrementing count1 for each occurrence of element 1.
3. Remember the second element, set count2 = 1.
4. Scan until you find the first element different from elem1 and elem2, incrementing count1 or count2, depending on which element you see.
5. Remember the third element, set count3 = 1.
6. Continue scanning. If the element is one of the three remembered ones, increment its count; if it is none of them, decrement all three counts. If a count drops to 0, forget that element and go back to step 1, 3, or 5, depending on how many elements you have forgotten.
If you have three elements each occurring strictly more than one-fourth of the number of elements in the array, you will end up with three remembered elements, each with a positive count; these are the three majority elements.
Small constant additional space, O(n) time, no sorting.
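A minimal Python sketch of this counter scheme (essentially a Misra–Gries summary with three slots); it works online, since elements can be consumed one at a time:

def three_majority_candidates(stream):
    candidates = {}                     # element -> counter, at most 3 entries
    for x in stream:
        if x in candidates:
            candidates[x] += 1
        elif len(candidates) < 3:
            candidates[x] = 1           # a free slot: remember a new element
        else:
            # none of the remembered ones: decrement all three counters
            for c in list(candidates):
                candidates[c] -= 1
                if candidates[c] == 0:
                    del candidates[c]   # forget elements whose count drops to 0
    return list(candidates)

When it is not guaranteed in advance that three such majority elements exist, a second pass over the array can verify the counts of the surviving candidates.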

Create a histogram of the entries, sort it, and take the three largest entries.
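For the offline case this is only a couple of lines in Python; Counter.most_common does the "sort and take the three largest" step:

from collections import Counter

def three_largest(arr):
    # Histogram the entries, then take the three with the highest counts.
    return Counter(arr).most_common(3)   # list of (value, count) pairs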

Related

Find element of an array that appears only once in O(logn) time

Given an array A in which every element appears twice except one element, which appears only once. How do we find the element that appears only once in O(log n) time? Let's discuss two cases.
The array is sorted and the elements are in sequential order. Let's assume A = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 6]; we want to find 3 in O(log n) time because it appears only once.
The array is not sorted and the elements are not in sequential order.
The only solution I can come up with is using the XOR operator on the binary representations of the integers, as explained here: at the end, the result represents the element that appears only once, because the duplicates cancel out. But it takes O(n) time. How can we do better than that?
Using Haroon S' comment, this is the solution which I think is correct, given the time constraints.
from typing import List

class Solution:
    def singleNonDuplicate(self, nums: List[int]) -> int:
        low = 0
        high = len(nums) - 1
        while low < high:
            mid = (low + high) // 2
            if mid % 2 == 0:
                mid += 1          # make mid odd so it can be compared with its pair
            if nums[mid] == nums[mid + 1]:
                # pairing is already broken here: answer is in the first half
                high = mid - 1
            elif nums[mid] == nums[mid - 1]:
                # pairs are intact up to mid: answer is in the second half
                low = mid + 1
        return nums[low]
If the elements are sorted (i.e., the first case you mentioned) then I believe a strategy not unlike binary search could work in O(logN) time.
Starting from the left endpoint of the sorted array, until we encounter the unique element, every index pair (2i, 2i+1) we encounter along the way holds the same value (due to the array being sorted). However, as we go towards the right endpoint, as soon as we consider a subarray that includes the unique element, that structure of "same values within (2i, 2i+1) index pairs" breaks down.
Using that information, a search algorithm similar to binary search can find out in which half of the array the unique element lies. Basically, you can deduce that "if the values in the rightmost index pair (2i, 2i+1) of the left half are the same, then the unique value is in the right half" (with the exception of the last index of the left half-array being even; but you can overcome that case with various O(1) time operations).
The overall complexity then becomes O(logN), due to the halving of the array size at each step.
For a demonstration of the index notion mentioned above, see your own example. To the left of the unique element (i.e., 3), all index pairs (2i, 2i+1) hold the same value. In any subarray that starts at index 0 and ends to the right of the unique element, the index pairs (2i, 2i+1) at or beyond the unique element correspond to cells that contain different values.
Unless the array is sorted, though, since you'd have to investigate each and every element, I believe any algorithm you may come up with would take at least O(n) time. This is what I think will happen in the second case you mention in your question.
In the general case this is impossible, as to make sure an element doesn't repeat you need to check every other element.
From your example, it seems the array might be a sorted sequence of integers with no "gaps" (or some other clearly defined sequence, like all even numbers, etc). In this case it is possible with a modified binary search.
You have the array [1,1,2,2,3,4,4,5,5,6,6].
You check the middle element and the element following it and see 3 and 4. Now you know there are only 5 elements from the set {1, 2, 3}, while there are 6 elements from the set {4, 5, 6}. This means the unpaired element is in {1, 2, 3}.
Then you recurse on [1,1,2,2,3]. You see 2,2. Now you know there are 2 "1" elements and 1 "3" element, so 3 is the answer.
The reason you check 2 elements in each step is that if you see just "3", you don't know whether you hit the first 3 in "3,3" or the second one. But if you read 2 elements you always find a "boundary" between 2 different elements.
The condition for this to be viable is that, given the value of an element, you need to be able to calculate in O(1) how many distinct elements come before it. In your case this is trivial, but it is also possible for any arithmetic series, geometric series (with fixed-size numbers), and so on.
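A minimal Python sketch of this boundary idea, assuming the values form a gapless sequence first, first+1, ... in which every value appears twice except one (the function name and the `first` parameter are just for illustration):

def single_in_gapless(nums, first=1):
    # Read two adjacent elements per step; compare how many slots lie to the
    # left with how many there should be if everything below were paired.
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        a, b = nums[mid], nums[mid + 1]
        if a == b:
            # complete pair at (mid, mid+1): values < a fill indices 0..mid-1
            left_count, expected = mid, 2 * (a - first)
            if left_count < expected:
                hi = mid - 1          # the unpaired value is smaller than a
            else:
                lo = mid + 2          # everything up to this pair is intact
        else:
            # boundary between a and b: values <= a fill indices 0..mid
            left_count, expected = mid + 1, 2 * (a - first + 1)
            if left_count < expected:
                hi = mid              # the unpaired value is <= a
            else:
                lo = mid + 1          # the left side is fully paired
    return nums[lo]

For the example above, single_in_gapless([1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 6]) returns 3.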
This is not an O(log n) solution. I have no idea how to solve it in logarithmic time without the constraints that the array is sorted and that we have a known difference between consecutive numbers, so we can recognise when we are to the left or right of the singleton. The other solutions already deal with that special case and I couldn't do better there either.
I have a suggestion that might solve the general case in O(n), rather than O(n log n) when you first sort the array. It’s not as fast as the xor solution, but it will also work for non-integers. The elements must have an order, so it is not completely general, but it will work anywhere you can sort the elements.
The idea is the same as the k'th order statistic algorithm based on Quicksort (Quickselect). You partition and recurse on one half of the array. The time recurrence is T(n) = T(n/2) + O(n) = O(n).
Given array x and indices i,j representing the sub-array x[i:j], partition with Quicksort's partitioning method. You want a variant that partitions x[i:j] into three segments, x[i:k], x[k:l], x[l:j], where all elements in the first segment are smaller than the pivot (whatever it is), all elements in x[k:l] are equal to the pivot, and all elements in the last segment are greater than the pivot.
(You might be able to use a version that only partitions in two, or explicitly count the number of copies of the pivot, but this three-way version is easier to work with here.)
Now, if the middle segment has length one, you have your singleton. It is the pivot.
If not, the length of the segment (first or last) that contains the singleton is odd, while the other is even. So recurse on the segment with the odd length.
It doesn’t give you worst case linear time, for the same reason that Quicksort isn’t worst case log-linear, but you get an expected linear time algorithm and likely a fast one at that.
Not, of course, as fast as those solutions based on binary search, but here the elements do not need to be sorted and we can handle elements with arbitrary gaps between them. We are also not restricted to data where we can easily manipulate their bit-patterns. So it is more general. If you can compare the elements, this approach will find the singleton in O(n).
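Here is a short Python sketch of this partition-and-recurse idea (the three-way partition is done with list comprehensions rather than in place, to keep the sketch small; the function name is just for illustration):

import random

def find_singleton(x):
    # Expected O(n): three-way partition around a random pivot, then continue
    # on the side with odd length.
    lo, hi = 0, len(x)                 # current segment is x[lo:hi]
    while hi - lo > 1:
        pivot = x[random.randrange(lo, hi)]
        less = [v for v in x[lo:hi] if v < pivot]
        equal = [v for v in x[lo:hi] if v == pivot]
        greater = [v for v in x[lo:hi] if v > pivot]
        if len(equal) == 1:
            return pivot               # the middle segment has length one
        x[lo:hi] = less + equal + greater
        if len(less) % 2 == 1:         # the odd-length segment holds the singleton
            hi = lo + len(less)
        else:
            lo = lo + len(less) + len(equal)
    return x[lo]

For example, find_singleton([4, 1, 4, 2, 2]) returns 1, whatever pivots are chosen.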
This solution finds the element in the array that appears only once, provided there is not more than one element of that type and the array is sorted. It is a binary search and returns the element in O(log n) time.
var singleNonDuplicate = function(nums) {
    let s = 0, e = nums.length - 1;
    while (s < e) {
        let mid = Math.trunc(s + (e - s) / 2);
        if ((mid % 2 == 0 && nums[mid] == nums[mid + 1]) ||
            (mid % 2 == 1 && nums[mid] == nums[mid - 1])) {
            s = mid + 1;
        } else {
            e = mid;
        }
    }
    return nums[s]; // can return nums[e] also
};
I don't believe there is an O(log n) solution for that. The reason is that, in order to find which element appears only once, you need to iterate over the elements of the array at least once.

Divide the list into 2 equal Parts

I have a list which contains random numbers such that each number >= 0. Now I have to divide the list into 2 equal parts (assume the list contains an even number of elements) such that all the numbers in the first list are less than or equal to the numbers in the second list. This can easily be done by any sorting mechanism in O(n log n). But I don't need the data in the two equal-length lists to be sorted; the only condition is that all elements in the first list <= all elements in the second list.
So is there a way or a hack to reduce the complexity, since we don't require sorted data here?
If the problem is actually solvable (the data is right), you can find the median using the selection algorithm. When you have that, you just create 2 equally sized arrays and iterate over the original list element by element, putting each element into one of the new lists depending on whether it is bigger or smaller than the median. This should run in linear time.
Edit: as gen-y-s pointed out, if you write the selection algorithm yourself or use a proper library, it might already partition the input list, so there is no need for the second pass.
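A rough Python sketch of this approach; kth_smallest is a stand-in for whatever selection algorithm you use (a simple Quickselect here), and elements equal to the median are split between the two halves so both ends up the same length:

import random

def kth_smallest(a, k):
    # Expected-linear-time selection (Quickselect), k is 0-based.
    a = list(a)
    while True:
        pivot = random.choice(a)
        less = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        greater = [x for x in a if x > pivot]
        if k < len(less):
            a = less
        elif k < len(less) + len(equal):
            return pivot
        else:
            k -= len(less) + len(equal)
            a = greater

def split_in_half(nums):
    # Assumes len(nums) is even, as in the question.
    n = len(nums)
    median = kth_smallest(nums, n // 2 - 1)   # largest element of the lower half
    lower = [x for x in nums if x < median]
    upper = [x for x in nums if x > median]
    # Elements equal to the median fill whatever room is left on each side.
    lower += [median] * (n // 2 - len(lower))
    upper += [median] * (n // 2 - len(upper))
    return lower, upper

As the edit above notes, a partition-based selection (for example C++'s std::nth_element or numpy.partition) already leaves the data split around the median, so the second pass can be skipped.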

Sort a given array based on parent array using only swap function

It is a coding interview question. We are given an array, say random_arr, and we need to sort it using only the swap function.
Also, the number of swaps for each element in random_arr is limited. For this you are given an array parent_arr, containing the number of swaps allowed for each element of random_arr.
Constraints:
You should use the swap function.
Every element may repeat a minimum of 5 times and a maximum of 26 times.
You cannot set elements of the given array to 0.
You should not write helper functions.
Now I will explain how parent_arr is declared. If parent_arr is like:
parent_arr[] = {a,b,c,d,...,z} then
a can be swapped at most one time.
b can be swapped at most two times.
If parent_arr[] = {c,b,a,...,z} then
c can be swapped at most one time.
b can be swapped at most two times.
a can be swapped at most three times.
My solution:
For each element in random_arr[], store how many elements would be below it if the array were sorted. Now select the element with the minimum swap count from parent_arr[] and check whether it exists in random_arr[]. If yes, and if it occurs more than once, it will have more than one location where it can be placed. Now choose the position (rather, the element at that position, precisely) with the maximum swap count and swap it. Then decrease the swap count for that element, re-sort parent_arr[], and repeat the process.
But it is quite inefficient, and I can't prove its correctness. Any ideas?
First, let's simplify your algorithm; then let's informally prove its correctness.
Modified algorithm
Observe that once you have computed the number of elements below each number in the sorted sequence, you have enough information to determine, for each group of equal elements x, their places in the sorted array. For example, if c is repeated 7 times and has 21 elements ahead of it, then the cs will occupy the range [21..27] (all indexes are zero-based; the range is inclusive of its ends).
Go through the parent_arr in order of increasing number of swaps. For each element x, find the beginning of its target range rb; also note the end of its target range re. Now go through the elements of random_arr outside of the [rb..re] range. If you see x, swap it into the range. After swapping, increment rb. If you see that random_arr[rb] is equal to x, continue incrementing: these xs are already in the right spot, so there is no need to swap them.
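A Python sketch of the modified algorithm, assuming random_arr holds the values to sort and parent_arr lists the distinct values in order of increasing allowed swaps (function and variable names are illustrative):

def sort_with_limited_swaps(random_arr, parent_arr):
    arr = random_arr                      # sorted in place, using only swaps
    # How many copies of each value there are, and where its target range starts.
    counts = {}
    for v in arr:
        counts[v] = counts.get(v, 0) + 1
    start, acc = {}, 0
    for v in sorted(counts):
        start[v] = acc
        acc += counts[v]
    for x in parent_arr:                  # order of increasing allowed swaps
        if x not in counts:
            continue
        lo, hi = start[x], start[x] + counts[x] - 1   # target range of x
        rb = lo
        while rb <= hi and arr[rb] == x:  # copies of x already in place
            rb += 1
        for i in range(len(arr)):
            if lo <= i <= hi:
                continue                  # only look outside the target range
            if rb > hi:
                break                     # every copy of x is in place
            if arr[i] == x:
                arr[i], arr[rb] = arr[rb], arr[i]     # the only swap we use
                rb += 1
                while rb <= hi and arr[rb] == x:
                    rb += 1
    return arr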
Informal proof of correctness
Now let's prove the correctness of the above. Observe that once an element is swapped into its place, it is never moved again. When you reach an element x in the parent_arr, all elements with a lower number of swaps have already been processed. By construction of the algorithm this means that these elements are already in place. Suppose that x has k allowed swaps. When you swap it into its place, you move another element out.
This replaced element cannot be x, because the algorithm skips xs when looking for a destination in the target range [rb..re]. Moreover, the replaced element cannot be one of elements below x in the parent_arr, because all elements below x are in their places already, and therefore cannot move. This means that the swap count of the replaced element is necessarily k+1 or more. Since by the time that we finish processing x we have exhausted at most k swaps on any element (which is easy to prove by induction), any element that we swap out to make room for x will have at least one remaining swap that would allow us to swap it in place when we get to it in the order dictated by the parent_arr.

Find the middle of unknown size list

In a recent interview I was asked:
Find the middle element of a sorted List of unknown length starting from the first position.
I responded with this:
Have 2 position counters:
counter1
counter2
Increment counter1 by 1 and counter2 by 2. When counter2 reaches the end of the list, counter1 will be in the middle. I feel that this isn't efficient because I am revisiting nodes I have already seen. Either way, is there a more efficient algorithm?
Assuming a linked list, you can do it while visiting arbitrarily close to N of the items.
To do 5/4 N:
Iterate over the list until you hit the end, counting the items.
Drop an anchor at every power-of-2'th element. Track the last 2 anchors.
Advance the before-last anchor until it reaches the middle of the list.
When you hit the end of the list, the before-last anchor is before the mid-point but at least half way there already. So N for the full iteration + at most 1/4 N for the anchor = 5/4 N.
Dropping anchors more frequently, such as at every ceiling-power-of-1.5'th item, gets you as close to N as needed (at the cost of tracking more anchors; but for any given power step X, the asymptotic memory is constant).
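A small Python sketch of the anchor idea, assuming a hypothetical node type with .next and .data and a non-empty list:

def middle_by_anchors(head):
    n = 0
    next_power = 1
    prev_anchor, prev_idx = head, 0       # before-last anchor
    last_anchor, last_idx = head, 0       # last anchor
    node = head
    while node is not None:
        if n == next_power:               # drop an anchor at every power of 2
            prev_anchor, prev_idx = last_anchor, last_idx
            last_anchor, last_idx = node, n
            next_power *= 2
        node = node.next
        n += 1
    # The before-last anchor is at or before the midpoint; walk it forward.
    mid = n // 2
    node, idx = prev_anchor, prev_idx
    while idx < mid:
        node = node.next
        idx += 1
    return node.data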
I assume you're discussing a linked-list. Indeed your solution is an excellent one. An alternative would be to simply traverse the list counting the number of elements, and then start over from the beginning, traversing half the counted amount. Both methods end up traversing 3n/2 nodes, so there isn't much difference.
There may be slight cache advantages to either method depending on the architecture; the first method might benefit from cached nodes, which could mean quicker retrieval if the cache is large enough and the two pointers are not too far apart. Alternatively, the cache might get better blocks if we traverse the list in one go, rather than keeping two pointers alive.
Presuming you can detect that you're beyond the end of the list and can seek to an arbitrary position efficiently, you can keep doubling the presumed length of the list (guess the length is 1, then 2, then 4, ...) until you're past the end of the list, then use a binary search between the last attempted value that was within the list and the first value that exceeded the end of the list to find the actual end of the list.
Then, seek the position END_OF_LIST / 2.
That way, you do not have to visit every node.
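A Python sketch of this doubling-then-binary-search idea, where get(i) is a hypothetical accessor that seeks position i and returns the element there, or None once i is past the end of the list:

def middle_by_doubling(get):
    if get(0) is None:
        return None                       # empty list
    hi = 1
    while get(hi) is not None:            # keep doubling the guessed length
        hi *= 2
    lo = hi // 2                          # last position known to hold an element
    while lo + 1 < hi:                    # binary search for the last valid index
        mid = (lo + hi) // 2
        if get(mid) is None:
            hi = mid
        else:
            lo = mid
    return get(lo // 2)                   # seek END_OF_LIST / 2 directly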
Technically, you could do this in one traversal if you used O(N) memory (assuming you're using a linked list).
Traverse the list and convert it into an array (of pointers if this is a linked list).
Return arr[N/2]
Edit: I do love your answer though!
Assuming a regular in-memory linked list that allows reading the next element given a reference to the current one, multiple times (disclaimer: not tested, but the idea should work):
// assume non-empty list
slow = fast = first;
count = 1;                  // number of nodes seen by fast so far
while (fast.next)
{
    fast = fast.next;
    count++;
    if (!fast.next)
        break;
    fast = fast.next;
    count++;
    slow = slow.next;
}
if (count % 2)
    return slow.data;                            // odd length: slow is the middle node
else
    return (slow.data + slow.next.data) / 2.0;   // even length: average the two middle nodes
A more difficult case is when the "list" is not a linked list in memory, but rather a stream that you can read in sorted order and want to read each element only once; for that I do not have a nice solution.

top-k selection/merge

I have n sorted lists (5 < n < 300). These lists are quite long (300000+ tuples). Selecting the top k of the individual lists is of course trivial - they are right at the head of the lists.
Example for k = 2:
top2 (L1: [ 'a': 10, 'b': 4, 'c':3 ]) = ['a':10 'b':4]
top2 (L2: [ 'c': 5, 'b': 2, 'a':0 ]) = ['c':5 'b':2]
Where it gets more interesting is when I want the combined top k across all the sorted lists.
top2(L1+L2) = ['a':10, 'c':8]
Just combining the top k of the individual lists would not necessarily give the correct result:
top2(top2(L1)+top2(L2)) = ['a':10, 'b':6]
The goal is to reduce the required space and keep the sorted lists small.
top2(topX(L1)+topX(L2)) = ['a':10, 'c':8]
The question is whether there is an algorithm to calculate the combined top k with the correct order while cutting off the long tail of the lists at a certain position. And if there is: how does one find the limit X where it is safe to cut?
Note: Correct counts are not important. Only the order is.
top2(magic([L1,L2])) = ['a', 'c']
This algorithm uses O(U) memory, where U is the number of unique keys. I doubt a lower memory bound can be achieved, because it is impossible to tell which keys can be discarded until all the keys have been summed.
Make a master list of (key:total_count) tuples. Simply run through each list one item at a time, keeping a tally of how many times each key has been seen.
Use any top-k selection algorithm on the master list that does not use additional memory. One simple solution is to sort the list in place.
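A compact Python sketch of this master-list approach, assuming each input is a sequence of (key, count) tuples; here heapq.nlargest stands in for the top-k selection step, so it keeps O(k) memory on top of the O(U) tally:

from collections import Counter
import heapq

def top_k_master_list(lists, k):
    totals = Counter()
    for lst in lists:                 # step 1: tally every key across all lists
        for key, count in lst:
            totals[key] += count
    # step 2: top-k selection on the master list
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

With the question's lists, top_k_master_list([L1, L2], 2) yields [('a', 10), ('c', 8)].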
If I understand your question correctly, the correct output is the top 10 items, irrespective of the list from which each came. If that's correct, then starting with the first 10 items in each list will allow you to generate the correct output (if you only want unique items in the output, but the inputs might contain duplicates, then you need 10 unique items in each list).
In the most extreme case, all the top items come from one list, and all items from the other lists are ignored. In this case, having 10 items in the one list will be sufficient to produce the correct result.
1. Associate an index with each of your n lists. Set it to point to the first element in each case.
2. Create a list-of-lists, and sort it by the indexed elements.
3. The indexed item on the top list in your list-of-lists is your first element.
4. Increment the index for the topmost list, remove that list from the list-of-lists, and re-insert it based on the new value of its indexed element.
5. The indexed item on the top list in your list-of-lists is your next element.
6. Go to step 4 and repeat until done.
You didn't specify how many lists you have. If n is small, then step 4 can be done very simply (just re-sort the lists). As n grows you may want to think about more efficient ways to re-sort an almost-sorted list-of-lists.
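A Python sketch of this procedure, using a heap keyed by each list's current head instead of re-sorting the list-of-lists at every step; key is whatever the input lists are sorted by (for the question's lists of (name, count) tuples sorted by descending count, key=lambda t: -t[1] would do):

import heapq

def merge_sorted_lists(lists, key=lambda item: item):
    # Each heap entry is (sort key of current head, list number, index).
    heap = [(key(lst[0]), i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    while heap:
        _, i, idx = heapq.heappop(heap)   # the indexed item of the top list
        yield lists[i][idx]
        idx += 1                          # advance that list's index...
        if idx < len(lists[i]):           # ...and re-insert it if not exhausted
            heapq.heappush(heap, (key(lists[i][idx]), i, idx))

The standard library's heapq.merge does essentially the same thing.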
I did not understand that if an 'a' appears in two lists, their counts must be combined. Here is a new memory-efficient algorithm:
(New) Algorithm:
1. (Re-)sort each list by ID (not by count). To release memory, the list can be written back to disk. Only enough memory for the longest list is required.
2. Get the next lowest unprocessed ID and find the total count across all lists.
3. Insert the ID into a priority queue of k nodes. Use the total count as the node's priority (not the ID). This priority queue drops the lowest node if more than k nodes are inserted.
4. Go to step 2 until all IDs have been exhausted.
Analysis: This algorithm can be implemented using only O(k) additional memory to store the min-heap. It makes several trade-offs to accomplish this:
The lists are sorted by ID in place; the original orderings by count are lost. Otherwise O(U) additional memory would be required to make a master list of ID: total_count tuples, where U is the number of unique IDs.
The next lowest ID is found in O(n) time by checking the first tuple of each list. This is repeated U times where U is the number of unique ID's. This might be improved by using a min-heap to track the next lowest ID. This would require O(n) additional memory (and may not be faster in all cases).
Note: This algorithm assumes ID's can be quickly compared. String comparisons are not trivial. I suggest hashing string ID's to integers. They do not have to be unique hashes, but collisions must be checked so all ID's are properly sorted/compared. Of course, this would add to the memory/time complexity.
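A Python sketch of this algorithm, assuming each input is a list of (ID, count) tuples; re-sorting is done with sorted() here rather than in place on disk, and the size-k min-heap keeps only the k best totals:

import heapq

def top_k_by_total(lists, k):
    by_id = [sorted(lst) for lst in lists]   # step 1: (re-)sort each list by ID
    pos = [0] * len(by_id)                   # current position in each list
    heap = []                                # (total_count, ID), at most k entries
    while True:
        # step 2: next lowest unprocessed ID across all lists
        heads = [lst[p][0] for lst, p in zip(by_id, pos) if p < len(lst)]
        if not heads:
            break                            # step 4: all IDs exhausted
        cur = min(heads)
        total = 0
        for i, lst in enumerate(by_id):
            while pos[i] < len(lst) and lst[pos[i]][0] == cur:
                total += lst[pos[i]][1]
                pos[i] += 1
        # step 3: insert into the priority queue, dropping the lowest node
        heapq.heappush(heap, (total, cur))
        if len(heap) > k:
            heapq.heappop(heap)
    return sorted(heap, reverse=True)        # (total_count, ID) pairs, best first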
The perfect solution requires all tuples to be inspected at least once.
However, it is possible to get close to the perfect solution without inspecting every tuple. Discarding the "long tail" introduces a margin of error. You can use some type of heuristic to calculate when the margin of error is acceptable.
For example, if there are n=100 sorted lists and you have inspected down each list until the count is 2, the most the total count for a key could increase by is 200.
I suggest taking an iterative approach:
1. Tally each list until a certain lower count threshold L is reached.
2. Lower L to include more tuples.
3. Add the new tuples to the counts tallied so far.
4. Go to step 2 until lowering L does not change the top k counts by more than a certain percentage.
This algorithm assumes the counts for the top k keys will approach a certain value the further the long tail is traversed. You can use other heuristics instead of the fixed percentage, such as the number of new keys in the top k, how much the top k keys were shuffled, etc.
There is a sane way to implement this through mapreduce:
http://www.yourdailygeekery.com/2011/05/16/top-k-with-mapreduce.html
In general, I think you are in trouble. Imagine the following lists:
['a':100, 'b':99, ...]
['c':90, 'd':89, ..., 'b':2]
and you have k=1 (i.e. you want only the top one). 'b' is the right answer, but you need to look all the way down to the end of the second list to realize that 'b' beats 'a'.
Edit:
If you have the right distribution (long, low-count tails), you might be able to do better. Let's stick with k=1 for now to make our lives easier.
The basic algorithm is to keep a hash map of the keys you've seen so far and their associated totals. Walk down the lists processing elements and updating your map.
The key observation is that a key can gain in count by at most the sum of the counts at the current processing point of each list (call that sum S). So on each step, you can prune from your hash map any keys whose total is more than S below your current maximum count element. (I'm not sure what data structure you would need to prune as you need to look up keys given a range of counts - maybe a priority queue?)
When your hash map has only one element in it, and its count is at least S, then you can stop processing the lists and return that element as the answer. If your count distribution plays nice, this early exit may actually trigger so you don't have to process all of the lists.

Resources