Find the N-th most frequent number in the array - algorithm

Find the N-th most frequent number in an array.
(There is no limit on the range of the numbers.)
I think we can:
(i) store the occurrence count of every element using a map (e.g. in C++),
(ii) build a max-heap of the occurrence counts (frequencies) in linear time and then extract up to the N-th element; each extraction takes O(log n) time to re-heapify,
(iii) which gives us the frequency of the N-th most frequent number,
(iv) and then we can do a linear search through the map to find an element having this frequency.
Time - O(NlogN)
Space - O(N)
Is there any better method ?

It can be done in linear time and space. Let T be the total number of elements in the input array from which we have to find the Nth most frequent number:
Count and store the frequency of every number from the input array in a map. Let M be the total number of distinct elements in the array, so the size of the map is M. -- O(T)
Find the Nth largest frequency in the map using a selection algorithm (e.g. quickselect). -- O(M)
Total time = O(T) + O(M) = O(T)
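A hedged Java sketch of that selection step (helper names are mine); the frequency map is assumed to have been built already, and a final linear scan of the map recovers the element, as in step (iv) of the question:

import java.util.*;

// randomized quickselect: returns the n-th largest value in vals, expected O(M) time
static int nthLargest(int[] vals, int n) {
    int lo = 0, hi = vals.length - 1, target = n - 1;    // index in descending order
    Random rnd = new Random();
    while (true) {
        int pivot = vals[lo + rnd.nextInt(hi - lo + 1)];
        int i = lo, j = hi;
        while (i <= j) {                                 // Hoare-style partition, descending
            while (vals[i] > pivot) i++;
            while (vals[j] < pivot) j--;
            if (i <= j) { int t = vals[i]; vals[i] = vals[j]; vals[j] = t; i++; j--; }
        }
        if (target <= j) hi = j;
        else if (target >= i) lo = i;
        else return vals[target];
    }
}

// usage (freq is the element -> count map built in the first step):
// int[] counts = freq.values().stream().mapToInt(Integer::intValue).toArray();
// int nthCount = nthLargest(counts, n);  // then scan freq once to find a key with this count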

Your method is basically right. You can avoid the final hash search if you mark each vertex of the constructed heap with the number it represents. Moreover, it is possible to keep watch on the N-th element of the heap as you are building it, because at some point you can get to a situation where the outcome cannot change anymore and the rest of the computation can be dropped. But this would probably not make the algorithm faster in the general case, and maybe not even in special cases. So you answered your own question correctly.
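A rough Java sketch of the asker's approach with that tweak (each heap entry carries the element together with its count, so no final search is needed); method names are illustrative:

import java.util.*;

static int nthMostFrequent(int[] a, int n) {
    // (i) count occurrences of every element
    Map<Integer, Integer> freq = new HashMap<>();
    for (int x : a) freq.merge(x, 1, Integer::sum);
    // (ii) max-heap ordered by count; each entry keeps the element it represents
    // (note: addAll inserts one by one, O(M log M), rather than a true linear heapify,
    //  which does not change the overall O(N log N) bound)
    PriorityQueue<Map.Entry<Integer, Integer>> heap =
            new PriorityQueue<>((p, q) -> Integer.compare(q.getValue(), p.getValue()));
    heap.addAll(freq.entrySet());
    // (iii)/(iv) pop N-1 entries; the next one is the N-th most frequent element
    // (assumes 1 <= n <= number of distinct elements)
    for (int i = 1; i < n; i++) heap.poll();
    return heap.poll().getKey();
}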

It depends on whether you want the most efficient method or the most easy-to-write one.
1) If you know that all numbers are from 0 to 1000, you just make an array of 1000 zeros (occurrence counts), loop through your array and increment the right occurrence position. Then you sort these counts and select the Nth value (see the sketch after this answer).
2) You have a "bag" of unique items. You loop through your numbers and check whether each number is already in the bag; if not, you add it, and if it is, you just increment its occurrence count. Then you pick the entry with the Nth highest count.
The bag can be a linear array, a BST or a dictionary (hash table).
The question is "N-th most frequent", so I think you cannot avoid sorting (or a clever data structure), so the best complexity cannot be better than O(n*log(n)).
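A minimal sketch of method 1 in Java, assuming all values are known to lie in the range 0..1000 and that n is at most the number of distinct values:

import java.util.Arrays;

static int nthMostFrequentBounded(int[] a, int n) {
    int[] count = new int[1001];              // one slot per possible value 0..1000
    for (int x : a) count[x]++;
    int[] sorted = count.clone();
    Arrays.sort(sorted);                      // ascending
    int nthCount = sorted[sorted.length - n]; // N-th highest occurrence count
    for (int v = 0; v <= 1000; v++)
        if (count[v] == nthCount) return v;   // first value with that count
    return -1;                                // not reached under the stated assumptions
}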

Just wrote a method in Java 8. This is not an efficient solution:
Create a frequency map of the elements.
Sort the map entries by value in reverse order.
Skip the first (N-1) entries, then take the first remaining entry's key.
private static Integer findNthMostFrequentElement(int[] inputs, int n) {
    return Arrays.stream(inputs).boxed()
            // frequency map: element -> number of occurrences
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet().stream()
            // sort by count (highest first), skip the first n-1 entries, take the next
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .skip(n - 1).findFirst()
            .get().getKey();
}
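For example, on new int[]{2, 2, 2, 5, 5, 7} with n = 2 this returns 5. Note that ties between equal counts are broken in an unspecified order, and if n exceeds the number of distinct elements the final get() throws NoSuchElementException.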

Related

Finding tuple with maximum difference between its minimum and maximum first element

Given an array of elements of the form (a, b), where a is an integer and b is a string.
The array is sorted by a, the first element. We have to find the string b
which has the maximum difference between its lowest a and its highest a.
My thoughts:
A simple approach is to hash each string into a hash table, ensuring that no two different strings
map to the same bucket. For any bucket, i.e. for a string b, we need to store only
two elements: the maximum a encountered so far and the minimum a encountered
so far. Once the hash table is populated, we simply iterate over all strings and
find the one with the maximum difference.
This would run in O(N) time.
But the only questionable assumption here is that different strings go into different buckets.
That cannot be guaranteed by any implementation of a hash table while keeping the
average time complexity of insert, search and delete at Theta(1).
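A rough Java sketch of this idea, letting a HashMap resolve collisions within buckets (so the running time is expected, not worst-case, O(N)); the (a, b) pairs are assumed to be given as two parallel arrays and the names are illustrative:

import java.util.*;

static String maxSpreadString(int[] a, String[] b) {
    // for each string keep {lowest a, highest a} seen so far
    Map<String, int[]> range = new HashMap<>();
    for (int i = 0; i < a.length; i++) {
        final int ai = a[i];
        int[] r = range.computeIfAbsent(b[i], k -> new int[]{ai, ai});
        r[0] = Math.min(r[0], ai);
        r[1] = Math.max(r[1], ai);
    }
    // pick the string with the largest (highest - lowest)
    String best = null;
    int bestDiff = -1;
    for (Map.Entry<String, int[]> e : range.entrySet()) {
        int diff = e.getValue()[1] - e.getValue()[0];
        if (diff > bestDiff) { bestDiff = diff; best = e.getKey(); }
    }
    return best;
}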

Given O(n) sets, what is complexity of figuring out distinct ones amongst them?

I have an application where I have a list of O(n) sets.
Each set Set(i) is an n-vector. Suppose n=4, for instance,
Set(1) could be [0|1|1|0]
Set(2) could be [1|1|1|0]
Set(3) could be [1|1|0|0]
Set(4) could be [1|1|1|0]
I'd like to process these sets so that as output, I only get the unique ones amongst them. So, in the example above, I would get as output:
Set(1), Set(2), Set(3). Note that Set(4) is discarded since it is same as Set(2).
A rather brute force way of figuring this gives me a worst-case bound of O(n^3):
Given: input list of size O(n)
Output list L = [ Set(1) ]
for (j = 2 to length of input list) {        // outer loop: check if Set(j) should be added to L
    for (i = 1 to current length of L) {     // inner loop
        check if Set(i) is the same as Set(j)    // this step is O(n), since Set() has O(n) elements
        if (they are the same) exit inner loop
        else
            if (i is the current length of L)    // so Set(j) is unique thus far
                append Set(j) to L
    }
}
There is no a priori bound on n: it can be arbitrarily large. This seems to preclude the use of a simple hash function that maps the binary vector to a decimal number. I could be wrong.
Is there any other way this can be done with a better worst-case running time than O(n^3)?
O(n) sequences of length n makes an input of size O(n^2). You won't get complexity better than that, since you may at least be required to read all the input. All sequences might be the same, for example, but you'd have to read them all to know that.
A binary sequence of length n can be inserted into a trie or radix tree, while checking whether or not it already exists, in O(n) time. That's O(n^2) for all the sequences together, so simply using a trie or radix tree to find duplicates is optimal.
See: https://en.wikipedia.org/wiki/Trie
and: https://en.wikipedia.org/wiki/Radix_tree
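A hedged Java sketch of the trie idea (class and method names are mine); it relies on all vectors having the same length n, as in the question, so a vector is a duplicate exactly when its full path already exists:

static class TrieNode {
    TrieNode[] child = new TrieNode[2];
}

// insert the bit vector; returns true if it was not present before (i.e. unique so far)
static boolean insertIfAbsent(TrieNode root, int[] bits) {
    TrieNode cur = root;
    boolean isNew = false;
    for (int b : bits) {
        if (cur.child[b] == null) {
            cur.child[b] = new TrieNode();   // a fresh branch: this exact vector was never seen
            isNew = true;
        }
        cur = cur.child[b];
    }
    return isNew;
}

// usage: keep Set(j) in the output only when insertIfAbsent(root, Set(j)) returns true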
You may consider implementing your set using a balanced binary search tree. The cost of inserting a new node into such a tree is O(lg m), where m is the number of elements in the tree. Duplicates would implicitly be weeded out, because if we detect that such a node already exists, it is simply not added.
In your example, the total number of lookup/insertion operations would be n*n, since there are n sets and each set has n values. So the overall time scales as O(n^2 * lg(n^2)). This outperforms O(n^3) by some amount.
First of all, these are not sets but bitstrings.
Next, for every bitstring you can convert it to a number and put that number in a hashset (or simply store the original bitstrings; most hashset implementations can do that). Afterwards, your hashset contains all the unique items. O(N) time, O(N) space. If you need to maintain the original order of the strings, then in the first loop check for each string whether it is already in the hashset, and if not, output it and insert it into the hashset.
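A minimal Java sketch of the hash-set approach that also keeps the original order, assuming each vector is given as an int[] of 0/1 entries (names are mine):

import java.util.*;

static List<int[]> uniqueVectors(List<int[]> vectors) {
    Set<String> seen = new HashSet<>();
    List<int[]> unique = new ArrayList<>();
    for (int[] v : vectors) {
        String key = Arrays.toString(v);   // int[] has no value-based equals/hashCode, so use a string key
        if (seen.add(key))                 // add() returns false for duplicates
            unique.add(v);
    }
    return unique;
}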
If you can use O(n) extra space, you can try this:
First of all, let's assume the vectors are binary numbers, so 0110 becomes 6.
This is the case when the entries of the vectors are in {0, 1}; otherwise you can multiply by 10 instead of 2.
Converting all vectors into decimals takes O(4n) here (four entries per vector).
For each converted number, we store the vector under that decimal number as key. To implement this, we use an n-sized hash map.
HM <- n-sized hash map
for each vector v:
    num <- decimal number converted from v
    map v into HM under the key num
loop over HM and take only one vector for each key
Runtime by steps:
O(n)
O(n*(4+1)), where 1 is the time for one mapping and 4 is the vector length
O(n)

Most Efficient way to compute the 99th percentile of a data set

I have 100 integers in my database.
I sort them in ascending order.
Right now, for the 99th percentile, I am taking the 99th number after sorting.
After a given time t, a new number comes into the database and an older number gets discarded.
The current code just takes the 100 integers and sorts them all over again.
Since 99 numbers are shared by the set of the original 100 integers and the set of 100 integers after time t, is there a more efficient way of calculating the 99th percentile, 95th percentile, 90th percentile, etc.?
PS: All this is done in a MySQL database.
Let's call N the size of your array A (here N = 100) and you're looking for the K-th smallest element (after some modification requests).
The easiest solution is probably a kind of modified insertion sort: you keep a (sorted) array of the N-K+1 largest elements (let's call it B).
Discard an element e: walk through B (e.g. while B[i] < e) (*). If B[i] = e, shift all elements at positions < i to the right.
Insert an element e: get the lowest index i such that B[i] > e. Shift all elements at positions >= i to the right and set B[i] := e.
Get the K-th smallest element: return B[0].
Time complexity: O(N-K) per request.
(*) Actually, you could speed up the search step using binary search, but it won't change the overall time complexity.
If N-K is very large, it would be interesting to use binary trees instead (with O(log(N-K)) time per request). But given the actual size of your data sets (and your programming language), it won't pay off. (A simplified sketch follows.)
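As a rough illustration, here is a simplified Java sketch that keeps the entire window of N numbers sorted rather than only the N-K+1 largest; for N = 100 the difference is negligible, and the structure and names are mine:

import java.util.*;

static List<Integer> window = new ArrayList<>();    // all N current numbers, kept sorted ascending

static void replaceValue(int discarded, int added) {
    int i = Collections.binarySearch(window, discarded);
    if (i >= 0) window.remove(i);                   // drop one occurrence of the old value
    int j = Collections.binarySearch(window, added);
    window.add(j >= 0 ? j : -j - 1, added);         // insert at the position that keeps the order
}

static int kthSmallest(int k) {                     // e.g. k = 99 gives the 99th percentile of 100 values
    return window.get(k - 1);
}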
If your data are randomly distributed, you could try guessing the position by assuming a linear distribution.
guessPosition = newnumber*(max-min)/100
Then make a galloping search outwards from that point.
And when the spot is found, insert the new number at the correct position.
So: insert into the normal table, and also add a trigger to insert into an extra, sorted table. Every time you insert into the extra table, add the new element; then, using the index, it should be fast to find the smallest (or largest) element, and drop that element. Now either re-compute the new percentile if the number of items (K) is small, or keep the sum of the elements stored somewhere and subtract the discarded value and add the new one. Then you have the sum without iterating the whole list, and the total number of elements should also be quick to get from the DB. Should be O(log(N-K))-ish time. I think this was a Google interview question (minus the DB part).

sum property of consecutive numbers

Suppose we have a list of numbers like [6, 5, 4, 7, 3]. How can we tell that the array contains consecutive numbers? One way is of course to sort them, or we can find the minimum and maximum. But can we determine it based on the sum of the elements? E.g. in the example above it is 25. Could anyone help me with this?
The sum of elements by itself is not enough.
Instead you could check for:
All elements being unique.
and either:
Difference between min and max being right
or
Sum of all elements being right.
Approach 1
Sort the list and check the first element and last element.
In general this is O( n log(n) ), but if you have a limited data set you can sort in O( n ) time using counting sort or radix sort.
Approach 2
Pass over the data to get the highest and lowest elements.
As you pass through, add each element into a hash table and see if that element has now been added twice. This is more or less O( n ).
Approach 3
To save storage space (hash table), use an approximate approach.
Pass over the data to get the highest and lowest elements.
As you do, implement an algorithm which will with high (read User defined) probability determine that each element is distinct. Many such algorithms exist, and are in use in Data Mining. Here's a link to a paper describing different approaches.
The numbers in the array are consecutive if the difference between the maximum and the minimum number of the array equals n-1, provided the numbers are unique (where n is the size of the array). And of course the minimum and maximum can be calculated in O(n).
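A minimal Java sketch of that O(n) check (the helper name is mine; an empty array is treated as trivially consecutive):

import java.util.*;

static boolean isConsecutive(int[] a) {
    if (a.length == 0) return true;           // treat the empty array as trivially consecutive
    Set<Integer> seen = new HashSet<>();
    int min = a[0], max = a[0];
    for (int x : a) {
        if (!seen.add(x)) return false;       // duplicate => not consecutive
        min = Math.min(min, x);
        max = Math.max(max, x);
    }
    return max - min == a.length - 1;         // unique elements + right span => consecutive
}

// isConsecutive(new int[]{6, 5, 4, 7, 3}) returns true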

Determining if a sequence T is a sorting of a sequence S in O(n) time

I know that one can easily determine whether a sequence is sorted in O(n) time. However, how can we ensure that some sequence T is indeed a sorting of the elements of sequence S in O(n) time?
That is, someone might have an algorithm that outputs some sequence T that is indeed in sorted order but may not contain any elements from sequence S, so how can we check that T is indeed a sorted sequence of S in O(n) time?
Get the length L of S.
Check the length of T as well. If they differ, you are done!
Let Hs be a hash map with something like 2L buckets of all elements in S.
Let Ht be a hash map (again, with 2L buckets) of all elements in T.
For each element in T, check that it exists in Hs.
For each element in S, check that it exists in Ht.
This will work if the elements are unique in each sequence. See wcdolphin's answer for the small changes needed to make it work with non-unique sequences.
I have NOT taken memory consumption into account. Creating two hashmap of double the size of each sequence may be expensive. This is the usual tradeoff between speed and memory.
While Emil's answer is very good, you can do slightly better.
Fundamentally, in order for T to be a reordering of S, it must contain all of the same elements; that is, every element must occur the same number of times in T as in S. Thus, we will (a sketch follows these steps):
Create a hash table of all elements in S, mapping from the element to its number of occurrences.
Iterate through every element in T, decrementing the count of the current element.
If the count reaches zero, remove the element from the hash.
If the current element is not in the hash, T is not a reordering of S.
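A minimal Java sketch of this counting check (the method name is mine); that T is actually sorted would still be verified separately with one O(n) pass:

import java.util.*;

static boolean isPermutationOf(int[] s, int[] t) {
    if (s.length != t.length) return false;
    Map<Integer, Integer> count = new HashMap<>();  // element -> occurrences in S
    for (int x : s) count.merge(x, 1, Integer::sum);
    for (int x : t) {
        Integer c = count.get(x);
        if (c == null) return false;                // x occurs more often in T than in S
        if (c == 1) count.remove(x);                // occurrences exhausted
        else count.put(x, c - 1);
    }
    return true;                                    // equal lengths + all counts matched
}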
Create a hash map for both sequences. Use the character as key and the count of the character as value. If a character has not been added yet, add it with a count of 1; if it has already been added, increase its count by 1.
Then verify that, for each character in the input sequence, the hash map of the sorted sequence contains the character as key with the same count as value.
I believe this is an O(n^2) problem, because:
Assuming the data structure you use to store elements is a linked list (for minimal cost of removing an element),
you will be doing an S.contains(element of T) for every element of T, plus one check that they are the same size.
You cannot assume that S is ordered, and therefore you need to do an element-by-element comparison for every element.
The worst case would be if S is the reverse of T.
This would mean that for element (0+x) of T you would do (n-x) comparisons, if you remove each matched element.
This results in (n*(n+1))/2 operations, which is O(n^2).
There might be some cleverer algorithm out there, though.
