Optimal way to find number of unique numbers in an array - algorithm

What is the optimal way to find the number of unique numbers in an array? One way is to add them to a HashSet and then take the size of the set. Is there a better way than this?
I just need the number of unique numbers. Their frequency is not required.
Any help is appreciated.
Thanks,
Harish

What's the tradeoff in memory for fewer cpu cycles you're willing to accept? Which is more important for your optimal solution?
A variant of counting sort is very inefficient in space, but extremely fast.
For larger datasets you'll be wanting to use hashing, which is what hashset already does. Assuming you're willing to take the overhead of it actually storing the data, just go with your idea. It has the added advantage of being simpler to implement in any language with a decent standard library.

You don't say what is known about the numbers, but if 1) they are integers and 2) you know the range (max and min) and 3) the range isn't too large, then you can allocate an array of ints equal in length to ceiling(range / 32) (assuming 32-bit integers) all initialized to zero. Then go through the data set and set the bit corresponding to each number to 1. At the end, just count the number of 1 bits.
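A minimal sketch of the bitmap idea above, assuming the values are integers known to lie in a closed range. The function name and the `lo`/`hi` parameters are made up for illustration, and a `bytearray` (8 bits per cell) stands in for the array of 32-bit ints:

```python
def count_distinct_bitmap(values, lo, hi):
    """Count distinct integers in values, all assumed to lie in [lo, hi]."""
    size = hi - lo + 1
    bits = bytearray((size + 7) // 8)  # one bit per possible value, all zero
    for v in values:
        offset = v - lo
        bits[offset >> 3] |= 1 << (offset & 7)  # mark this value as seen
    # Count the 1 bits across the whole bitmap.
    return sum(bin(b).count("1") for b in bits)
```

Memory is proportional to the size of the range rather than to the number of elements, which is why this only pays off when the range is not too large.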

One simple algorithm is to loop through the list adding numbers to a hash set as you said, but each time check if it is already in the set, and if not add 1 to a running count. Then when you finish looping through the list you will have the number of distinct elements in the final value of the running count. Here is a python example:
count = 0
seen = set()
for i in numbers:  # numbers is the input list
    if i not in seen:
        seen.add(i)
        count += 1
Edit: I use a running count instead of checking the length of a set because in the background the set may be implemented as a sparse array and an extra loop over that array may be needed to check if each hash has a corresponding value. The running count avoids that potential additional overhead.

I would suggest to sort the array first and look for unique elements after that.
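A rough sketch of the sort-first suggestion, assuming the elements are mutually comparable. The function name is hypothetical. After sorting, equal values sit next to each other, so it is enough to count the positions where the value changes:

```python
def count_distinct_sorted(values):
    """Count distinct values by sorting and counting value changes."""
    ordered = sorted(values)  # O(n log n); leaves the input untouched
    if not ordered:
        return 0
    # Each position where the value differs from its predecessor
    # starts a new distinct value.
    return 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
```

This costs O(n log n) time, compared with expected O(n) for the hash set, but it needs no hash table.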

Related

How to quickly search for a specified element in an ordered array consisting of only two types of elements?

The array mentioned in the question are as follows:
[1,1,...,1,1,-1,-1,...,-1,-1]
How to quickly find the index of the 1 closest to -1?
Note: Both 1 and -1 will exist at the same time, and the number of 1 and -1 is large.
For example, for an array like this:
[1,1,1,1,1,-1,-1,-1]
the result should be 4.
The fastest way I can think of is binary search, is there a faster way?
With the current representation of the data, binary search is the fastest way I can think of. Of course, you can cache and reuse the results in constant time since the answer is always the same.
On the other hand if you change the representation of the array to some simple numbers you can find the next element in constant time. Since the data can always be mapped to a binary value, you can reduce the whole array to 2 numbers. The length of the first partition and the length of the second partition. Or the length of the whole array and the partitioning point. This way you can easily change the length of both partitions in constant time and have access to the next element of the second partition in constant time.
Of course, changing the representation of the array itself is a logarithmic process since you need to find the partitioning point.
By a simple information-theoretic argument, you can't be faster than log(n) using only comparisons: there are n possible outcomes, and you need to collect at least log(n) bits of information to tell them apart.
If you have extra information about the statistical distribution of the values, then maybe you can exploit it. But this is to be discussed on a case-by-case basis.
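For reference, the binary search discussed above could look roughly like this. It is a sketch that assumes, as the question states, that the array starts with 1 and ends with -1; the function name is invented for the example:

```python
def last_one_index(arr):
    """Return the index of the last 1 in an array of 1s followed by -1s."""
    lo, hi = 0, len(arr) - 1  # invariant: arr[lo] == 1 and arr[hi] == -1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if arr[mid] == 1:
            lo = mid   # the boundary is at or to the right of mid
        else:
            hi = mid   # the boundary is to the left of mid
    return lo
```

For the array `[1,1,1,1,1,-1,-1,-1]` from the question this returns 4, after about log2(n) comparisons.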

Which sorting algorithm should I use for a list with a lot of replications?

I want to sort an array of 1 million integers. What would be the best algorithm to use, knowing that all of the array's integers lie between 1 and 100? Note that this means that a lot of items are replicated. Furthermore, the array is randomly distributed.
You create an array of 100 elements (with one for each possible value) and simply count how many there are of each. Running time: O(n), with each element of the original array accessed only once, so you're unlikely to find a faster one. :)
Or to give it its proper name, use a counting sort.
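A minimal counting sort sketch for this case, assuming every value is an integer between `lo` and `hi` inclusive (the parameter names and defaults are chosen here to match the 1 to 100 range in the question):

```python
def counting_sort(values, lo=1, hi=100):
    """Sort integers known to lie in [lo, hi] by counting occurrences."""
    counts = [0] * (hi - lo + 1)  # one counter per possible value
    for v in values:
        counts[v - lo] += 1
    result = []
    for offset, c in enumerate(counts):
        result.extend([lo + offset] * c)  # emit each value c times
    return result
```

The input is read once and the output is written once, so the total work is O(n + k), where k is the number of possible values.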

sorting a bivalued list

If I have a list of just binary values containing 0's and 1's like the following 000111010110
and I want to sort it to the following 000000111111, what would be the most efficient way to do this if you also know the list size? Right now I am thinking of having one counter where I just count the number of 0's as I traverse the list from beginning to end. Then if I subtract numberOfZeros from the listSize I get numberOfOnes. Then I was thinking that instead of reordering the list starting with zeros, I would just create a new list. Would you agree this is the most efficient method?
Your algorithm implements the most primitive version of the classic bucket sort algorithm (its counting sort implementation). It is the fastest possible way to sort numbers when their range is known and is (relatively) small. Since zeros and ones are all you have, you do not need the array of counters that bucket sort normally uses: a single counter is sufficient.
If you have numeric values, you can use a population-count instruction (POPCNT in x86 assembly; note that bit scan, BSF, only finds the position of the lowest set bit) to count the number of 1 bits, n. To create the "sorted" value you would set the n+1 bit, then subtract one. This will set all the bits to the right of the n+1 bit.
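The bit trick above can be sketched as follows, assuming the binary list is packed into a single non-negative integer. The function name is hypothetical, and `bin(x).count("1")` is used as a portable stand-in for a hardware population count:

```python
def sort_bits(x):
    """Move all 1 bits of a non-negative integer to the low-order end."""
    n = bin(x).count("1")   # number of 1 bits in x
    return (1 << n) - 1     # set bit n+1, subtract one: the n low bits are 1
```

For the value from the question, `sort_bits(0b000111010110)` gives `0b000000111111`, since six of the twelve bits are set.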
Bucket sort is a sorting algorithm, as it seems.
I don't think there is a need for such operations. As we know, there is no comparison-based sorting algorithm faster than N*logN. So by default it is wrong.
All you have to do is what you said at the very beginning. Just traverse the list and count the zeros or the ones; that gives you O(n) complexity. Then just create a new array with the counted zeros at the beginning followed by the ones. Then you have a total of N+N steps, which still gives you
O(n) complexity.
And that's only because you have only two values. So neither quicksort nor any other comparison sort can do this faster. There is no faster comparison sort than N*log(N).
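The count-and-rebuild approach described above might look like this in Python; the function name is invented, and the list is assumed to contain only 0 and 1:

```python
def sort_binary_list(bits):
    """Sort a list containing only 0s and 1s with a single counter."""
    zeros = bits.count(0)        # one pass over the list
    ones = len(bits) - zeros     # list size minus the number of zeros
    return [0] * zeros + [1] * ones  # second pass: build the new list
```

Both passes are linear, so the whole thing is O(n).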

Find a common element within N arrays

If I have N arrays, what is the best way (in terms of time complexity; space is not important) to find the common elements? You could just find 1 element and stop.
Edit: The elements are all Numbers.
Edit: These are unsorted. Please do not sort and scan.
This is not a homework problem. Somebody asked me this question a long time ago. He was using a hash to solve the problem and asked me if I had a better way.
Create a hash index, with elements as keys, counts as values. Loop through all values and update the count in the index. Afterwards, run through the index and check which elements have count = N. Looking up an element in the index should be O(1), combined with looping through all M elements should be O(M).
If you want to keep order specific to a certain input array, loop over that array and test the element counts in the index in that order.
Some special cases:
if you know that the elements are (positive) integers with a maximum that is not too high, you could just use a normal array as the "hash" index to keep counts, where each number is simply its own array index.
I've assumed that in each array each number occurs only once. Adapting it for more occurrences should be easy (set the i-th bit in the count for the i-th array, or only update if the current element count == i-1).
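A sketch of the hash index with the "only update if the count equals i-1" adaptation, so that repeated values inside one array are counted once. The function name is hypothetical, a Python dict plays the role of the hash index, and the array position is 0-based here:

```python
def common_elements(arrays):
    """Return the values that occur in every one of the given arrays."""
    counts = {}  # value -> number of leading arrays that contain it
    for i, arr in enumerate(arrays):
        for x in arr:
            # Advance the count only if x appeared in all i earlier arrays
            # and has not been counted for this array yet.
            if counts.get(x, 0) == i:
                counts[x] = i + 1
    n = len(arrays)
    return [x for x, c in counts.items() if c == n]
```

Each element is looked up once, so the expected running time is O(M) for M elements in total.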
EDIT when I answered the question, the question did not have the part of "a better way" than hashing in it.
The most direct method is to intersect the first 2 arrays and then intersect the result with each of the remaining N-2 arrays in turn.
If 'intersection' is not defined in the language in which you're working or you require a more specific answer (ie you need the answer to 'how do you do the intersection') then modify your question as such.
Without sorting there isn't an optimized way to do this based on the information given. (ie sorting and positioning all elements relatively to each other then iterating over the length of the arrays checking for defined elements in all the arrays at once)
The question asks whether there is a better way than hashing. There is no better way (i.e. better time complexity) than hashing, as the time to hash each element is typically constant. Empirical performance is also favorable, particularly if the range of values can be mapped one to one to an array maintaining counts. The time is then proportional to the number of elements across all the arrays. Sorting will not give better complexity, since it still needs to visit each element at least once, and then there is the log N factor for sorting each array.
Back to hashing, from a performance standpoint, you will get the best empirical performance by not processing each array fully, but processing only a block of elements from each array before proceeding onto the next array. This will take advantage of the CPU cache. It also results in fewer elements being hashed in favorable cases when common elements appear in the same regions of the array (e.g. common elements at the start of all arrays.) Worst case behaviour is no worse than hashing each array in full - merely that all elements are hashed.
I don't think the approach suggested by catchmeifyoutry will work.
Let us say you have two arrays
1: {1,1,2,3,4,5}
2: {1,3,6,7}
then the answer should be 1 and 3. But if we use the hashtable approach, 1 will have count 3 and we will never find 1 in this situation.
Also, the problem becomes more complex if we have input like this:
1: {1,1,1,2,3,4}
2: {1,1,5,6}
Here I think we should give the output as 1,1. The suggested approach fails in both cases.
Solution:
Read the first array and put it into a hashtable. If we find the same key again, don't increment the counter. Read the second array in the same manner. Now in the hashtable the common elements are those with a count of 2.
But again, this approach will fail on the second input set that I gave earlier.
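If repeated common elements should be reported with their multiplicity (the 1,1 case above), one possible approach is a multiset intersection. This is not what the answer proposes; it is a sketch using `collections.Counter`, whose `&` operator keeps the minimum count of each key. The function name is made up, and at least one array is assumed:

```python
from collections import Counter
from functools import reduce

def common_with_multiplicity(arrays):
    """Common elements of all arrays, repeated as often as they are shared."""
    # Counter & Counter keeps, for each key, the smaller of the two counts.
    shared = reduce(lambda a, b: a & b, (Counter(arr) for arr in arrays))
    return sorted(shared.elements())
```

With the two inputs above, `{1,1,2,3,4,5}` and `{1,3,6,7}` give 1 and 3, while `{1,1,1,2,3,4}` and `{1,1,5,6}` give 1,1.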
I'd first start with the degenerate case, finding common elements between 2 arrays (more on this later). From there I'll have a collection of common values which I will use as an array itself and compare it against the next array. This check would be performed N-1 times or until the "carry" array of common elements drops to size 0.
One could speed this up, I'd imagine, by divide-and-conquer, splitting the N arrays into the end nodes of a tree. The next level up the tree is N/2 common element arrays, and so forth and so on until you have an array at the top that is either filled or not. In either case, you'd have your answer.
Without sorting and scanning (or hashing), the best operational speed you'll get for comparing 2 arrays for common elements is O(N^2).
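The "carry" approach described above could be sketched like this. Note the swap: Python sets, which are hash-based, do the pairwise step here, so each step is expected linear time rather than quadratic. The function name is hypothetical:

```python
def intersect_all(arrays):
    """Intersect the arrays one by one, stopping once nothing is left."""
    if not arrays:
        return set()
    carry = set(arrays[0])  # common elements found so far
    for arr in arrays[1:]:
        carry.intersection_update(arr)  # keep only values also in arr
        if not carry:
            break  # no common elements remain, so stop early
    return carry
```

The early exit is what the answer refers to as the carry array dropping to size 0.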

Generate sequence of integers in random order without constructing the whole list upfront [duplicate]

How can I generate the list of integers from 1 to N but in a random order, without ever constructing the whole list in memory?
(To be clear: Each number in the generated list must only appear once, so it must be the equivalent to creating the whole list in memory first, then shuffling.)
This has been determined to be a duplicate of this question.
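A very simple generator is 1+((power(r,x)-1) mod p): for values of x from 1 to p-1 it produces each number from 1 to p-1 exactly once, provided p is prime and r is a primitive root modulo p. It is not enough for r to be a prime different from p; for example, r = 2 and p = 7 only ever produces 2, 4 and 1. The order is fixed by r and p, so it looks scrambled rather than being truly random.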
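A small sketch of the power formula, assuming p is prime and r is a primitive root modulo p (that assumption is needed for every value to appear; the function name is invented). For r^x not divisible by p, 1+((r^x - 1) mod p) is the same as r^x mod p, so the generator just keeps a running product:

```python
def power_sequence(r, p):
    """Yield r^1, r^2, ..., r^(p-1) modulo p without storing a list."""
    value = 1
    for _ in range(p - 1):
        value = (value * r) % p  # next power of r modulo p
        yield value
```

For example, `list(power_sequence(2, 11))` is `[2, 4, 8, 5, 10, 9, 7, 3, 6, 1]`, which covers 1 to 10 once each. To cover 1 to N for an arbitrary N, one option is to pick a prime p larger than N and skip the values above N.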
Not the whole list technically, but you could use a bit mask to decide if a number has already been selected. This has a lot less storage than the number list itself.
Set all N bits to 0, then for each desired number:
use one of the normal linear congruential methods to select a number from 1 to N.
if that number has already been used, find the next highest unused one (0 bit), wrapping around at N.
set that number's bit to 1 and return it.
That way you're guaranteed only one use per number and relatively random results.
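The steps above can be sketched as follows. Python's `random.Random` is used here in place of a hand-written linear congruential generator, and the function name and `seed` parameter are assumptions for the example:

```python
import random

def random_order(n, seed=None):
    """Yield 1..n once each in scrambled order, using n bits of state."""
    rng = random.Random(seed)
    used = bytearray((n + 7) // 8)  # bit k set means number k+1 was returned
    for _ in range(n):
        k = rng.randrange(n)
        # If already used, move to the next unused number, wrapping around.
        while used[k >> 3] & (1 << (k & 7)):
            k = (k + 1) % n
        used[k >> 3] |= 1 << (k & 7)
        yield k + 1
```

The probing step can take long once most bits are set, and it favors numbers that follow long used runs, so the result is only "relatively random", as the answer says.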
It might help to specify a language you are searching a solution for.
You could use a dynamic list where you store your generated numbers, since you will need a reference which numbers you already created. Every time you create a new number you could check if the number is contained in the list and throw it away if it is contained and try again.
The only possible way without such a list would be to use a number size where it is unlikely to generate a duplicate like a UUID if the algorithm is working correctly - but this doesn't guarantee that no duplicate is generated - it is just highly unlikely.
You will need at least half of the total list's memory, just to remember what you did already.
If memory is tight, you could try this:
Keep the results generated so far in a tree: generate a random number and insert it into the tree. If you cannot insert it because it is already there, generate another number and try again, and so on, until the tree fills halfway.
When the tree fills halfway, you invert it: you construct a tree holding the numbers that you haven't used yet, then pick them in random order.
It has some overhead for keeping the tree structure, but it may help when your pointers are considerably smaller in size than your data is.
