How to count unique items in a list? - algorithm

How would someone go about counting the number of unique items in a list?
For example, say I have {1, 3, 3, 4, 1, 3} and I want to get the number 3, which represents the number of unique items in the list (namely |A| = 3 if A = {1, 3, 4}). What algorithm would someone use for this?
I have tried a double loop:
for firstItem to lastItem
    currentItem=a
    for currentItem to lastItem
        currentItem=b
        if a==b then numberOfDuplicates++
uniqueItems=numberOfItems-numberOfDuplicates
That doesn't work as it counts the duplicates more times than actually needed. With the example in the beginning it would be:
For the first loop it would count +1 duplicates for number 1 in the list.
For the second loop it would count +2 duplicates for number 3 in the list.
For the third loop it would count +1 duplicates for number 3 again (overcounting the last '3'), and
that's where the problem comes in.
Any idea on how to solve this?

Add the items to a HashSet, then check the HashSet's size after you finish.
Assuming that you have a good hash function, this is O(n).
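As a minimal sketch of the hash-set idea in Python (the function name count_unique is made up for this illustration, and the items are assumed to be hashable):

```python
def count_unique(items):
    """Return the number of distinct values in items."""
    # Adding every item to a set drops repeats; the set's size is the answer.
    return len(set(items))


print(count_unique([1, 3, 3, 4, 1, 3]))  # prints 3
```

With a reasonable hash function each insertion takes expected constant time, which is where the O(n) estimate comes from.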

You can check whether any duplicate follows the current number. If not, increment uniqueCount:
uniqueCount = 0;
for (i = 0; i < size; i++) {
    bool isUnique = true;
    for (j = i + 1; j < size; j++) {
        if (arr[i] == arr[j]) {
            isUnique = false;
            break;
        }
    }
    if (isUnique) {
        uniqueCount++;
    }
}
The above approach is O(N^2) in time and O(1) in space.
Another approach is to sort the input array, which puts duplicate elements next to each other, and then compare adjacent elements. This approach is O(N log N) in time and, with an in-place sort such as heapsort, O(1) in space.
If you are allowed to use additional space, you can get this done in O(N) time and O(N) space by using a hash. The keys of the hash are the array elements and the values are their frequencies.
At the end of hashing, the number of keys in the hash is the number of unique items. (Counting only the keys whose value is 1 would instead give the number of elements that appear exactly once.)
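To make the difference between "distinct values" and "values that occur exactly once" concrete, here is a small Python sketch of the frequency-table approach. The name frequency_summary is hypothetical, and collections.Counter stands in for the hash described above:

```python
from collections import Counter


def frequency_summary(items):
    """Return (number of distinct values, number of values seen exactly once)."""
    freq = Counter(items)  # maps each value to how often it occurs
    distinct = len(freq)
    appearing_once = sum(1 for count in freq.values() if count == 1)
    return distinct, appearing_once


print(frequency_summary([1, 3, 3, 4, 1, 3]))  # prints (3, 1)
```

For the list in the question, the first number (3) is the one being asked for; only the value 4 occurs exactly once.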

Sort it using a decent sorting algorithm like mergesort or heapsort (both have O(n log n) worst-case running time) and loop over the sorted list:
sorted_list = sorted(items)
unique_count = 1  # the first item is always new (assumes a non-empty list)
last = sorted_list[0]
for item in sorted_list[1:]:
    if item != last:
        unique_count += 1
    last = item

Collections.sort(list);
int duplicates = 0;
for (int i = 0; i < list.size() - 1; i++)
    if (list.get(i).equals(list.get(i + 1)))
        duplicates++;
int uniqueItems = list.size() - duplicates;

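Keep a Dictionary and increment each item's count in a loop.
This is how it looks in C#:
int[] items = {1, 3, 3, 4, 1, 3};
Dictionary<int,int> dic = new Dictionary<int,int>();
foreach (int item in items)
{
    dic.TryGetValue(item, out int count); // count stays 0 for a new key
    dic[item] = count + 1;
}
int uniqueItems = dic.Count; // number of distinct keys
Of course there is a LINQ way in C#, but as I understand it the question is general ;)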

Related

Algorithm for all combinations to divide set into equally sized subsets [duplicate]

Let's say I have a set of elements S = { 1, 2, 3, 4, 5, 6, 7, 8, 9 }
I would like to create combinations of 3 and group them in a way such that no number appears in more than one combination.
Here is an example:
{ {3, 7, 9}, {1, 2, 4}, {5, 6, 8} }
The order of the numbers in the groups does not matter, nor does the order of the groups in the entire example.
In short, I want every possible group combination from every possible combination in the original set, excluding the ones that have a number appearing in multiple groups.
My question: is this actually feasible in terms of run time and memory? My sample sizes could be somewhere around 30-50 numbers.
If so, what is the best way to create this algorithm? Would it be best to create all possible combinations, and choose the groups only if the number hasn't already appeared?
I'm writing this in Qt 5.6, which is a C++ based framework.
You can do this recursively, and avoid duplicates, if you keep the first element fixed in each recursion, and only make groups of 3 with the values in order, eg:
{1,2,3,4,5,6,7,8,9}
Put the lowest element in the first spot (a), and keep it there:
{a,b,c} = {1, *, *}
For the second spot (b), iterate over every value from the second-lowest to the second-highest:
{a,b,c} = {1, 2~8, *}
For the third spot (c), iterate over every value higher than the second value:
{1, 2~8, b+1~9}
Then recurse with the rest of the values.
{1,2,3} {4,5,6} {7,8,9}
{1,2,3} {4,5,7} {6,8,9}
{1,2,3} {4,5,8} {6,7,9}
{1,2,3} {4,5,9} {6,7,8}
{1,2,3} {4,6,7} {5,8,9}
{1,2,3} {4,6,8} {5,7,9}
{1,2,3} {4,6,9} {5,7,8}
{1,2,3} {4,7,8} {5,6,9}
{1,2,3} {4,7,9} {5,6,8}
{1,2,3} {4,8,9} {5,6,7}
{1,2,4} {3,5,6} {7,8,9}
...
{1,8,9} {2,6,7} {3,4,5}
When I say "in order", that doesn't have to be any specific order (numerical, alphabetical...), it can just be the original order of the input. You can avoid having to re-sort the input of each recursion if you make sure to pass the rest of the values on to the next recursion in the order you received them.
A run-through of the recursion:
Let's say you get the input {1,2,3,4,5,6,7,8,9}. As the first element in the group, you take the first element from the input, and for the other two elements, you iterate over the other values:
{1,2,3}
{1,2,4}
{1,2,5}
{1,2,6}
{1,2,7}
{1,2,8}
{1,2,9}
{1,3,4}
{1,3,5}
{1,3,6}
...
{1,8,9}
making sure the third element always comes after the second element, to avoid duplicates like:
{1,3,5} ⇆ {1,5,3}
Now, let's say that at a certain point, you've selected this as the first group:
{1,3,7}
You then pass the rest of the values onto the next recursion:
{2,4,5,6,8,9}
In this recursion, you apply the same rules as for the first group: take the first element as the first element in the group and keep it there, and iterate over the other values for the second and third element:
{2,4,5}
{2,4,6}
{2,4,8}
{2,4,9}
{2,5,6}
{2,5,8}
{2,5,9}
{2,6,7}
...
{2,8,9}
Now, let's say that at a certain point, you've selected this as the second group:
{2,5,6}
You then pass the rest of the values onto the next recursion:
{4,8,9}
And since this is the last group, there is only one possibility, and so this particular recursion would end in the combination:
{1,3,7} {2,5,6} {4,8,9}
As you see, you don't have to sort the values at any point, as long as you pass them on to the next recursion in the order you received them. So if you receive e.g.:
{q,w,e,r,t,y,u,i,o}
and you select from this the group:
{q,r,u}
then you should pass on:
{w,e,t,y,i,o}
Here's a JavaScript snippet which demonstrates the method; it returns a 3D array with combinations of groups of elements.
(The filter function creates a copy of the input array, with elements 0, i and j removed.)
function clone2D(array) {
    var clone = [];
    for (var i = 0; i < array.length; i++) clone.push(array[i].slice());
    return clone;
}
function groupThree(input) {
    var result = [], combination = [];
    group(input, 0);
    return result;
    function group(input, step) {
        combination[step] = [input[0]];
        for (var i = 1; i < input.length - 1; i++) {
            combination[step][1] = input[i];
            for (var j = i + 1; j < input.length; j++) {
                combination[step][2] = input[j];
                if (input.length > 3) {
                    var rest = input.filter(function(elem, index) {
                        return index && index != i && index != j;
                    });
                    group(rest, step + 1);
                }
                else result.push(clone2D(combination));
            }
        }
    }
}
var result = groupThree([1,2,3,4,5,6,7,8,9]);
for (var r in result) document.write(JSON.stringify(result[r]) + "<br>");
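On the feasibility part of the question, a rough Python sketch of the same recursion may help, together with a closed-form count. The names group_three and count_groupings are made up here. The count n! / (3!^(n/3) * (n/3)!) is the standard formula for splitting n items into unordered groups of three, and the sketch assumes n is divisible by 3:

```python
from math import factorial


def group_three(items):
    """Yield every way to split items into unordered groups of three."""
    if not items:
        yield []
        return
    first = items[0]  # the first element always stays in the first group
    for i in range(1, len(items) - 1):
        for j in range(i + 1, len(items)):
            # Pass on the remaining values in their original order.
            rest = [x for k, x in enumerate(items) if k not in (0, i, j)]
            for tail in group_three(rest):
                yield [(first, items[i], items[j])] + tail


def count_groupings(n):
    """Number of groupings for n items, assuming n is divisible by 3."""
    groups = n // 3
    return factorial(n) // (factorial(3) ** groups * factorial(groups))


print(len(list(group_three(list(range(1, 10))))))  # prints 280
print(count_groupings(9))                          # prints 280
```

For 9 elements that is a manageable 280 groupings, but count_groupings(30) is already more than 10^18, so listing every grouping for 30 to 50 elements is not realistic. Generating them lazily, as the generator does, at least avoids holding them all in memory.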
For n things taken 3 at a time, you could use 3 nested loops:
for(k = 0; k < n-2; k++){
    for(j = k+1; j < n-1; j++){
        for(i = j+1; i < n ; i++){
            ... S[k] ... S[j] ... S[i]
        }
    }
}
For a generic solution of n things taken k at a time, you could use an array of k counters.
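One possible reading of the "array of k counters" idea, sketched in Python: the counters are k increasing indices into the input, and each step advances the rightmost counter that still has room. The name combinations_by_counters is hypothetical:

```python
def combinations_by_counters(items, k):
    """Yield all k-element combinations of items using k index counters."""
    n = len(items)
    if k > n:
        return
    idx = list(range(k))  # counters start at 0, 1, ..., k-1
    while True:
        yield [items[i] for i in idx]
        # Find the rightmost counter that has not reached its maximum.
        pos = k - 1
        while pos >= 0 and idx[pos] == n - k + pos:
            pos -= 1
        if pos < 0:
            return  # every counter is at its maximum: done
        idx[pos] += 1
        # Reset the counters to the right so they stay strictly increasing.
        for p in range(pos + 1, k):
            idx[p] = idx[p - 1] + 1


for combo in combinations_by_counters([1, 2, 3, 4], 2):
    print(combo)
```

With k = 3 this visits the same index triples as the three nested loops above. In real Python code itertools.combinations does the same job.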
I think you can solve it by treating it like the coin change problem with dynamic programming: assume you are looking for change of 3 and that every index in the array is a coin of value 1, then output the coins (the values in your array) that have been found.
Link: https://www.youtube.com/watch?v=18NVyOI_690

(Any Language) Find all permutations of elements in a vector using swapping

I was asked this question in a Lab session today.
We can imagine a vector containing the elements 1 ... N, with a length N. Is there an algorithmic (systematic) method of generating all permutations, or orderings, of the elements in the vector? One proposed method was to swap random elements. This would work provided all previously generated permutations were stored for future reference, but it is a very inefficient method, both space-wise and time-wise.
The reason for doing this by the way is to remove special elements (eg elements which are zero) from special positions in the vector, where such an element is not allowed. Therefore the random method isn't quite so ridiculous, but imagine the case where the number of elements is large and the number of possible permutations (which are such that there are no "special elements" in any of the "special positions") is low.
We tried to work through this problem for the case of N = 5:
x = [1, 2, 3, 4, 5]
First, swap elements 4 and 5:
x = [1, 2, 3, 5, 4]
Then swap 3 and 5:
x = [1, 2, 4, 5, 3]
Then 3 and 4:
x = [1, 2, 5, 4, 3]
Originally we thought using two indices, ix and jx, might be a possible solution. Something like:
ix = 0;
jx = 0;
for(;;)
{
    ++ ix;
    if(ix >= N)
    {
        ix = 0;
        ++ jx;
        if(jx >= N)
        {
            break; // We have got to an exit condition, but HAVEN'T got all permutations
        }
    }
    swap elements at positions ix and jx
    print out the elements
}
This works for the case where N = 3. However it doesn't work for higher N. We think that this sort of approach might be along the right lines. We were trying to extend to a method where 3 indexes are used, for some reason we think that might be the solution: Using a 3rd index to mark a position in the vector where the index ix starts or ends. But we got stuck, and decided to ask the SO community for advice.
One way to do this is to, for the first character e:
First recurse on the next element
Then, for each element e2 after e:
Swap e and e2
Then recurse on the next element
And undo the swap
Pseudo-code:
permutation(input, 0)

permutation(char[] array, int start)
    if (start == array.length)
        print array
    for (int i = start; i < array.length; i++)
        swap(array[start], array[i])
        permutation(array, start+1)
        swap(array[start], array[i])
With the main call of this function, it will try each character in the first position and then recurse. Simply looping over all the characters works here because we undo each swap afterwards, so after the recursive call returns, we're guaranteed to be back where we started.
And then, for each of those recursive calls, it tries each remaining character in the second position. And so on.
Java live demo.
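The pseudo-code above translates fairly directly to Python. This is only a sketch, and permutations_by_swapping is a name chosen for this example. It works in place on the list and yields a copy of each arrangement:

```python
def permutations_by_swapping(values, start=0):
    """Yield every permutation of values by swapping in place."""
    if start == len(values):
        yield list(values)  # copy, because values keeps changing
        return
    for i in range(start, len(values)):
        values[start], values[i] = values[i], values[start]  # choose values[i] for this slot
        yield from permutations_by_swapping(values, start + 1)
        values[start], values[i] = values[i], values[start]  # undo the swap


for p in permutations_by_swapping([1, 2, 3]):
    print(p)
```

If some elements are not allowed in some positions, as in the question, one option would be to skip the recursive call whenever the element just swapped into position start is forbidden there. That prunes whole branches instead of generating and rejecting them.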

Unique representation of 2 or more arrays

I have got several arrays of fixed length where each component can take on natural number values. In my program 2 vectors are identical in this simple case
0001112
1110002
2220001 would also be identical to these 2 arrays
My question is how can I get a unique representation for these two arrays?
Cheers
It's not entirely clear how your equivalence relation is defined, but building a set representation out of the arrays satisfies the constraint you've given. There are two ways of doing this:
Convert to an appropriate data structure (sets are built-in in many languages, otherwise a hash table or BST will do).
Sort each array, remove the duplicate elements and truncate them. Since they're fixed-length, you'll have to store the number of distinct elements somewhere, or use -1 to signal "end of elements".
One way is to store them in a dictionary (hash table) mapping each number to the number of times it appears. Your two arrays would have the same representation:
{0: 3, 1: 3, 2: 1}
// Returns the lengths of the runs of equal adjacent values,
// e.g. 0001112 -> 3, 3, 1.
public static List<int> GetUniqueRepresentation(int[] array)
{
    int count = 1;
    var output = new List<int>();
    for (int i = 1; i <= array.Length; i++)
    {
        if (i < array.Length && array[i] == array[i - 1])
        {
            count++;
        }
        else
        {
            output.Add(count);
            count = 1;
        }
    }
    return output;
}
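Since the equivalence relation is not spelled out, here is one more reading, sketched in Python: two arrays count as identical if one can be turned into the other by consistently renaming the values. Under that assumption (which is a guess at the intent), relabelling values in order of first appearance gives a canonical form. The name canonical_form is made up:

```python
def canonical_form(values):
    """Relabel values as 0, 1, 2, ... in order of first appearance."""
    relabel = {}
    out = []
    for v in values:
        if v not in relabel:
            relabel[v] = len(relabel)  # next unused label
        out.append(relabel[v])
    return tuple(out)


print(canonical_form([0, 0, 0, 1, 1, 1, 2]))
print(canonical_form([1, 1, 1, 0, 0, 0, 2]))
print(canonical_form([2, 2, 2, 0, 0, 0, 1]))
```

All three example arrays from the question map to the same tuple, so the tuple can be used as a dictionary key or compared directly.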

Programming Interview Question / how to find if any two integers in an array sum to zero?

Not a homework question, but a possible interview question...
Given an array of integers, write an algorithm that will check if the sum of any two is zero.
What is the Big O of this solution?
Looking for non brute force methods
Use a lookup table: scan through the array, inserting each value into the table. If you encounter a value whose negation is already in the table (an easy lookup), the two sum to zero. The lookup table can be a hash table to conserve memory.
This solution should be O(N).
Pseudo code:
var table = new HashSet<int>();
var array = // your int array
foreach (int n in array)
{
    // Check before adding, so a single 0 does not match itself.
    if (table.Contains(-n))
    {
        // You found it.
    }
    table.Add(n);
}
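A minimal Python sketch of the same lookup-table idea (has_zero_sum_pair is a hypothetical name). The negation is checked before the current value is stored, so a lone 0 never pairs with itself:

```python
def has_zero_sum_pair(values):
    """Return True if two elements at different positions sum to zero."""
    seen = set()
    for v in values:
        if -v in seen:  # the partner was met earlier in the scan
            return True
        seen.add(v)
    return False


print(has_zero_sum_pair([3, -1, 4, -3]))  # prints True
print(has_zero_sum_pair([0, 5, 7]))       # prints False
```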
The hashtable solution others have mentioned is usually O(n), but it can also degenerate to O(n^2) in theory.
Here's a Theta(n log n) solution that never degenerates:
Sort the array (optimal quicksort, heap sort, merge sort are all Theta(n log n))
for i = 1, array.len - 1
    binary search for -array[i] in i+1, array.len
If your binary search ever returns true, then you can stop the algorithm and you have a solution.
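The sort-then-binary-search idea can be sketched in Python with the standard bisect module. The function name has_zero_pair_sorted is made up for this illustration:

```python
from bisect import bisect_left


def has_zero_pair_sorted(values):
    """Sort, then binary-search for each element's negation to its right."""
    a = sorted(values)
    for i, x in enumerate(a):
        # Only look at positions after i, so an element never pairs with itself.
        j = bisect_left(a, -x, i + 1)
        if j < len(a) and a[j] == -x:
            return True
    return False


print(has_zero_pair_sorted([7, -2, 9, 2]))  # prints True
```

The sort costs O(n log n) and each of the n binary searches costs O(log n), so the total stays O(n log n) regardless of the input.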
An O(n log n) solution (the cost is dominated by the sort) would be to sort all the data values, then run a pointer from lowest to highest at the same time as you run a pointer from highest to lowest:
def findmatch(array n):
    lo = first_index_of(n)
    hi = last_index_of(n)
    while true:
        if lo >= hi:                     # Catch where pointers have met.
            return false
        if n[lo] = -n[hi]:               # Catch the match.
            return true
        if sign(n[lo]) = sign(n[hi]):    # Catch where pointers are now same sign.
            return false
        if -n[lo] > n[hi]:               # Move relevant pointer.
            lo = lo + 1
        else:
            hi = hi - 1
An O(n) time complexity solution is to maintain an array of all values met:
def findmatch(array n):
    maxval = maximum_value_in(n)         # This is O(n).
    array b = new array(0..maxval)       # This is O(1).
    zero_all(b)                          # This is O(n).
    for i in index(n):                   # This is O(n).
        if n[i] = 0:
            if b[0] = 1:
                return true
            b[0] = 1
            nextfor
        if n[i] < 0:
            if -n[i] <= maxval:
                if b[-n[i]] = 1:
                    return true
                b[-n[i]] = -1
            nextfor
        if b[n[i]] = -1:
            return true
        b[n[i]] = 1
    return false
This works by simply maintaining a sign for a given magnitude, every possible magnitude between 0 and the maximum value.
So, if at any point we find -12, we set b[12] to -1. Then later, if we find 12, we know we have a pair. Same for finding the positive first except we set the sign to 1. If we find two -12's in a row, that still sets b[12] to -1, waiting for a 12 to offset it.
The only special cases in this code are:
0 is treated specially since we need to detect it despite its somewhat strange properties in this algorithm (I treat it specially so as to not complicate the positive and negative cases).
low negative values whose magnitude is higher than the highest positive value can be safely ignored since no match is possible.
As with most tricky "minimise-time-complexity" algorithms, this one has a trade-off in that it may have a higher space complexity (such as when there's only one element in the array that happens to be positive two billion).
In that case, you would probably revert to the sorting O(n log n) solution but, if you know the limits up front (say if you're restricting the integers to the range [-100,100]), this can be a powerful optimisation.
In retrospect, perhaps a cleaner-looking solution may have been:
def findmatch(array num):
    # Array empty means no match possible.
    if num.size = 0:
        return false
    # Find biggest value, no match possible if all values are negative.
    max_positive = num[0]
    for i = 1 to num.size - 1:
        if num[i] > max_positive:
            max_positive = num[i]
    if max_positive < 0:
        return false
    # Create and init array of positives.
    array found = new array[max_positive+1]
    for i = 1 to found.size - 1:
        found[i] = false
    zero_found = false
    # Check every value.
    for i = 0 to num.size - 1:
        # More than one zero means match is found.
        if num[i] = 0:
            if zero_found:
                return true
            zero_found = true
        # Otherwise store fact that you found positive.
        if num[i] > 0:
            found[num[i]] = true
    # Check every value again.
    for i = 0 to num.size - 1:
        # If negative and within positive range and positive was found, it's a match.
        if num[i] < 0 and -num[i] <= max_positive:
            if found[-num[i]]:
                return true
    # No matches found, return false.
    return false
This makes one full pass and a partial pass (or full on no match) whereas the original made the partial pass only but I think it's easier to read and only needs one bit per number (positive found or not found) rather than two (none, positive or negative found). In any case, it's still very much O(n) time complexity.
I think IVlad's answer is probably what you're after, but here's a slightly more off the wall approach.
If the integers are likely to be small and memory is not a constraint, then you can use a BitArray collection. This is a .NET class in System.Collections, though Microsoft's C++ has a bitset equivalent.
The BitArray class allocates a lump of memory, and fills it with zeroes. You can then 'get' and 'set' bits at a designated index, so you could call myBitArray.Set(18, true), which would set the bit at index 18 in the memory block (which then reads something like 00000000, 00000000, 00100000). The operation to set a bit is an O(1) operation.
So, assuming a 32 bit integer scope, and 1Gb of spare memory, you could do the following approach:
BitArray myPositives = new BitArray(int.MaxValue);
BitArray myNegatives = new BitArray(int.MaxValue);
bool pairIsFound = false;
foreach (int testValue in arrayOfIntegers)
{
    if (testValue < 0)
    {
        // -ve number - have we seen the +ve yet?
        if (myPositives.Get(-testValue))
        {
            pairIsFound = true;
            break;
        }
        // Not seen the +ve, so log that we've seen the -ve.
        myNegatives.Set(-testValue, true);
    }
    else
    {
        // +ve number (inc. zero). Have we seen the -ve yet?
        if (myNegatives.Get(testValue))
        {
            pairIsFound = true;
            break;
        }
        // Not seen the -ve, so log that we've seen the +ve.
        myPositives.Set(testValue, true);
        if (testValue == 0)
        {
            myNegatives.Set(0, true);
        }
    }
}
// query setting of pairIsFound to see if a pair totals to zero.
Now I'm no statistician, but I think this is an O(n) algorithm. There is no sorting required, and the longest duration scenario is when no pairs exist and the whole integer array is iterated through.
Well - it's different, but I think it's the fastest solution posted so far.
Comments?
Maybe stick each number in a hash table, and if you see a negative one check for a collision? O(n). Are you sure the question isn't to find if ANY sum of elements in the array is equal to 0?
Given a sorted array you can find number pairs (-n and +n) by using two pointers:
the first pointer moves forward (over the negative numbers),
the second pointer moves backwards (over the positive numbers),
depending on the values the pointers point at, you move one of the pointers (the one whose absolute value is larger),
you stop as soon as the pointers meet or one of them passes 0,
equal magnitudes (one negative and one positive, or both zero) are a match.
Now, this is O(n), but sorting (if necessary) is O(n*log(n)).
EDIT: example code (C#)
// sorted array
var numbers = new[]
{
    -5, -3, -1, 0, 0, 0, 1, 2, 4, 5, 7, 10 , 12
};
var npointer = 0; // pointer to negative numbers
var ppointer = numbers.Length - 1; // pointer to positive numbers
while( npointer < ppointer )
{
    var nnumber = numbers[npointer];
    var pnumber = numbers[ppointer];
    // each pointer scans only its number range (neg or pos)
    if( nnumber > 0 || pnumber < 0 )
    {
        break;
    }
    // Do we have a match?
    if( nnumber + pnumber == 0 )
    {
        Debug.WriteLine( nnumber + " + " + pnumber );
    }
    // Adjust one pointer
    if( -nnumber > pnumber )
    {
        npointer++;
    }
    else
    {
        ppointer--;
    }
}
Interesting: we have 0, 0, 0 in the array. The algorithm will output two pairs. But in fact there are three pairs ... we need more specification what exactly should be output.
Here's a nice mathematical way to do it: keep a table of prime numbers, i.e. construct an array prime[0 .. m], where m is the largest absolute value in the input array, so that prime[i] stands for the i-th prime. (Below, n is the length of the input array.)
counter = 1
zeros = 0
for i in inputarray:
    if (i > 0):
        counter = counter * prime[i]
    if (i == 0):
        zeros = zeros + 1    # a lone 0 must not match itself
if (zeros >= 2):
    return "found"
for i in inputarray:
    if (i < 0):
        if (counter % prime[-i] == 0):
            return "found"
return "not found"
However, the problem when it comes to implementation is that storing and multiplying prime numbers is O(1) only in a traditional cost model; if the array (i.e. n) is large enough, this model is inappropriate.
However, it is a theoretical algorithm that does the job.
Here's a slight variation on IVlad's solution which I think is conceptually simpler, and also n log n but with fewer comparisons. The general idea is to start on both ends of the sorted array, and march the indices towards each other. At each step, only move the index whose array value is further from 0 -- in only Theta(n) comparisons, you'll know the answer.
sort the array (n log n)
loop, starting with i=0, j=n-1
    if a[i] == -a[j], then stop:
        if a[i] != 0 or i != j, report success, else failure
    if i >= j, then stop: report failure
    if abs(a[i]) > abs(a[j]) then i++ else j--
(Yeah, probably a bunch of corner cases in here I didn't think about. You can thank that pint of homebrew for that.)
e.g.,
[ -4, -3, -1,  0,  1,  2 ]    notes:
  ^i                  ^j      a[i]!=-a[j], i<j, abs(a[i])>abs(a[j])
      ^i              ^j      a[i]!=-a[j], i<j, abs(a[i])>abs(a[j])
          ^i          ^j      a[i]!=-a[j], i<j, abs(a[i])<abs(a[j])
          ^i      ^j          a[i]==-a[j] -> done
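A compact Python sketch of the two-index march traced above (the function name is hypothetical). Comparing the sum a[i] + a[j] with zero is used here in place of comparing absolute values. While a[i] is negative and a[j] is positive the two tests pick the same index to move, and once both values have the same sign the loop simply runs the indices together and reports failure:

```python
def has_zero_sum_pair_two_pointers(values):
    """Sort, then walk two indices inward from both ends."""
    a = sorted(values)
    i, j = 0, len(a) - 1
    while i < j:
        total = a[i] + a[j]
        if total == 0:
            return True
        if total < 0:
            i += 1  # the low end is further from zero, move it up
        else:
            j -= 1  # the high end is further from zero, move it down
    return False


print(has_zero_sum_pair_two_pointers([-4, -3, -1, 0, 1, 2]))  # prints True
```

After the sort, the scan itself makes at most n - 1 comparisons.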
The sum of two integers can only be zero if one is the negative of the other, like 7 and -7, or 2 and -2.

Algorithm get a new list containing no duplicated item by adding any 2 elements in a big array

I can only think of this naive algorithm. Any better way? C/C++, Ruby ,Haskell is OK.
arry = [1,5,.....4569895] //1000000 elements ,sorted , no duplicated
newArray = Hash.new
for (i = 0 ; i < arry.length ;i++ )
{
    for (j = 0 ; j < arry.length ;j ++ )
    {
        elem = arry[i] + arry[j]
        if (! newArray.key?(elem))
        {
            newArray [elem] = arry[i] + arry[j]
        }
    }
}
EDIT : sorry. I have discrete value in the array , instead of [1..1000000]
It would be more efficient to separate the algorithm into two distinct steps. (Warning: pseudocode ahead)
First, create n lists by adding the ith element to itself and to each of the elements after it. This can be done in parallel for each list. Note that the resulting lists will be sorted.
newArray = array(array.length);
for (i = 0; i < array.length; i++) {
    newArray[i] = array(array.length - i);
    for (j = 0; j < array.length - i; j++) {
        newArray[i][j] = array[i] + array[j + i];
    }
}
Second, use the merge step of merge sort to merge the resulting lists, dropping duplicates as you go. You can do this in parallel, e.g. merge the lists in pairs, and then again until you only have one list.
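The two steps can be sketched in Python with heapq.merge, which lazily merges already-sorted inputs. This sketch assumes the input is sorted, as stated in the question, and includes each element added to itself, as the question's loops do. The name distinct_pair_sums is made up:

```python
import heapq


def distinct_pair_sums(sorted_values):
    """Return the sorted distinct sums a[i] + a[j] for i <= j."""
    n = len(sorted_values)

    def row(i):
        # Sums of element i with itself and every later element; already sorted.
        base = sorted_values[i]
        return (base + sorted_values[j] for j in range(i, n))

    result = []
    for s in heapq.merge(*(row(i) for i in range(n))):
        if not result or result[-1] != s:  # equal sums arrive next to each other
            result.append(s)
    return result


print(distinct_pair_sums([1, 2, 4]))  # prints [2, 3, 4, 5, 6, 8]
```

This still does on the order of n^2 additions, so for a million elements it is as expensive as the hash version. What it saves is the hash table, at the cost of a log factor in the merge.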
If the condition says that you should be able to add any item in the range, then the only way I can think of is to check whether the sum is already in the result list, since for any number x there are x different additions that lead to x (or x/2 if you consider 1 + 2 and 2 + 1 to be the same addition).
There is one obvious optimization: make the second loop start at index i; that way you avoid computing both x+y and y+x.
Then, if you don't want to use a set, you could use the fact that the items are sorted: build N lists and merge them while removing the duplicates.
I'm afraid the best worst-case time complexity is O(n^2). For the input {2^0, 2^1, 2^2, ...}, you won't get any duplicates when adding these numbers. Assuming hash insertions are O(1), you already have the best algorithm...
