Given O(n) sets, what is the complexity of figuring out the distinct ones amongst them? - algorithm

I have an application where I have a list of O(n) sets.
Each set Set(i) is an n-vector. Suppose n=4, for instance,
Set(1) could be [0|1|1|0]
Set(2) could be [1|1|1|0]
Set(3) could be [1|1|0|0]
Set(4) could be [1|1|1|0]
I'd like to process these sets so that as output, I only get the unique ones amongst them. So, in the example above, I would get as output:
Set(1), Set(2), Set(3). Note that Set(4) is discarded since it is same as Set(2).
A rather brute-force way of figuring this out gives me a worst-case bound of O(n^3):
Given: Input List of size O(n)
Output List L = Set(1)
for(j = 2 to Length of Input List){ // Loop Outer, check if Set(j) should be added to L
    for(i = 1 to Length of L currently){ // Loop Inner
        check if L(i) is same as Set(j) // This step is O(n) since Set() has O(n) elements
        if(they are same) exit inner loop
        else
            if(i is length of L currently) // so, Set(j) is unique thus far
                Append Set(j) to L and exit inner loop
    }
}
There is no a priori bound on n: it can be arbitrarily large. This seems to preclude the use of a simple hash function that maps the binary set to a decimal number. I could be wrong.
Is there any other way this can be done with a better worst-case running time than O(n^3)?

O(n) sequences of length n make an input of size O(n^2). You won't get complexity better than that, since you may at least be required to read all the input. All sequences might be the same, for example, but you'd have to read them all to know that.
A binary sequence of length n can be inserted into a trie or radix tree, while checking whether or not it already exists, in O(n) time. That's O(n^2) for all the sequences together, so simply using a trie or radix tree to find duplicates is optimal.
See: https://en.wikipedia.org/wiki/Trie
and: https://en.wikipedia.org/wiki/Radix_tree
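As a rough illustration of the trie idea, here is a minimal Python sketch. The function name unique_bitstrings and the dict-of-dicts node layout are my own assumptions, not something from the answer; each vector is walked once, so the total work is proportional to the input size.

```python
_END = object()  # hypothetical sentinel key marking "a vector ends at this node"

def unique_bitstrings(vectors):
    """Return the distinct bit vectors in first-seen order, using a binary trie."""
    root = {}     # each trie node is a dict mapping a bit (0 or 1) to a child node
    unique = []
    for vec in vectors:
        node = root
        for bit in vec:
            # Walk down the trie, creating the child node if it is missing.
            node = node.setdefault(bit, {})
        if _END not in node:
            # No earlier vector ended here, so this one is new.
            node[_END] = True
            unique.append(vec)
    return unique
```

With the four example vectors from the question, the fourth one reaches a node that is already marked and is dropped.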

You may consider implementing your set using a balanced binary tree. The cost of inserting a new node into such a tree is O(lgm), where m is the number of elements in the tree. Duplicates would implicitly be weeded out because if we detect that such a node already exists, then it would just not be added.
In your example, the total number of lookup/insertion operations would be n*n, since there are n sets, and each set has n values. So, the overall time might scale as O(n^2*lg(n^2)). This outperforms O(n^3) by some amount.

First of all, these are not sets but bitstrings.
Next, for every bitstring you can convert it to a number and put that number in a hashset (or simply store the original bitstrings, most hashset implementations can do that). Afterwards, your hashset contains all the unique items. This takes expected O(N) time and O(N) space, where N is the total input size (n^2 bits here), since hashing one bitstring takes time proportional to its length. If you need to maintain the original order of the strings, then in the first loop check for each string whether it is in the hashset already, and if not, output it and insert it into the hashset.
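A minimal Python sketch of the order-preserving variant might look like this. The name dedupe_preserving_order is hypothetical, and converting each vector to a tuple is just one way to get a hashable key; the running time is an expected bound that relies on the hash set behaving well.

```python
def dedupe_preserving_order(bitstrings):
    """Keep the first occurrence of every bitstring, in input order."""
    seen = set()
    output = []
    for bits in bitstrings:
        key = tuple(bits)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            output.append(bits)
    return output
```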

If you can use O(n) extra space, you can try this:
First of all, let's assume the vectors are binary numbers, so 0110 becomes 6.
This works when the entries of the vectors are in {0,1}; otherwise you can use base 10 instead of base 2.
Converting all vectors into decimals would take O(4n).
For each converted number we'll map the vector by the decimal number. To implement this, we'll be using an n-sized hash-map.
HM <- n-sized hash-map
for each vector v:
    num <- decimal number converted from v
    map v into HM by num
loop over HM and take only one vector for each index
Runtime by step:
1. O(n) to create the hash-map
2. O(n*(4+1)), where 1 is the time for mapping and 4 is the vector length
3. O(n) for the final loop

Related

Removing items from a list - algorithm time complexity

Problem consists of two sorted lists with no duplicates of sizes n and m. First list contains strings that should be deleted from second list.
The simplest algorithm would have to do n×m operations (I believe the terminology for this is "quadratic time"?).
An improved solution would be to take advantage of the fact that both lists are sorted and, in future comparisons, skip strings with an index lower than the last deleted index.
I wonder what time complexity would that be?
Are there any solutions for this problem with better time complexity?
You should look into Merge sort. This is the basic idea behind why it works efficiently.
The idea is to scan the two lists together, which takes O(n+m) time:
Make a pointer x for the first list, say A, and another pointer y for the second list, say B. Set x=0 and y=0. While x < n and y < m, if A[x] < B[y], then add A[x] to the new merged list and increment x. Otherwise add B[y] to the new list and increment y. Once you hit x=n or y=m, tack on the remaining elements from B or A, respectively.
I believe the complexity would be O(n+m), because every item in each of the lists would be visited exactly once.
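The same simultaneous scan can be adapted from merging to deleting. The following Python sketch is only an illustration under the question's assumptions (both lists sorted, no duplicates); the function name remove_sorted and its parameter names are mine. Each pointer only moves forward, so the work is O(n+m).

```python
def remove_sorted(to_delete, items):
    """Return the entries of items that do not appear in to_delete.

    Both inputs are assumed to be sorted and free of duplicates.
    """
    result = []
    x = 0  # position in to_delete; it never moves backwards
    for item in items:
        # Skip deletion candidates that sort before the current item.
        while x < len(to_delete) and to_delete[x] < item:
            x += 1
        if x < len(to_delete) and to_delete[x] == item:
            x += 1               # matched: drop this item
        else:
            result.append(item)  # no match: keep it
    return result
```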
A counting/bucket sort algorithm would work where each string in the second list is a bucket.
You go through the second list (takes m time) and create your buckets. You then go through your first list (takes n time) and increment the number of occurrences. You would then have to go through each bucket again (takes m time) and only return strings that occur once. A Trie or a HashMap would work well for storing the buckets. Should be O(n+m+m). If you use a HashSet, in the second pass instead of incrementing a counter, you remove from the Set. It should be O(n+m+(m-n)).
Might it be O(m + log(n)) if binary search is used?

Find the N-th most frequent number in the array

Find the nth most frequent number in array.
(There is no limit on the range of the numbers)
I think we can
(i) store the occurrences of every element using maps in C++
(ii) build a max-heap in linear time from the occurrences (or frequencies) of the elements and then extract up to the N-th element,
Each extraction takes log(n) time to heapify.
(iii) we will get the frequency of the N-th most frequent number
(iv) then we can linear search through the hash to find the element having this frequency.
Time - O(NlogN)
Space - O(N)
Is there any better method ?
It can be done in linear time and space. Let T be the total number of elements in the input array from which we have to find the Nth most frequent number:
Count and store the frequency of every number in T in a map. Let M be the total number of distinct elements in the array. So, the size of the map is M. -- O(T)
Find Nth largest frequency in map using Selection algorithm. -- O(M)
Total time = O(T) + O(M) = O(T)
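To make the count-then-select idea concrete, here is a hedged Python sketch. It uses a randomized quickselect, which is linear only in expectation (a worst-case linear selection algorithm such as median of medians would be needed for a strict bound). The function name nth_most_frequent is hypothetical, and when several values tie at the N-th frequency it returns any one of them.

```python
import random
from collections import Counter

def nth_most_frequent(values, n):
    """Return a value whose frequency is the n-th largest (n is 1-based)."""
    counts = Counter(values)              # O(T) counting pass
    if not 1 <= n <= len(counts):
        raise ValueError("n must be between 1 and the number of distinct values")
    items = list(counts.items())          # M (value, frequency) pairs
    k = n - 1                             # 0-based rank by descending frequency
    while True:
        pivot = random.choice(items)[1]
        higher = [it for it in items if it[1] > pivot]
        equal = [it for it in items if it[1] == pivot]
        if k < len(higher):
            items = higher                # answer is among the more frequent ones
        elif k < len(higher) + len(equal):
            return equal[0][0]            # answer has exactly the pivot frequency
        else:
            k -= len(higher) + len(equal)
            items = [it for it in items if it[1] < pivot]
```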
Your method is basically right. You would avoid final hash search if you mark each vertex of the constructed heap with the number it represents. Moreover, it is possible to constantly keep watch on the fifth element of the heap as you are building it, because at some point you can get to a situation where the outcome cannot change anymore and the rest of the computation can be dropped. But this would probably not make the algorithm faster in the general case, and maybe not even in special cases. So you answered your own question correctly.
It depends on whether you want most effective, or the most easy-to-write method.
1) If you know that all numbers will be from 0 to 1000, you just make an array of 1001 zeros (occurrence counts), loop through your array and increment the right occurrence position. Then you sort these occurrence counts and select the Nth largest value.
2) You have a "bag" of unique items. You loop through your numbers and check whether each number is in the bag: if not, you add it; if it is, you just increment its number of occurrences. Then you pick the item with the Nth largest number of occurrences.
Bag can be linear array, BST or Dictionary (hash table).
The question is "N-th most frequent", so I think you cannot avoid sorting (or clever data structure), so best complexity can not be better than O(n*log(n)).
I just wrote a method in Java 8. This is not an efficient solution.
Create a frequency map of the elements.
Sort the map entries by value in descending order.
Skip the first N-1 entries, then take the next one.
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
// n is the 1-based rank by frequency (n = 1 returns the most frequent element).
private static Integer findMostNthFrequentElement(int[] inputs, int n) {
    return Arrays.stream(inputs).boxed()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
        .entrySet().stream().sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
        .skip(n - 1).findFirst().get().getKey();
}

Interview challenge: Find the different elements in two arrays

Stage 1: Given two arrays, say A[] and B[], how could you find out whether the elements of B are in A?
Stage 2: What if the size of A[] is 10000000000000... and B[] is much smaller than this?
Stage 3: What if the size of B[] is also 10000000000.....?
My answer is as follows:
Stage 1:
double for loop - O(N^2);
sort A[], then binary search - O(NlgN)
Stage 2:
using bit set, since the integer is 32bits....
Stage 3: ..
Do you have any good ideas?
Hash all elements in A [iterate over the array and insert the elements into a hash set], then iterate over B and check for each element whether it is in the hash set or not. You get an average run time of O(|A|+|B|).
You cannot get sub-linear complexity, so this solution is optimal for average-case analysis. However, since hashing is not O(1) in the worst case, you might get bad worst-case performance.
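For reference, a minimal Python sketch of this hash-set approach could look as follows. The function name elements_missing_from is hypothetical, and the O(|A|+|B|) figure is an average-case bound that depends on the hash set.

```python
def elements_missing_from(a, b):
    """Return the elements of b that do not occur in a, in b's order."""
    in_a = set(a)                             # one pass over A builds the hash set
    return [x for x in b if x not in in_a]    # one pass over B does the lookups
```

An empty result means that every element of B is in A.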
EDIT:
If you don't have enough space to store a hash set of the elements, you might want to consider a probabilistic solution using Bloom filters. The problem: there might be some false positives [but never a false negative]. The accuracy increases as you allocate more space for the Bloom filter.
The other solution is, as you said, to sort A, which takes O(n log n) time, and then use binary search on the sorted array for each element of B.
For the 3rd stage, you get the same complexity, O(n log n), with the same solution. It will take approximately twice as long as stage 2, but it is still O(n log n).
EDIT2:
Note that instead of using a regular hash, sometimes you can use a trie [depends on your element type]. For example, for ints, store the number as if it were a string, where each digit is like a character. With this solution, you get an O(|B|*num_digits+|A|*num_digits) solution, where num_digits is the number of digits in your numbers [if they are ints]. Assuming num_digits is bounded by a finite size, you get O(|A|+|B|) worst case.
Stage 1: make a hash set from A and iterate over B, checking if the current element B[i] exists in A (the same way that @amit proposed earlier). Complexity (averaged) - O(length(A) + length(B)).
Stage 2: make a hash set from B, then iterate over A and if current element exists in B, remove it from B. If after iterating B has at least 1 element, then not all B's element exist in A; otherwise A is complete superset of B. Complexity (averaged) - O(length(A) + length(B)).
Stage 3: sort both arrays in-place and iterate, searching for same numbers on current positions i and j for A[i] and B[j] (the idea must be obvious). Complexity - O(n*log n), where n = length(A).

Finding the repeated element

In an array of integers between 1 and 1,000,000 (or some much larger value), a single value occurs twice. How do you determine which one?
I think we can use a bitmap to mark the elements, and then traverse the array again to find the repeated element. But I think this process has high complexity. Is there any better way?
This sounds like homework or an interview question ... so rather than giving away the answer, here's a hint.
What calculations can you do on a range of integers whose answer you can determine ahead of time?
Once you realize the answer to this, you should be able to figure it out .... if you still can't figure it out ... (and it's not homework) I'll post the solution :)
EDIT: Ok. So here's the elegant solution ... if the list contains ALL of the integers within the range.
We know that all of the values between 1 and N must exist in the list. Using Gauss's formula we can quickly compute the expected sum of a range of integers:
Sum(1..N) = 1/2 * (1 + N) * Count(1..N).
Since we know the expected sum, all we have to do is loop through all the values and sum them. The difference between this sum and the expected sum is the duplicate value.
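A small Python sketch of the summation trick, under the stated assumption that the list holds every integer from 1 to N exactly once plus one extra copy of the duplicate (so its length is N+1). The function name find_duplicate_by_sum is hypothetical.

```python
def find_duplicate_by_sum(values):
    """Find the duplicate when values holds 1..N once each plus one repeat."""
    n = len(values) - 1            # N, since the list has N+1 entries
    expected = n * (n + 1) // 2    # Gauss's formula for 1 + 2 + ... + N
    return sum(values) - expected  # the surplus is the duplicated value
```

If the assumption does not hold (some values of the range are missing), the result is meaningless.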
EDIT: As others have commented, the question doesn't state that the range contains all of the integers ... in this case, you have to decide whether you want to optimize for memory or time.
If you want to perform the operation using O(1) storage, you can perform an in-place sort of the list. As you're sorting you have to check adjacent elements. Once you see a duplicate, you know you can stop. Optimal sorting is an O(n log n) operation on average, which establishes an upper bound for finding the duplicate in this manner.
If you want to optimize for speed, you can use an additional O(n) storage. Using a HashSet (or similar structure), insert values from your list until you determine you are inserting a duplicate into the HashSet. Inserting n items into a HashSet is an O(n) operation on average, which establishes that as an upper bound for this method.
You may try to use bits as a hash map:
1 at position k means that number k occurred before
0 at position k means that number k did not occur before
pseudocode:
0. assume that your array is A
1. initialize a bit array (there is a nice class in C# for this) of length 1000001, so that index 1000000 exists, filled with zeros
2. for each num in A:
       if bitarray[num]
           return num
       else
           bitarray[num] = 1
   end
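The pseudocode can be sketched in Python with a bytearray used as a packed bit array. This is only an illustration: the function name first_repeated and the max_value parameter are assumptions, and the values are assumed to lie between 0 and max_value inclusive.

```python
def first_repeated(values, max_value=1_000_000):
    """Return the first value seen for the second time, or None if there is none."""
    seen = bytearray(max_value // 8 + 1)   # one bit per possible value
    for num in values:
        byte_index = num >> 3              # which byte holds this value's bit
        mask = 1 << (num & 7)              # which bit inside that byte
        if seen[byte_index] & mask:
            return num                     # bit already set: num occurred before
        seen[byte_index] |= mask
    return None
```

For a range of 1,000,000 this needs about 125 KB, regardless of how many values are actually present.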
The time complexity of the bitmap solution is O(n) and it doesn't seem like you could do better than that. However it will take up a lot of memory for a generic list of numbers. Sorting the numbers is an obvious way to detect duplicates and doesn't require extra space if you don't mind the current order changing.
Assuming the array is of length n < N (i.e. not ALL integers are present -- in this case LBushkin's trick is the answer to this homework problem), there is no way to solve this problem using less than O(n) memory using an algorithm that just takes a single pass through the array. This is by reduction to the set disjointness problem.
Suppose I made the problem easier, and I promised you that the duplicate elements were in the array such that the first one was in the first n/2 elements, and the second one was in the last n/2 elements. Now we can think of playing a game in which two people each hold a string of n/2 elements, and want to know how many messages they have to send to be sure that none of their elements are the same. Since the first player could simulate the run of any algorithm that takes a pass through the array, and send the contents of its memory to the second player, a lower bound on the number of messages they need to send implies a lower bound on the memory requirements of any algorithm.
But it's easy to see in this simple game that they need to send n/2 messages to be sure that they don't hold any of the same elements, which yields the lower bound.
Edit: This generalizes to show that for algorithms that make k passes through the array and use memory m, that m*k = Omega(n). And it is easy to see that you can in fact trade off memory for time in this way.
Of course, if you are willing to use algorithms that don't simply take passes through the array, you can do better as suggested already: sort the array, then take 1 pass through. This takes time O(nlogn) and space O(1). But note curiously that this proves that any sorting algorithm that just makes passes through the array must take time Omega(n^2)! Sorting algorithms that break the n^2 bound must make random accesses.

Generate all subset sums within a range faster than O((k+N) * 2^(N/2))?

Is there a way to generate all of the subset sums s1, s2, ..., sk that fall in a range [A,B] faster than O((k+N)*2^(N/2)), where k is the number of sums there are in [A,B]? Note that k is only known after we have enumerated all subset sums within [A,B].
I'm currently using a modified Horowitz-Sahni algorithm. For example, I first call it for the smallest sum greater than or equal to A, giving me s1. Then I call it again for the next smallest sum greater than s1, giving me s2. Repeat this until we find a sum s(k+1) greater than B. There is a lot of computation repeated between each iteration, even without rebuilding the initial two 2^(N/2) lists, so is there a way to do better?
In my problem, N is about 15, and the magnitude of the numbers is on the order of millions, so I haven't considered the dynamic programming route.
Check the subset sum on Wikipedia. As far as I know, it's the fastest known algorithm, which operates in O(2^(N/2)) time.
Edit:
If you're looking for multiple possible sums, instead of just 0, you can save the end arrays and just iterate through them again (which is roughly an O(2^(N/2)) operation) and save re-computing them. The values of all the possible subsets don't change with the target.
Edit again:
I'm not wholly sure what you want. Are we running K searches for one independent value each, or looking for any subset that has a value in a specific range that is K wide? Or are you trying to approximate the second by using the first?
Edit in response:
Yes, you do get a lot of duplicate work even without rebuilding the list. But if you don't rebuild the list, that's not O(k * N * 2^(N/2)). Building the list is O(N * 2^(N/2)).
If you know A and B right now, you could begin iteration, and then simply not stop when you find the right answer (the bottom bound), but keep going until it goes out of range. That should be roughly the same as solving subset sum for just one solution, involving only +k more ops, and when you're done, you can ditch the list.
More edit:
You have a range of sums, from A to B. First, you solve subset sum problem for A. Then, you just keep iterating and storing the results, until you find the solution for B, at which point you stop. Now you have every sum between A and B in a single run, and it will only cost you one subset sum problem solve plus K operations for K values in the range A to B, which is linear and nice and fast.
s = *i + *j; if s > B then ++i; else if s < A then ++j; else { print s; ... what_goes_here? ... }
No, no, no. I get the source of your confusion now (I misread something), but it's still not as complex as what you had originally. If you want to find ALL combinations within the range, instead of one, you will just have to iterate over all combinations of both lists, which isn't too bad.
Excuse my use of auto. C++0x compiler.
#include <algorithm>
#include <vector>
std::vector<int> sums;
std::vector<int> firstlist;
std::vector<int> secondlist;
// Fill in first/secondlist.
std::sort(firstlist.begin(), firstlist.end());
std::sort(secondlist.begin(), secondlist.end());
// Since we want all in a range, rather than just the first, we need to check all combinations. Horowitz/Sahni is only designed to find one.
for (auto firstit = firstlist.begin(); firstit != firstlist.end(); ++firstit) {
    // Restart the inner iterator for every element of the first list.
    for (auto secondit = secondlist.begin(); secondit != secondlist.end(); ++secondit) {
        int sum = *firstit + *secondit;
        if (sum >= A && sum <= B) // A and B are the inclusive bounds of the range
            sums.push_back(sum);
    }
}
It's still not great. But it could be optimized if you know in advance that N is very large, for example, mapping or hashmapping sums to iterators, so that any given firstit can find any suitable partners in secondit, reducing the running time.
It is possible to do this in O(N*2^(N/2)), using ideas similar to Horowitz Sahni, but we try and do some optimizations to reduce the constants in the BigOh.
We do the following
Step 1: Split into sets of N/2, and generate all possible 2^(N/2) sets for each split. Call them S1 and S2. This we can do in O(2^(N/2)) (note: the N factor is missing here, due to an optimization we can do).
Step 2: Next sort the larger of S1 and S2 (say S1) in O(N*2^(N/2)) time (we optimize here by not sorting both).
Step 3: Find Subset sums in range [A,B] in S1 using binary search (as it is sorted).
Step 4: Next, for each sum in S2, find using binary search the sets in S1 whose union with this gives sum in range [A,B]. This is O(N*2^(N/2)). At the same time, find if that corresponding set in S2 is in the range [A,B]. The optimization here is to combine loops. Note: This gives you a representation of the sets (in terms of two indexes in S2), not the sets themselves. If you want all the sets, this becomes O(K + N*2^(N/2)), where K is the number of sets.
Further optimizations might be possible, for instance when sum from S2, is negative, we don't consider sums < A etc.
Since Steps 2,3,4 should be pretty clear, I will elaborate further on how to get Step 1 done in O(2^(N/2)) time.
For this, we use the concept of Gray Codes. Gray codes are a sequence of binary bit patterns in which each pattern differs from the previous pattern in exactly one bit.
Example: 00 -> 01 -> 11 -> 10 is a gray code with 2 bits.
There are gray codes which go through all possible N/2 bit numbers and these can be generated iteratively (see the wiki page I linked to), in O(1) time for each step (total O(2^(N/2)) steps), given the previous bit pattern, i.e. given current bit pattern, we can generate the next bit pattern in O(1) time.
This enables us to form all the subset sums, by using the previous sum and changing that by just adding or subtracting one number (corresponding to the differing bit position) to get the next sum.
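As a hedged illustration of Step 1, the following Python sketch enumerates all subset sums of one half in Gray-code order, so that each sum is obtained from the previous one with a single addition or subtraction. The function name subset_sums_gray is hypothetical; the sums come out in Gray-code order, not sorted.

```python
def subset_sums_gray(terms):
    """Return all 2^n subset sums of terms, visited in Gray-code order."""
    n = len(terms)
    sums = [0]       # the empty subset
    current = 0
    for i in range(1, 1 << n):
        gray = i ^ (i >> 1)                # i-th Gray code
        # Consecutive Gray codes differ in exactly one bit: the position of
        # the lowest set bit of i.
        bit = (i & -i).bit_length() - 1
        if (gray >> bit) & 1:
            current += terms[bit]          # the term entered the subset
        else:
            current -= terms[bit]          # the term left the subset
        sums.append(current)
    return sums
```

For example, with terms [3, 5] the visited subsets are {}, {3}, {3,5}, {5}, giving the sums 0, 3, 8, 5.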
If you modify the Horowitz-Sahni algorithm in the right way, then it's hardly slower than the original Horowitz-Sahni. Recall that Horowitz-Sahni works with two lists of subset sums: sums of subsets in the left half of the original list, and sums of subsets in the right half. Call these two lists of sums L and R. To obtain subsets that sum to some fixed value A, you can sort R, and then look up a number in R that matches each number in L using a binary search. However, the algorithm is asymmetric only to save a constant factor in space and time. It's a good idea for this problem to sort both L and R.
In my code below I also reverse L. Then you can keep two pointers into R, updated for each entry in L: A pointer to the last entry in R that's too low, and a pointer to the first entry in R that's too high. When you advance to the next entry in L, each pointer might either move forward or stay put, but they won't have to move backwards. Thus, the second stage of the Horowitz-Sahni algorithm only takes linear time in the data generated in the first stage, plus linear time in the length of the output. Up to a constant factor, you can't do better than that (once you have committed to this meet-in-the-middle algorithm).
Here is Python code with example input:
# Input
terms = [29371, 108810, 124019, 267363, 298330, 368607,
         438140, 453243, 515250, 575143, 695146, 840979, 868052, 999760]
(A, B) = (500000, 600000)
# Subset iterator stolen from Sage
def subsets(X):
    yield []
    pairs = []
    for x in X:
        pairs.append((2**len(pairs), x))
        for w in range(2**(len(pairs)-1), 2**len(pairs)):
            yield [x for m, x in pairs if m & w]
# Modified Horowitz-Sahni with toolow and toohigh indices
half = len(terms) // 2
L = sorted([(sum(S), S) for S in subsets(terms[:half])])
R = sorted([(sum(S), S) for S in subsets(terms[half:])])
(toolow, toohigh) = (-1, 0)
for (Lsum, S) in reversed(L):
    # Test the index bound first so that R is never read past its end.
    while toolow < len(R)-1 and R[toolow+1][0] < A-Lsum: toolow += 1
    while toohigh < len(R) and R[toohigh][0] <= B-Lsum: toohigh += 1
    for n in range(toolow+1, toohigh):
        print('+'.join(map(str, S+R[n][1])), '=', sum(S+R[n][1]))
"Moron" (I think he should change his user name) raises the reasonable issue of optimizing the algorithm a little further by skipping one of the sorts. Actually, because each list L and R is a list of sizes of subsets, you can do a combined generate and sort of each one in linear time! (That is, linear in the lengths of the lists.) L is the union of two lists of sums, those that include the first term, term[0], and those that don't. So actually you should just make one of these halves in sorted form, add a constant, and then do a merge of the two sorted lists. If you apply this idea recursively, you save a logarithmic factor in the time to make a sorted L, i.e., a factor of N in the original variable of the problem. This gives a good reason to sort both lists as you generate them. If you only sort one list, you have some binary searches that could reintroduce that factor of N; at best you have to optimize them somehow.
At first glance, a factor of O(N) could still be there for a different reason: If you want not just the subset sum, but the subset that makes the sum, then it looks like O(N) time and space to store each subset in L and in R. However, there is a data-sharing trick that also gets rid of that factor of O(N). The first step of the trick is to store each subset of the left or right half as a linked list of bits (1 if a term is included, 0 if it is not included). Then, when the list L is doubled in size as in the previous paragraph, the two linked lists for a subset and its partner can be shared, except at the head:
0
|
v
1 -> 1 -> 0 -> ...
Actually, this linked list trick is an artifact of the cost model and never truly helpful. Because, in order to have pointers in a RAM architecture with O(1) cost, you have to define data words with O(log(memory)) bits. But if you have data words of this size, you might as well store each word as a single bit vector rather than with this pointer structure. I.e., if you need less than a gigaword of memory, then you can store each subset in a 32-bit word. If you need more than a gigaword, then you have a 64-bit architecture or an emulation of it (or maybe 48 bits), and you can still store each subset in one word. If you patch the RAM cost model to take account of word size, then this factor of N was never really there anyway.
So, interestingly, the time complexity for the original Horowitz-Sahni algorithm isn't O(N*2^(N/2)), it's O(2^(N/2)). Likewise the time complexity for this problem is O(K+2^(N/2)), where K is the length of the output.
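The combined generate-and-sort step described above can be sketched in Python as follows. This is my own illustration, not the answerer's code: the function name sorted_subset_sums is hypothetical, and only the sums are produced, not the subsets. Each round merges the current sorted list with a shifted copy of itself, so the total work is 1 + 2 + 4 + ... + 2^n, which is linear in the length of the final list.

```python
def sorted_subset_sums(terms):
    """Return all subset sums of terms in sorted order, without a separate sort."""
    sums = [0]                                 # sums of subsets of the terms seen so far
    for t in terms:
        shifted = [s + t for s in sums]        # same subsets with t added; still sorted
        merged = []
        i = j = 0
        # Standard merge of two sorted lists.
        while i < len(sums) and j < len(shifted):
            if sums[i] <= shifted[j]:
                merged.append(sums[i])
                i += 1
            else:
                merged.append(shifted[j])
                j += 1
        merged.extend(sums[i:])
        merged.extend(shifted[j:])
        sums = merged
    return sums
```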

Resources