Number of different elements in an array - algorithm

Is it possible to compute the number of different elements in an array in linear time and constant space? Say it's an array of long integers, and you cannot allocate an array with one slot per possible long value.
P.S. Not homework, just curious. I've got a book that sort of implies that it is possible.

This is the element uniqueness problem, for which the lower bound is Ω(n log n) in comparison-based models. The obvious hashing and bucket-sorting solutions all require linear space too, so I'm not sure this is possible.

You can't use constant space. You can use O(number of different elements) space; that's what a HashSet does.

You can use any sorting algorithm and then count the adjacent elements that differ in the sorted array, as sketched below.
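A minimal Python sketch of that sort-then-scan idea (it sorts in place, so the input is modified):

def count_distinct(arr):
    if not arr:
        return 0
    arr.sort()  # in-place sort, O(n log n)
    # one for the first element, plus one wherever the value changes
    return 1 + sum(1 for a, b in zip(arr, arr[1:]) if a != b)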

I do not think this can be done in linear time. One algorithm that solves it in O(n log n) first sorts the array (after which the comparisons become trivial).

If you are guaranteed that the numbers in the array are bounded below and above by, say, a and b, then you could allocate an array of size b - a + 1 and use it to keep track of which numbers have been seen.
That is, you would move through your input array, take each number, and mark true in your target array at that spot. You would increment a counter of distinct numbers only when you encounter a number whose position in the storage array is still false.
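A short sketch of that bounded-range counter (assuming every value lies in [a, b]):

def count_distinct_bounded(arr, a, b):
    seen = [False] * (b - a + 1)  # one flag per possible value
    distinct = 0
    for x in arr:
        if not seen[x - a]:
            seen[x - a] = True
            distinct += 1
    return distinct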

Assuming we can partially destroy the input, here's an algorithm for n words of O(log n) bits.
Find the element of rank sqrt(n) via linear-time selection. Partition the array using this element as a pivot (O(n)). Using brute force, count the number of different elements in the partition of length sqrt(n); this is O(sqrt(n)^2) = O(n). Now use an in-place radix sort on the rest, where each "digit" is log(sqrt(n)) = log(n)/2 bits and we use the first partition to store the digit counts.
If you consider streaming algorithms only ( http://en.wikipedia.org/wiki/Streaming_algorithm ), then it's impossible to get an exact answer with o(n) bits of storage via a communication complexity lower bound ( http://en.wikipedia.org/wiki/Communication_complexity ), but possible to approximate the answer using randomness and little space (Alon, Matias, and Szegedy).
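As a rough illustration of the approximate-counting idea, here is a Flajolet-Martin style sketch in Python. It is illustrative only: it is not the exact algorithm from the Alon-Matias-Szegedy paper, and xor with a random mask is a stand-in for a properly random hash function.

import random

def approx_distinct(stream, trials=64):
    # one pass over the data, O(trials) words of state
    masks = [random.getrandbits(64) for _ in range(trials)]
    max_zeros = [0] * trials
    for x in stream:
        for i, m in enumerate(masks):
            h = (hash(x) ^ m) & ((1 << 64) - 1)
            if h:
                tz = (h & -h).bit_length() - 1  # trailing zero bits of h
                if tz > max_zeros[i]:
                    max_zeros[i] = tz
    # each trial estimates the count as 2^(max trailing zeros);
    # the median across trials tames the variance
    estimates = sorted(2 ** z for z in max_zeros)
    return estimates[trials // 2]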

This can be done with a bucket approach if you assume there is only a constant number of different values. Make a flag for each value (still constant space). Traverse the list and flag each value that occurs; if you go to flag an already-flagged value, you've found a duplicate. Checking a constant number of flags per element is O(1), so the whole pass is still linear time.

Related

Why can't we apply counting sort to general arrays?

Counting sort is known to run in linear time if we know that all elements in the array are bounded above by a given number. If we take a general array, can't we just scan it in linear time to find the maximum value, and then apply counting sort?
It is not enough to know the upper bound to run a counting sort: you need to have enough memory to fit all the counters.
Consider a situation where you go through an array of 64-bit integers and find out that the largest element is 2^60. This would mean two things:
- You need O(2^60) memory, and
- it is going to take O(2^60) time to complete the sort.
The fact that O(2^60) is the same as O(1) is of little help here, because the constant factor is simply too large. This is very often a problem with pseudo-polynomial time algorithms.
Suppose the largest number is like 235684121.
Then you'll spend incredible amounts of RAM to keep your buckets.
I would like to add something to @dasblinkenlight's and @AlbinSunnanbo's answers: your idea to scan the array in an O(n) pass to find the maximum value is okay. From Wikipedia:
However, if the value of k is not already known then it may be computed by an additional loop over the data to determine the maximum key value that actually occurs within the data.
As the time complexity is O(n + k) and k must stay under a certain limit, the k you find should be small. As @dasblinkenlight mentioned, O(large_value) can't practically be treated as O(1).
Though I don't know of any major applications of counting sort so far other than as a subroutine of radix sort, it can be nicely used in problems like string sorting (e.g., sorting "android" to "addinor"), as here k is only 255 (the maximum byte value).
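A small Python sketch of that string case, counting over the 256 possible byte values:

def counting_sort_str(s):
    counts = [0] * 256  # one counter per possible byte value
    for ch in s:
        counts[ord(ch)] += 1
    return ''.join(chr(v) * counts[v] for v in range(256))

print(counting_sort_str("android"))  # -> addinor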

data structure similar to array but supporting deletion

I am thinking of the following data structure question:
given integers between 1 and n in sorted order, every operation queries and then removes (in a single call) the kth smallest number. How can the query and removal both be made constant time operations?
It is similar to an array structure but requires constant-time removal. An order-statistics balanced binary tree can do this, but at O(lg n) complexity.
Can one take the advantage of the range property (numbers only between 1 and n) to make it work?
LinkedHashSet is what you are looking for. If you want an index as in arrays, then use LinkedHashMap. But you need to insert them in order from 1 to n.
What is the maximal value of N? You mentioned that you are going to work with positive numbers; a Van Emde Boas tree is probably the best choice for you.
Short description:
- allows storing only non-negative numbers from [0, 2^k), where k is the number of bits required to store the maximal number N.
- all operations (insert, delete, lookup, find_next, find_prev) work in O(log k), not O(log N). So for 32-bit integers the complexity is log(32) = 5.
- the disadvantage is memory consumption: it requires 2^k ~ O(N) memory, so for storing 32-bit integers you need ~1 GB of RAM. Remember that usually O(N) memory means O(number of elements), but here it means O(maximal stored value).
Note: I'm not sure about supporting k-th element query but description looks nice:
FindNext: find the key/value pair with the smallest key at least a given k
FindPrevious: find the key/value pair with the largest key at most a given k
UPDATE
As Dukeling mentioned below, a k-th element query is not supported. The only way I see to implement it:
int x = getMin();
for (int i = 0; i < k - 1; i++) x = getNext(x);
After this loop, x will store the k-th element. But the complexity is O(k * log(bits)), which is too bad for large values of k.

Finding number of pairs of integers differing by a value

If we have an array of integers, is there any way more efficient than O(n^2) to find the number of pairs of integers that differ by a given value?
E.g., for the array 4, 2, 6, 7, the number of pairs of integers differing by 2 is 2: {(2,4), (4,6)}.
Thanks.
Create a set from your list. Create another set which has all the elements incremented by the delta. Intersect the two sets. These are the upper values of your pairs.
In Python:
>>> s = [4,2,6,7]
>>> d = 2
>>> s0 = set(s)
>>> sd = set(x+d for x in s0)
>>> set((x-d, x) for x in (s0 & sd))
set([(2, 4), (4, 6)])
Creating the sets is O(n). Intersecting the sets is also O(n), so this is a linear-time algorithm.
Store the elements in a multiset, implemented by a hash table. Then for each element n, check the number of occurrences of n-2 in the multiset and sum them up. There is no need to also check n+2, because that would cause you to count each pair twice.
The time efficiency is O(n) in the average case, and O(n log n) or O(n^2) in the worst case (depending on the hash table implementation). It will be O(n log n) if the multiset is implemented by a balanced tree.
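A Python sketch of that multiset approach, using collections.Counter as the hash-table multiset (assumes d > 0, as in the example):

from collections import Counter

def count_pairs(arr, d):
    counts = Counter(arr)  # multiset via hash table
    # pair each element with the occurrences of n - d;
    # skipping n + d avoids counting each pair twice
    return sum(counts[n - d] for n in arr)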
Sort the array, then scan through with two pointers. Supposing the first one points to a, then step the second one forward until you've found where a+2 would be if it was present. Increment the total if it's there. Then increment the first pointer and repeat. At each step, the second pointer starts from the place it ended up on the previous step.
If duplicates are allowed in the array, then you need to remember how many duplicates the second one stepped over, so that you can add this number to the total if incrementing the first pointer yields the same integer again.
This is O(n log n) worst case (for the sort), since the scan is linear time.
It's O(n) worst case on the same basis that hashtable-based solutions for fixed-width integers can say that they're expected O(n) time, since sorting fixed-width integers can be done using radix sort in O(n). Which is actually faster is another matter -- hashtables are fast but might involve a lot of memory allocation (for nodes) and/or badly-localized memory access, depending on implementation.
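A sketch of that two-pointer scan, deduplicating first for simplicity so each value pair is counted once (assumes d > 0):

def count_pairs_sorted(arr, d):
    vals = sorted(set(arr))  # O(n log n); dedupe to count value pairs once
    total = 0
    j = 0
    for i in range(len(vals)):
        # step the second pointer to where vals[i] + d would be
        while j < len(vals) and vals[j] < vals[i] + d:
            j += 1
        if j < len(vals) and vals[j] == vals[i] + d:
            total += 1
    return total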
Note that if the desired difference is 0 and all the elements in the array are identical, then the size of the output is O(n^2), so the worst case of any algorithm that lists the pairs is necessarily O(n^2). (On the other hand, average-case or expected-case behavior can be significantly better, as others have noted.)
Just tally the numbers into an array as you do in counting sort. Then take two indices, the first pointing to index 0 and the other pointing to index 2 (or index d in the general case).
Now check whether the values at both indices are non-zero: if so, increment the counter (by one to count each value pair once, as in the example above, or by the product of the two counts to count every occurrence pair); otherwise leave the counter unchanged, as the pair does not exist. Now increment both indices and continue until the second index reaches the end of the array. The final value of the counter is the number of pairs with difference d.
Time complexity: O(n)
Space complexity: O(max value), for the tally array
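A sketch of that tally-array approach (assumes non-negative integers and d > 0; counts each value pair once, matching the example):

def count_pairs_tally(arr, d):
    counts = [0] * (max(arr) + 1)  # tally array, as in counting sort
    for x in arr:
        counts[x] += 1
    # slide two indices, i and i + d, across the tally array
    return sum(1 for i in range(len(counts) - d)
               if counts[i] and counts[i + d])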

Is it possible to find two numbers whose difference is minimum in O(n) time

Given an unsorted integer array, and without making any assumptions on the numbers in the array:
Is it possible to find two numbers whose difference is minimum in O(n) time?
Edit: Difference between two numbers a, b is defined as abs(a-b)
Find the smallest and largest elements in the list. The difference smallest-largest will be minimum.
If you're looking for a nonnegative difference, then this is of course at least as hard as checking whether the array has two equal elements. This is called the element uniqueness problem and, without any additional assumptions (like limiting the size of the integers or allowing operations other than comparison), it requires Ω(n log n) time. It is the 1-dimensional case of finding the closest pair of points.
I don't think you can do it in O(n). The best I can come up with off the top of my head is to sort them (which is O(n log n)) and find the minimum difference of adjacent pairs in the sorted list (which adds another O(n)).
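A quick Python sketch of that sort-then-scan approach (assumes at least two elements):

def min_difference(arr):
    arr = sorted(arr)  # O(n log n)
    # O(n) scan over adjacent pairs of the sorted list
    return min(b - a for a, b in zip(arr, arr[1:]))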
I think it is possible. The secret is that you don't actually have to sort the list, you just need to create a tally of which numbers exist. This may count as "making an assumption" from an algorithmic perspective, but not from a practical perspective. We know the ints are bounded by a min and a max.
So, create an array of 2 bit elements, 1 pair for each int from INT_MIN to INT_MAX inclusive, set all of them to 00.
Iterate through the entire list of numbers. For each number in the list, if the corresponding 2 bits are 00 set them to 01. If they're 01 set them to 10. Otherwise ignore. This is obviously O(n).
Next, if any pair of bits is set to 10, that is your answer: the minimum distance is 0, because the list contains a repeated number. If not, scan through the tally array and find the minimum distance between marked entries. Many people have already pointed out there are simple O(n) algorithms for this.
So O(n) + O(n) = O(n).
Edit: responding to comments.
Interesting points. I think you could achieve the same results without making any assumptions by finding the min/max of the list first and using a sparse array ranging from min to max to hold the data. That takes care of the INT_MIN/INT_MAX assumption, the space complexity, and the O(m) time complexity of scanning the array.
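A sketch of that min/max-ranged tally (assuming the value range hi - lo is small enough to allocate):

def min_diff_tally(arr):
    lo, hi = min(arr), max(arr)
    tally = bytearray(hi - lo + 1)  # one byte per possible value
    for x in arr:
        if tally[x - lo]:  # already seen: duplicate, so distance 0
            return 0
        tally[x - lo] = 1
    best = hi - lo
    prev = None
    for i, seen in enumerate(tally):  # scan marked entries in value order
        if seen:
            if prev is not None:
                best = min(best, i - prev)
            prev = i
    return best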
The best I can think of is to counting sort the array (possibly combining equal values) and then do the sorted comparisons -- bin sort is O(n + M) (M being the number of distinct values). This has a heavy memory requirement, however. Some form of bucket or radix sort would be intermediate in time and more efficient in space.
Sort the list with radixsort (which is O(n) for integers), then iterate and keep track of the smallest distance so far.
(I assume your integer is a fixed-bit type. If they can hold arbitrarily large mathematical integers, radixsort will be O(n log n) as well.)
It seems to be possible to sort an unbounded set of integers in O(n*sqrt(log(log(n)))) time. After sorting, it is of course trivial to find the minimal difference in linear time.
But I can't think of any algorithm to make it faster than this.
No, not without making assumptions about the numbers/ordering.
It would be possible given a sorted list though.
I think the answer is no, and the proof is similar to the proof that you cannot sort faster than n lg n: you have to compare all of the elements, i.e., create a comparison tree, which implies an Ω(n lg n) algorithm.
EDIT. OK, if you really want to argue, then the question does not say whether it should be a Turing machine or not. With quantum computers, you can do it in linear time :)

Is this implementation of Bucket-Sort considered "in-place"?

Consider the following implementation of bucket-sort:
Algorithm BucketSort(S)
  input: Sequence S of items with integer keys in range [0, N-1]
  output: Sequence S sorted in nondecreasing order of keys.
  let B be an array of N sequences, each of which is initially empty
  for each item x in S do
    let k be the key of x
    remove x from S and insert it at the end of bucket (sequence) B[k].
  for i ← 0 to N-1 do
    for each item x in sequence B[i] do
      remove x from B[i] and insert it at the end of S.
Is this implementation considered "In-Place"?
My textbook gives the following definition for "In-Place":
Remember that a sorting algorithm is in-place if it uses only a constant amount of memory in addition to that needed for the objects being sorted.
Now, I know the above algorithm uses O(n+N) memory, where N is the upper bound on the range. However, and I may be wrong, I think N would be a constant, even if it is a large one. So I'm guessing it is "in-place" per this definition but I'm unsure.
So given the above algorithm and definition of "in-place", is this implementation considered in-place?
The algorithm you have listed is decidedly not in-place.
You have another structure (B) which must grow to hold as many items as S, yet occupies separate storage. Because of this you need at least O(S) extra space. Just because you remove the values from S does not keep you from still needing the same amount of space in B.
Also, just because the number of buckets might be constant doesn't mean you can forget that all elements of S must end up in a different place (B). Note the case where len(S) > N.
If you want to do an in-place sort, you need to keep all your elements in S and shuffle them around such that the constant extra space is a temporary holder for a swap routine and possibly some stack memory if you are using a recursive solution.
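To make the extra storage concrete, here is a direct Python rendering of the pseudocode above (a sketch; the key function stands in for "the key of x"):

from collections import deque

def bucket_sort(S, N, key=lambda x: x):
    # B is an array of N initially-empty sequences: the O(N) extra storage
    B = [deque() for _ in range(N)]
    S = deque(S)
    while S:  # remove each x from S ...
        x = S.popleft()
        B[key(x)].append(x)  # ... and append it to bucket B[key(x)]
    for bucket in B:  # then drain the buckets back into S
        while bucket:
            S.append(bucket.popleft())
    return list(S)

For example, bucket_sort([3, 1, 2, 1], N=4) returns [1, 1, 2, 3], but only after allocating the N-bucket array B alongside the input.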
Bucketsort is definitely not an "in-place" sorting algorithm.
The whole idea is that elements sort themselves as they are moved to the buckets. In the worst of the good cases (sequential values, but no repetition) the additional space needed is as big as the original array.
The number of buckets N is only bounded (after remapping the keys) by the length n of the input sequence S, because every item in S might have a different key. Hence the algorithm requires linear additional space and is consequently not in-place. If the keys are not remapped, the additional space is not even bounded in terms of n.
