Find first unique number in an unsorted array - algorithm

I came across this question while going through previous interview questions. Any pointers on how to approach it?
Find first unique number in an unsorted array of 32 bit numbers
without using hash tables or array of counters.

Seeing that the input array is unsorted, you can solve the problem by sorting it. This is a bit silly - why give an answer to the question in the question itself? - but the technicalities of the sorting are a little interesting, so maybe this answer isn't trivial after all.
When looking at the array after sorting, you will find several numbers that are not equal to their predecessor and successor; from these, you want to choose the first one in the original array.
To do that efficiently, in your temporary array which is being sorted, for each number, store also the index of that number in the original array. So, at the end, choose the number which is not equal to its predecessor and successor, and which has the lowest index in the original array.
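A minimal sketch of the sort-with-indices idea described above. The function name `first_unique_by_sorting` is made up for illustration, and Python's built-in `sorted` stands in for whatever sort you would actually use:

```python
def first_unique_by_sorting(values):
    """Return the first value (in original order) that occurs exactly once, or None."""
    # Pair each value with its original index, then sort by value.
    pairs = sorted((v, i) for i, v in enumerate(values))
    best_index = None
    for pos, (v, i) in enumerate(pairs):
        same_as_prev = pos > 0 and pairs[pos - 1][0] == v
        same_as_next = pos + 1 < len(pairs) and pairs[pos + 1][0] == v
        if not same_as_prev and not same_as_next:
            # v is unique; keep it if it appears earliest in the input.
            if best_index is None or i < best_index:
                best_index = i
    return None if best_index is None else values[best_index]
```

The cost is dominated by the sort, so this runs in O(n log n) time with O(n) extra space for the temporary array.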

When you have to "do X without using Y", you can sometimes use Z, which has the same effect as Y, and argue that you were not using Y. Or you can disguise Y well enough so no one would recognize using it at first sight.
With that in mind, consider storing repetition counters for all the numbers in a trie. To choose the first number from the set of all unique numbers, store also the indices together with repetition counters.
I can claim that a trie is not an array of repetition counters, because you don't have to allocate and initialize 2^32 memory cells for the array. This is more like a glorified hashtable, but looks different enough.
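Here is one way the trie idea could look, as a rough sketch only. It assumes a binary trie with one level per bit; the node layout (two-slot lists) and the name `first_unique_with_trie` are choices made for this example, not something the question prescribes:

```python
def first_unique_with_trie(values, bits=32):
    """Return the first value occurring exactly once, using a binary trie of counters."""
    # Internal node: [child_for_bit_0, child_for_bit_1].
    # Leaf: [count, index_of_first_occurrence].
    root = [None, None]
    for index, value in enumerate(values):
        key = value & ((1 << bits) - 1)  # treat the value as an unsigned bit pattern
        node = root
        for shift in range(bits - 1, 0, -1):
            bit = (key >> shift) & 1
            if node[bit] is None:
                node[bit] = [None, None]
            node = node[bit]
        bit = key & 1
        if node[bit] is None:
            node[bit] = [0, index]
        node[bit][0] += 1

    # Walk the trie and keep the count-1 leaf with the lowest original index.
    best = None
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        for child in node:
            if child is None:
                continue
            if depth == bits - 1:
                count, first_index = child
                if count == 1 and (best is None or first_index < best):
                    best = first_index
            else:
                stack.append((child, depth + 1))
    return None if best is None else values[best]
```

Memory grows with the number of distinct values (at most `bits` nodes per value), not with the size of the 32-bit value range.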

Related

How to quickly search for a specified element in an ordered array consisting of only two types of elements?

The array mentioned in the question is as follows:
[1,1,...,1,1,-1,-1,...,-1,-1]
How to quickly find the index of the 1 closest to -1?
Note: Both 1 and -1 will exist at the same time, and the number of 1 and -1 is large.
For example, for an array like this:
[1,1,1,1,1,-1,-1,-1]
the result should be 4.
The fastest way I can think of is binary search, is there a faster way?
With the current representation of the data, binary search is the fastest way I can think of. Of course, you can cache and reuse the result in constant time, since the answer is always the same.
On the other hand, if you change the representation of the array, you can find the next element in constant time. Since every element can be mapped to a binary value, you can reduce the whole array to two numbers: the length of the first partition and the length of the second partition, or equivalently the length of the whole array and the partitioning point. This way you can change the length of either partition in constant time and have access to the next element of the second partition in constant time.
Of course, changing the representation of the array itself is a logarithmic process since you need to find the partitioning point.
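For reference, a small sketch of the binary search for the partitioning point. It assumes, as the question states, that both 1 and -1 are present; the name `last_one_index` is just illustrative:

```python
def last_one_index(arr):
    """Index of the last 1 in an array of the form [1, ..., 1, -1, ..., -1]."""
    lo, hi = 0, len(arr) - 1  # assumes arr[0] == 1 and arr[-1] == -1
    # Invariant: arr[lo] == 1 and arr[hi] == -1.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if arr[mid] == 1:
            lo = mid
        else:
            hi = mid
    return lo
```

For the example array `[1,1,1,1,1,-1,-1,-1]` this returns 4, the index asked for in the question.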
By a simple information-theoretic argument, you can't be faster than log(n) using only comparisons: there are n possible outcomes, each comparison yields at most one bit of information, and you need to collect at least log(n) bits to tell the outcomes apart.
If you have extra information about the statistical distribution of the values, then maybe you can exploit it. But this is to be discussed on a case-by-case basis.

algorithm to accomplish comparing two arrays with user define criteria

I want to compare the values of two float arrays, but my criterion differs from the usual ones. Here is how I define which array is the best.
Say we have two arrays named a and b. First, we compare the max values of the two arrays, and the array with the smaller max value wins. If the max values are equal, we split each array into two parts around its maximum: a[1:max_loc(a)-1] and a[max_loc(a)+1:len(a)], and similarly for b. Then we apply the same criterion to a[1:max_loc(a)-1] and b[1:max_loc(b)-1] to see which has the smaller max value. If those are equal as well, we split them into smaller arrays and repeat the comparison. We do the same for a[max_loc(a)+1:len(a)] and b[max_loc(b)+1:len(b)]. As soon as one array has a smaller max value on corresponding intervals, the program ends and prints out the best array.
What's the algorithm to fulfill this comparison?
P.S. these two arrays may have different length.
Most of the time, what you are looking for is already somewhere on the Internet:
https://www.ics.uci.edu/~eppstein/161/960118.html
There you get two examples with full explanations that follow the divide-and-conquer idea (MergeSort and QuickSort).
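The comparison described in the question could be sketched recursively along the following lines. Several details are assumptions, since the question leaves them open: the left parts are compared before the right parts, the first occurrence of the maximum is used as the split point, and an empty part is treated as smaller than any non-empty part. The name `compare_by_max` is made up here:

```python
def compare_by_max(a, b):
    """Return -1 if a wins (smaller max), 1 if b wins, 0 if the rule cannot separate them."""
    if not a and not b:
        return 0
    if not a or not b:
        # Assumption: an empty part counts as smaller than any non-empty part.
        return -1 if not a else 1
    max_a, max_b = max(a), max(b)
    if max_a != max_b:
        return -1 if max_a < max_b else 1
    # Equal maxima: split both arrays around the first position of the maximum.
    ia, ib = a.index(max_a), b.index(max_b)
    left = compare_by_max(a[:ia], b[:ib])
    if left != 0:
        return left
    return compare_by_max(a[ia + 1:], b[ib + 1:])
```

Because the slices shrink on every call, arrays of different lengths are handled naturally; for very long monotone arrays the recursion depth grows linearly, so an explicit stack may be preferable in practice.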

Best algorithm to find N unique random numbers in VERY large array

I have an array with, for example, 1000000000000 elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? The elements must be unique in the whole array, not just within the list of N (3 in my example) elements.
I read about reservoir sampling, but it only provides a method to pick random numbers, which can be non-unique.
If the odds of hitting a non-unique value are low, your best bet will be to select 3 random numbers from the array, then check each against the entire array to ensure it is unique - if not, choose another random sample to replace it and repeat the test.
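A rough sketch of this pick-and-verify loop, assuming the array fits in a Python list for the sake of the example. The function name, the `max_tries` cap, and the `rng` parameter are illustrative additions:

```python
import random

def pick_unique_by_rejection(elements, n, max_tries=1000, rng=random):
    """Pick n distinct values that each occur exactly once in elements."""
    chosen = set()
    for _ in range(max_tries):
        if len(chosen) == n:
            break
        candidate = elements[rng.randrange(len(elements))]
        if candidate in chosen:
            continue
        # Full scan: the candidate qualifies only if it occurs exactly once.
        if elements.count(candidate) == 1:
            chosen.add(candidate)
    if len(chosen) < n:
        raise ValueError("not enough unique values found within max_tries")
    return list(chosen)
```

Each accepted or rejected candidate costs one full scan of the array, which is why this only pays off when non-unique values are rare.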
If the odds of hitting a non-unique value are high, this increases the number of times you'll need to scan the array looking for uniqueness and makes the simple solution non-optimal. In that case you'll want to split the task of ensuring unique numbers from the task of making a random selection.
Sorting the array is the easiest way to find duplicates. Most sorting algorithms are O(n log n), but since your keys are integers Radix sort can potentially be faster.
Another possibility is to use a hash table to find duplicates, but that will require significant space. You can use a smaller hash table or Bloom filter to identify potential duplicates, then use another method to go through that smaller list.
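import random

# Counting array with one slot per possible value
# (MININT, MAXINT and Elements are assumed to be defined elsewhere).
counts = [0] * (MAXINT - MININT + 1)
for value in Elements:
    counts[value - MININT] += 1
# Values that occur exactly once in the whole array.
uniques = [i + MININT for i, c in enumerate(counts) if c == 1]
result = random.sample(uniques, 3)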
I assume that you have a reasonable idea what fraction of the array values are likely to be unique. So you would know, for instance, that if you picked 1000 random array values, the odds are good that one is unique.
Step 1. Pick 3 random hash algorithms. They can all be the same algorithm, except that you add different integers to each as a first step.
Step 2. Scan the array. Hash each integer all three ways, and for each hash algorithm, keep track of the X lowest hash codes you get (you can use a priority queue for this), and keep a hash table of how many times each of those integers occurs.
Step 3. For each hash algorithm, look for a unique element in that bucket. If it is already picked in another bucket, find another. (Should be a rare boundary case.)
That is your set of three random unique elements. Every unique triple should have even odds of being picked.
(Note: For many purposes it would be fine to just use one hash algorithm and find 3 things from its list...)
This algorithm will succeed with high likelihood in one pass through the array. What is better yet is that the intermediate data structure that it uses is fairly small and is amenable to merging. Therefore this can be parallelized across machines for a very large data set.
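A simplified sketch of the idea, using the single-hash variant mentioned in the note above rather than three separate hash functions. The helper names, the use of `hashlib.blake2b` as the hash, and the `salt` parameter are assumptions made for this example:

```python
import hashlib
import heapq

def salted_hash(value, salt):
    """64-bit hash of an integer; different salts act as different hash functions."""
    digest = hashlib.blake2b(str(value).encode(), digest_size=8,
                             salt=salt.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")

def sample_unique(elements, n, x, salt=0):
    """One pass: pick up to n values that occur exactly once, via the x lowest hash codes."""
    heap = []    # max-heap (negated hashes) holding the x lowest hash codes seen so far
    counts = {}  # occurrence counts for the values currently in the heap
    for value in elements:
        if value in counts:
            counts[value] += 1
            continue
        h = salted_hash(value, salt)
        if len(heap) < x:
            heapq.heappush(heap, (-h, value))
            counts[value] = 1
        elif h < -heap[0][0]:
            # New value beats the current worst: evict the largest hash.
            _, evicted = heapq.heapreplace(heap, (-h, value))
            del counts[evicted]
            counts[value] = 1
    lowest_first = sorted(heap, reverse=True)  # smallest hash code first
    return [v for _, v in lowest_first if counts[v] == 1][:n]
```

An evicted value can never re-enter, because the eviction threshold only decreases, so the counts of the surviving values are exact. Choosing a fresh random `salt` per run is what makes the selection random; with a fixed salt the result is deterministic.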

First pair of numbers adding to a specific value in a stream

There is a stream of integers coming through. The problem is to find the first pair of numbers from the stream that adds up to a specific value (say, k).
With static arrays, one can use either of the below approaches:
Approach (1): Sort the array, use two pointers to beginning and end of array and compare.
Approach (2): Use hashing, i.e. if A[i]+A[j]=k, then A[j]=k-A[i]. Search for A[j] in the hash table.
But neither of these approaches scales well for streams. Any thoughts on efficiently solving this?
I believe that there is no way to do this that doesn't use at least O(n) memory, where n is the number of elements that appear before the first pair that sums to k. I'm assuming that we are using a RAM machine, but not a machine that permits awful bitwise hackery (in other words, we can't do anything fancy with bit packing.)
The proof sketch is as follows. Suppose that we don't store all of the n elements that appear before the first pair that sums to k. Then when we see the nth element, which sums with some previous value to get k, there is a chance that we will have discarded the previous element that it pairs with and thus won't know that the sum of k has been reached. More formally, suppose that an adversary could watch what values we were storing in memory as we looked at the first n - 1 elements and noted that we didn't store some element x. Then the adversary could set the next element of the stream to be k - x and we would incorrectly report that the sum had not yet been reached, since we wouldn't remember seeing x.
Given that we need to store all the elements we've seen, without knowing more about the numbers in the stream, a very good approach would be to use a hash table that contains all of the elements we've seen so far. Given a good hash table, this would take expected O(n) memory and O(n) time to complete.
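As a small sketch of the hash-table approach, assuming the stream is any Python iterable and using a built-in `set` as the hash table (the function name is illustrative):

```python
def first_pair_with_sum(stream, k):
    """Return the first (earlier, later) pair from the stream that sums to k, or None."""
    seen = set()  # every element observed so far
    for value in stream:
        if k - value in seen:
            return (k - value, value)
        seen.add(value)
    return None
```

Each element is processed once with an expected O(1) lookup, and the loop stops as soon as the first qualifying pair is complete.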
I am not sure whether there is a more clever strategy for solving this problem if you make stronger assumptions about the sorts of numbers in the stream, but I am fairly confident that this is asymptotically ideal in terms of time and space.
Hope this helps!

Find a common element within N arrays

If I have N arrays, what is the best way (in terms of time complexity; space is not important) to find the common elements? You could just find 1 element and stop.
Edit: The elements are all Numbers.
Edit: These are unsorted. Please do not sort and scan.
This is not a homework problem. Somebody asked me this question a long time ago. He was using a hash to solve the problem and asked me if I had a better way.
Create a hash index, with elements as keys, counts as values. Loop through all values and update the count in the index. Afterwards, run through the index and check which elements have count = N. Looking up an element in the index should be O(1), combined with looping through all M elements should be O(M).
If you want to keep order specific to a certain input array, loop over that array and test the element counts in the index in that order.
Some special cases:
if you know that the elements are (positive) integers with a maximum value that is not too high, you could just use a normal array as the "hash" index to keep counts, where the numbers themselves are the array indices.
I've assumed that in each array each number occurs only once. Adapting it for more occurrences should be easy (set the i-th bit in the count for the i-th array, or only update if the current element count == i-1).
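A sketch of the counting index, using the "only update if the current count matches the number of arrays already processed" variant so that repeated values inside one array are not counted twice. The function name and the use of a Python `dict` as the hash index are choices made for this example:

```python
def common_elements(arrays):
    """Return the elements present in every array, in the order of the first array."""
    counts = {}
    for i, array in enumerate(arrays):
        for value in array:
            # Count a value for array i only if it was seen in all i earlier
            # arrays and has not already been counted for this one.
            if counts.get(value, 0) == i:
                counts[value] = i + 1
    n = len(arrays)
    result, emitted = [], set()
    for value in (arrays[0] if arrays else []):
        if counts.get(value) == n and value not in emitted:
            result.append(value)
            emitted.add(value)
    return result
```

With this rule a value's count equals the number of leading arrays that contain it, so a count of N means it appears in all of them.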
EDIT: when I answered the question, it did not yet include the part asking for "a better way" than hashing.
The most direct method is to intersect the first 2 arrays and then intersecting this intersection with the remaining N-2 arrays.
If 'intersection' is not defined in the language in which you're working or you require a more specific answer (ie you need the answer to 'how do you do the intersection') then modify your question as such.
Without sorting there isn't an optimized way to do this based on the information given. (ie sorting and positioning all elements relatively to each other then iterating over the length of the arrays checking for defined elements in all the arrays at once)
The question asks whether there is a better way than hashing. There is no better way (i.e. better time complexity) than hashing, as the time to hash each element is typically constant. Empirical performance is also favorable, particularly if the range of values can be mapped one to one to an array maintaining counts. The time is then proportional to the number of elements across all the arrays. Sorting will not give better complexity, since it still needs to visit each element at least once, and then there is the log N factor for sorting each array.
Back to hashing, from a performance standpoint, you will get the best empirical performance by not processing each array fully, but processing only a block of elements from each array before proceeding onto the next array. This will take advantage of the CPU cache. It also results in fewer elements being hashed in favorable cases when common elements appear in the same regions of the array (e.g. common elements at the start of all arrays.) Worst case behaviour is no worse than hashing each array in full - merely that all elements are hashed.
I don't think the approach suggested by catchmeifyoutry will work.
Let us say you have two arrays
1: {1,1,2,3,4,5}
2: {1,3,6,7}
then the answer should be 1 and 3. But if we use the hash table approach, 1 will have count 3 and we will never find 1 in this situation.
The problem also becomes more complex if we have input like this:
1: {1,1,1,2,3,4}
2: {1,1,5,6}
Here I think we should give the output 1,1. The suggested approach fails in both cases.
Solution:
Read the first array and put its elements into a hash table. If we find the same key again, don't increment the counter. Read the second array in the same manner. Now the hash table holds the common elements, which have a count of 2.
But again, this approach will fail on the second input set I gave earlier.
I'd first start with the degenerate case, finding common elements between 2 arrays (more on this later). From there I'll have a collection of common values which I will use as an array itself and compare it against the next array. This check would be performed N-1 times or until the "carry" array of common elements drops to size 0.
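The "carry" idea can be sketched as follows. Note that this example uses Python sets for the pairwise intersection, which are hash-based; that is a substitution made for brevity here, and the function name is illustrative:

```python
def common_by_folding(arrays):
    """Intersect the arrays one at a time, stopping early if nothing is left."""
    if not arrays:
        return set()
    carry = set(arrays[0])  # common elements found so far
    for array in arrays[1:]:
        carry &= set(array)
        if not carry:
            break  # nothing in common; later arrays cannot change that
    return carry
```

The early exit means that arrays after the point where the carry becomes empty are never examined.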
One could speed this up, I'd imagine, by divide-and-conquer, splitting the N arrays into the end nodes of a tree. The next level up the tree is N/2 common element arrays, and so forth and so on until you have an array at the top that is either filled or not. In either case, you'd have your answer.
Without sorting and scanning, the best operational speed you'll get for comparing 2 arrays for common elements is O(N^2).
