Find a number with even number of occurrences - algorithm

Given an array in which every number occurs an odd number of times, except for one number that occurs an even number of times, find the number with an even number of occurrences.
e.g.
1, 1, 2, 3, 1, 2, 5, 3, 3
Output should be:
2
The below are the constraints:
Numbers are not in range.
Do it in-place.
Required time complexity is O(N).
Array may contain negative numbers.
Array is not sorted.
With the above constraints, all my thoughts failed: comparison based sorting, counting sort, BST's, hashing, brute-force.
I am curious to know: Will XORing work here? If yes, how?

This problem has been occupying my subway rides for several days. Here are my thoughts.
If A. Webb is right and this problem comes from an interview or is some sort of academic problem, we should think about the (wrong) assumptions we are making, and maybe try to explore some simple cases.
The two extreme subproblems that come to mind are the following:
The array contains two values: one of them is repeated an even number of times, and the other is repeated an odd number of times.
The array contains n-1 different values: all values are present once, except one value that is present twice.
Maybe we should split the cases by how the number of different values grows with n.
If we suppose that the number of different values is O(1), each array would have m different values, with m independent from n. In this case, we could loop through the original array erasing and counting occurrences of each value. In the example it would give
1, 1, 2, 3, 1, 2, 5, 3, 3 -> First value is 1 so count and erase all 1
2, 3, 2, 5, 3, 3 -> Second value is 2, count and erase
-> Stop because 2 was found an even number of times.
This would solve the first extreme example with a complexity of O(mn), which evaluates to O(n).
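A minimal sketch of this count-and-erase loop, assuming a Python list as input. The function name `find_even_by_erasing` is hypothetical, and the sketch works on a copy for safety, whereas the idea described above would compact the original array in place.

```python
def find_even_by_erasing(values):
    """Repeatedly take the first value, count it, and erase all its copies.

    Runs in O(m * n) time for m distinct values, which is O(n) only when
    m is a constant independent of n (the assumption made in the text).
    """
    a = list(values)      # copy; the described approach would reuse the input array
    end = len(a)          # a[0:end] holds the values not yet erased
    while end > 0:
        target = a[0]
        count = 0
        write = 0
        for read in range(end):
            if a[read] == target:
                count += 1            # erase by not copying it forward
            else:
                a[write] = a[read]    # keep: compact towards the front
                write += 1
        end = write
        if count % 2 == 0:
            return target
    return None


print(find_even_by_erasing([1, 1, 2, 3, 1, 2, 5, 3, 3]))  # 2
```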
There's better: if the number of different values is O(1), we could count value appearances inside a hash map, go through them after reading the whole array and return the one that appears an even number of times. This would still be considered O(1) memory.
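The hash-map variant could look roughly like this. It is only a sketch: `find_even_by_counting` is a made-up name, and the extra memory is O(m) for m distinct values, so it respects the constant-memory constraint only under the assumption that m is O(1).

```python
from collections import Counter


def find_even_by_counting(values):
    """Count every value once, then return one with an even count."""
    counts = Counter(values)          # one pass over the array
    for value, count in counts.items():
        if count % 2 == 0:
            return value
    return None


print(find_even_by_counting([1, 1, 2, 3, 1, 2, 5, 3, 3]))  # 2
```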
The second extreme case would consist in finding the only repeated value inside an array.
This seems impossible in O(n), but there are special cases where we can: if the array has n elements and the values inside are 1..n-1 plus one repeated value (or some variant like all numbers between x and y). In this case, we sum all the values, subtract n(n-1)/2 from the sum, and what remains is the repeated value.
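For that special case the arithmetic is short enough to write out. This sketch assumes the input really is the values 1..n-1 plus exactly one repeated value; `find_repeated` is a hypothetical name.

```python
def find_repeated(values):
    """Assumes values holds 1..n-1 once each, plus one of them a second time."""
    n = len(values)
    # 1 + 2 + ... + (n-1) = n(n-1)/2, so the surplus is the repeated value.
    return sum(values) - n * (n - 1) // 2


print(find_repeated([3, 1, 2, 3, 4]))  # 3
```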
Solving the second extreme case with random values inside the array, or the general case where m is not constant on n, in constant memory and O(n) time seems impossible to me.
Extra note: here, XORing doesn't work because the number we want appears an even number of times and others appear an odd number of times. If the problem was "give the number that appears an odd number of times, all other numbers appear an even number of times" we could XOR all the values and find the odd one at the end.
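To make the note concrete, here is a small sketch (the name `xor_all` is just for illustration). XORing everything isolates the odd-count value in the reversed problem, but on the example from the question it returns the XOR of all the odd-count values, which is not the answer.

```python
from functools import reduce
from operator import xor


def xor_all(values):
    """XOR of all values; copies that occur an even number of times cancel out."""
    return reduce(xor, values, 0)


# Reversed problem: exactly one value occurs an odd number of times.
print(xor_all([4, 4, 7, 7, 7, 9, 9]))        # 7

# Original problem: 2 (the even one) cancels, leaving 1 ^ 3 ^ 5.
print(xor_all([1, 1, 2, 3, 1, 2, 5, 3, 3]))  # 7, not 2
```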
We could try to look for a method using this logic: we would need something like a function, that applied an odd number of times on a number would yield 0, and an even number of times would be identity. Don't think this is possible.

Introduction
Here is a possible solution. It is rather contrived and not practical, but then, so is the problem. I would appreciate any comments if I have holes in my analysis. If this was a homework or challenge problem with an “official” solution, I’d also love to see that if the original poster is still about, given that more than a month has passed since it was asked.
First, we need to flesh out a few ill-specified details of the problem. Time complexity required is O(N), but what is N? Most commentators appear to be assuming N is the number of elements in the array. This would be okay if the numbers in the array were of fixed maximum size, in which case Michael G’s solution of radix sort would solve the problem. But, I interpret constraint #1, in absence of clarification by the original poster, as saying the maximum number of digits need not be fixed. Therefore, if n (lowercase) is the number of elements in the array, and m the average length of the elements, then the total input size to contend with is mn. A lower bound on the solution time is O(mn) because this is the read-through time of the input needed to verify a solution. So, we want a solution that is linear with respect to total input size N = nm.
For example, we might have n = m, that is sqrt(N) elements of sqrt(N) average length. A comparison sort would take O( log(N) sqrt(N) ) < O(N) operations, but this is not a victory, because the operations themselves on average take O(m) = O(sqrt(N)) time, so we are back to O( N log(N) ).
Also, a radix sort would take O(mn) = O(N) if m were the maximum length instead of average length. The maximum and average length would be on the same order if the numbers were assumed to fall in some bounded range, but if not we might have a small percentage with a large and variable number of digits and a large percentage with a small number of digits. For example, 10% of the numbers could be of length m^1.1 and 90% of length m*(1-10%*m^0.1)/90%. The average length would be m, but the maximum length m^1.1, so the radix sort would be O(m^1.1 n) > O(N).
Lest there be any concern that I have changed the problem definition too dramatically, my goal is still to describe an algorithm with time complexity linear to the number of elements, that is O(n). But, I will also need to perform operations of linear time complexity on the length of each element, so that on average over all the elements these operations will be O(m). Those operations will be multiplication and addition needed to compute hash functions on the elements and comparison. And if indeed this solution solves the problem in O(N) = O(nm), this should be optimal complexity as it takes the same time to verify an answer.
One other detail omitted from the problem definition is whether we are allowed to destroy the data as we process it. I am going to do so for the sake of simplicity, but I think with extra care it could be avoided.
Possible Solution
First, the constraint that there may be negative numbers is an empty one. With one pass through the data, we will record the minimum element, z, and the number of elements, n. On a second pass, we will add (3-z) to each element, so the smallest element is now 3. (Note that a constant number of numbers might overflow as a result, so we should do a constant number of additional passes through the data first to test these for solutions.) Once we have our solution, we simply subtract (3-z) to return it to its original form. Now we have available three special marker values 0, 1, and 2, which are not themselves elements.
Step 1
Use the median-of-medians selection algorithm to determine the 90th percentile element, p, of the array A and partition the array into two sets S and T, where S has the 10% of n elements greater than p and T has the elements less than p. This takes O(n) steps, with each step taking O(m) time on average, for O(N) time in total. Elements matching p could be placed into either S or T, but for the sake of simplicity, run through the array once to test p itself as a candidate, then eliminate it by replacing it with 0. Set S originally spans indexes 0..s, where s is about 10% of n, and set T spans the remaining 90% of indexes s+1..n.
Step 2
Now we are going to loop through i in 0..s and for each element e_i we are going to compute a hash function h(e_i) into s+1..n. We’ll use universal hashing to get uniform distribution. So, our hashing function will do multiplication and addition and take linear time on each element with respect to its length.
We’ll use a modified linear probing strategy for collisions:
1. h(e_i) is occupied by a member of T (meaning A[ h(e_i) ] < p but is not a marker 1 or 2) or is 0. This is a hash table miss. Insert e_i by swapping elements from slots i and h(e_i).
2. h(e_i) is occupied by a member of S (meaning A[ h(e_i) ] > p) or markers 1 or 2. This is a hash table collision. Do linear probing until either encountering a duplicate of e_i or a member of T or 0.
   a. If a member of T, this is again a hash table miss, so insert e_i as in (1.) by swapping to slot i.
   b. If a duplicate of e_i, this is a hash table hit. Examine the next element. If that element is 1 or 2, we’ve seen e_i more than once already, so change a 1 into a 2 or vice versa to track its change in parity. If the next element is not 1 or 2, then we’ve only seen e_i once before. We want to store a 2 into the next element to indicate we’ve now seen e_i an even number of times. We look for the next “empty” slot, that is one occupied by a member of T which we’ll move to slot i, or a 0, and shift the elements back up to index h(e_i)+1 down so we have room next to h(e_i) to store our parity information. Note we do not need to store e_i itself again, so we’ve used up no extra space.
So basically we have a functional hash table with 9-fold the number of slots as elements we wish to hash. Once we start getting hits, we begin storing parity information as well, so we may end up with only 4.5-fold number of slots, still a very low load factor. There are several collision strategies that could work here, but since our load factor is low, the average number of collisions should also be low and linear probing should resolve them with suitable time complexity on average.
Step 3
Once we finished hashing elements of 0..s into s+1..n, we traverse s+1..n. If we find an element of S followed by a 2, that is our goal element and we are done. Any element e of S followed by another element of S indicates e was encountered only once and can be zeroed out. Likewise e followed by a 1 means we saw e an odd number of times, and we can zero out the e and the marker 1.
Rinse and Repeat as Desired
If we have not found our goal element, we repeat the process. Our 90th percentile partition will move the 10% of n remaining largest elements to the beginning of A and the remaining elements, including the empty 0-marker slots to the end. We continue as before with the hashing. We have to do this at most 10 times as we process 10% of n each time.
Concluding Analysis
Partitioning via the median-of-medians algorithm has time complexity of O(N), which we do 10 times, still O(N). Each hash operation takes O(1) on average since the hash table load is low and there are O(n) hash operations in total performed (about 10% of n for each of the 10 repetitions). Each of the n elements has a hash function computed for it, with time complexity linear to its length, so on average over all the elements O(m). Thus, the hashing operations in aggregate are O(mn) = O(N). So, if I have analyzed this properly, then on the whole this algorithm is O(N)+O(N)=O(N). (It is also O(n) if operations of addition, multiplication, comparison, and swapping are assumed to be constant time with respect to input.)
Note that this algorithm does not utilize the special nature of the problem definition that only one element has an even number of occurrences. That we did not utilize this special nature of the problem definition leaves open the possibility that a better (more clever) algorithm exists, but it would ultimately also have to be O(N).

See the following article: Sorting algorithm that runs in time O(n) and also sorts in place.
Assuming that the maximum number of digits is constant, we can sort the array in-place in O(n) time.
After that, it is a matter of counting each number's appearances in the sorted array, which on average takes about n/2 steps to find the one number whose number of occurrences is even.
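A sketch of the counting step after sorting. Note the assumption: Python's built-in `sorted` (a comparison sort, O(n log n), not in place) stands in here for the in-place O(n) sort from the article, which is not reproduced. Only the run-counting scan is the point, and `find_even_after_sorting` is a hypothetical name.

```python
def find_even_after_sorting(values):
    """Sort, then scan runs of equal values and return one with even length."""
    a = sorted(values)    # stand-in for the in-place linear-time sort
    i = 0
    while i < len(a):
        j = i
        while j < len(a) and a[j] == a[i]:
            j += 1                    # a[i:j] is one run of equal values
        if (j - i) % 2 == 0:
            return a[i]
        i = j
    return None


print(find_even_after_sorting([1, 1, 2, 3, 1, 2, 5, 3, 3]))  # 2
```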

Related

Number of elements in array greater than given number

Okay, so I know this has been asked countless times, because I googled it in every form possible, but I could not find an answer.
I have an array say A= {10, 9, 6, 11, 22 }. I have to find number of elements greater than 11.
I know this can be done using Modified Binary Search but I need to do it in O(1) time. Is this possible?
(Keeping in mind we are taking the elements as input, so maybe some pre-computation can be done while taking the input.)
Remove all the 0s from the array and count them. Now you know the result for input 0: n - count. Afterwards subtract 1 from all the remaining elements in the array. The goal of this step is to bring the numbers in the range of [0,999999999]. If the input is greater than 0 subtract one from it too otherwise return result immediately.
Sort the numbers and think of them as 9 digit strings (fill up with leading 0s).
Build the tree. Each node represents a digit. Each leaf has to store the amount of numbers greater than itself. I don't think the number of nodes will be too high. For the maximum n = 10^5 we can get about 5*10^5 nodes (10^5 different prefixes brings us down to about level 5 after that we have to have linked lists to the leaves 10^5 existing + 4*10^5 for the linked lists).
Now you have to go through all non-leaf nodes and for all the missing digits in the children create direct links to the next smaller leaf. About an additional 9*4*10^5 nodes if you represent the links as leaves with the same count as the next lower leaf.
I think now you can theoretically get O(1), because the complexity of the request doesn't depend on n and you will have to save much less than when creating a hash map. For the worst case you have to go down 9 nodes, this is a constant that is independent from n.
You might also consider first sorting the input and then inserting it in a Y-fast trie (https://en.wikipedia.org/wiki/Y-fast_trie), where each element will also point to its index in the sorted input, and thus the number of elements greater and lower than it. Y-fast tries support successor and predecessor lookup in O(log log M) time using O(n) space, where M is the range.
This answer makes the assumption that building the data structure itself does not have to be constant time, but only the retrieval part.
You can iterate through your array of numbers and build a binary tree. Each node in this tree will contain, in addition to the numerical value, two more points of data. These points will be the number of elements which each node is both greater than and less than. The insertion logic would be tricky, because this state would need to be maintained.
During insertion, while updating the counters for each node, we can also maintain a hashmap indexed by value. The keys would be the numbers in your array, and the value could be a wrapper containing the number of elements which this number is greater and less than. Since hashmaps have O(1) lookup time, this would satisfy your requirement.
If you need O(1) lookup time, only a hashmap comes to mind as an option. Note that traversing a binary tree, even if balanced, would still be a lg(N) operation in general. This is potentially quite fast, but still not constant.
The only way to decrease time complexity beyond this is to increase the space complexity.
If the elements of the array are limited to a range, let's say [-R1, R2], then you can build a hashmap over this range, pointing to a linked list. You can precompute this hashmap, and then return results in O(1).
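One way to realize the bounded-range precomputation is a plain lookup table of suffix counts rather than a hashmap of linked lists. That substitution, and the names `build_greater_table`, `lo` and `hi`, are assumptions of this sketch. Building the table costs O(n + R) time and O(R) space for a range of size R; each query is then a single index operation.

```python
def build_greater_table(values, lo, hi):
    """Return a table t with t[x - lo] == number of values strictly greater than x.

    Assumes every value and every query x is an integer in [lo, hi].
    """
    size = hi - lo + 1
    counts = [0] * size
    for v in values:
        counts[v - lo] += 1

    greater = [0] * size
    running = 0                       # how many values are > current x
    for i in range(size - 1, -1, -1):
        greater[i] = running
        running += counts[i]
    return greater


A = [10, 9, 6, 11, 22]
table = build_greater_table(A, lo=0, hi=25)
print(table[11 - 0])  # 1  (only 22 is greater than 11)
```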

Algorithm to sort a list in Θ(n) time

The problem is to sort a list containing n distinct integers that range in value from 1 to kn inclusive where k is a fixed positive integer. Design an algorithm to solve the problem in Θ(n) time.
I don't just want an answer. An explanation would help, or if someone could get me pointed in the right direction.
I know that Θ(n) time means the algorithm time is directly proportional to the number of elements. Not sure where to go from there.
Easy for fixed k: Create an array of kn counters. Set them all to zero. Iterate through the array, increasing the counter i by one if an array element equals i. Use the array of counters to re-create the sorted array.
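The counter-array idea can be sketched as follows, assuming the input is a Python list of integers in 1..kn. The name `counting_sort_range` is made up for this example. Time and extra space are both O(kn), which is Θ(n) for fixed k.

```python
def counting_sort_range(values, k):
    """Sort integers known to lie in 1..k*n, where n = len(values)."""
    n = len(values)
    counts = [0] * (k * n + 1)        # index 0 is unused
    for v in values:
        counts[v] += 1

    result = []
    for i in range(1, k * n + 1):
        result.extend([i] * counts[i])   # re-create the sorted array
    return result


print(counting_sort_range([7, 3, 10, 1, 5], k=2))  # [1, 3, 5, 7, 10]
```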
Obviously this is inefficient if k > log n.
The key is that the integers only range from 1 to kn, so their length is limited. This is a little tricky:
The common assumption when we say that a sorting algorithm is O(N) is that the number N fits into a constant number of machine words so that we can do math on numbers of that size in constant time. Following this assumption, kN also fits into a constant number of machine words, since k is a fixed positive integer. Your input is therefore O(N) words long, and each word is a fixed number of bits, so your input is O(N) bits long.
Therefore, any algorithm that takes time proportional to the number of bits in the input is considered O(N).
There are actually lots of choices, but when this particular question is asked in this particular way, the person asking usually wants you to come up with a radix sort:
https://en.wikipedia.org/wiki/Radix_sort
The MSB-first radix sort just partitions the integers into 2^W buckets according to the values of their top W bits, and then partitions each bucket according to the next W bits, etc., until all the bits are processed.
The time taken for this is O(N*(word_size/W)), but as we said the word size is constant, and W is constant, so this is O(N).
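A rough sketch of the MSB-first scheme just described, for non-negative integers. The parameters `bits` (assumed word size) and `w` (bits consumed per level) are illustrative choices, with `bits` assumed to be a multiple of `w`. This version uses auxiliary bucket lists rather than partitioning in place.

```python
def msd_radix_sort(values, bits=32, w=8):
    """MSB-first radix sort of non-negative integers that fit in `bits` bits."""
    mask = (1 << w) - 1

    def sort_bucket(items, shift):
        # Nothing left to split on, or nothing to sort.
        if len(items) <= 1 or shift < 0:
            return items
        buckets = [[] for _ in range(1 << w)]
        for v in items:
            buckets[(v >> shift) & mask].append(v)   # look at the next w bits
        out = []
        for bucket in buckets:
            out.extend(sort_bucket(bucket, shift - w))
        return out

    return sort_bucket(list(values), bits - w)


print(msd_radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```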

Finding a specific ratio in an unsorted array. Time complexity

This is a homework assignment.
The goal is to present an algorithm in pseudocode that will search an array of numbers (it doesn't specify whether they are integers or >0) and check if the ratio of any two numbers equals a given x. Time complexity must be under O(nlogn).
My idea was to mergesort the array (O(nlogn) time) and then, if |x| > 1, start checking every number in descending order (using a binary traversal algorithm). The check should also take O(logn) time for each number, which with a worst case of n checks gives a total of O(nlogn). If I am not missing anything, this should give us a worst case of O(nlogn) + O(nlogn) = O(nlogn), within the parameters of the assignment.
I realize that it doesn't really matter where I start checking the ratios after sorting, but the time cost is amortized by 1/2.
Is my logic correct? Is there a faster algorithm?
An example in case it isn't clear:
Given an array { 4, 9, 2, 1, 8, 6 }
If we want to search for a ratio of 2:
1. Mergesort { 9, 8, 6, 4, 2, 1 }
2. Since the given ratio is >1 we will search from left to right.
2a. First number is 9. Checking 9 / 4 > 2. Checking 9/6 < 2 Next Number.
2b. Second number is 8. Checking 8 / 4 = 2. DONE
The analysis you have presented is correct and is a perfectly good way to solve this problem. Sorting does work in time O(n log n), and 2n binary searches also take O(n log n) time. That said, I don't think you want to use the term "amortized" here, since that refers to a different type of analysis.
As a hint for how to speed up your solution a bit, the general idea of your solution is to make it possible to efficiently query, for any number, whether that number exists in the array. That way, you can just loop over all numbers and look for anything that would make the ratio work. However, if you use an auxiliary data structure outside the array that supports fast access, you can possibly whittle down your runtime at the cost of increasing the memory usage. Try thinking about what data structures support very fast access (say, O(1) lookups) and see if you can use any of them here.
Hope this helps!
To solve this problem, O(n lg n) is enough.
Step 1: sort the array. That costs O(n lg n).
Step 2: check whether the ratio exists. This step only needs O(n).
You just need two pointers: one points to the first element (the smallest one), the other points to the last element (the biggest one).
Calculate the ratio.
If the ratio is bigger than the specified one, move the second pointer to its previous element.
If the ratio is smaller than the specified one, move the first pointer to its next element.
Repeat the above steps until:
you find the exact ratio, or
either the first pointer reaches the end, or the second pointer reaches the beginning.
The complexity of your algorithm is O(n²), because after sorting the array, you iterate over each element (up to n times) and in each iteration you execute up to n - 1 divisions.
Instead, after sorting the array, iterate over each element, and in each iteration divide the element by the ratio, then see if the result is contained in the array:
division: O(1)
search in sorted list: O(log n)
repeat for each element: n times
Results in time complexity O(n log n)
In your example:
9/2 = 4.5 (not found)
8/2 = 4 (found)
(1) Build a hashmap of this array. Time Cost: O(n)
(2) For every element a[i], search a[i]*x in HashMap. Time Cost: O(n).
Total Cost: O(n)
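A sketch of the two hash-based steps above, under some assumptions the answer does not spell out: the values and x are integers (so `a * x` is exact), the two numbers must sit at different positions, and a zero cannot be the denominator. `has_ratio` is a hypothetical name, and the O(n) bound is the expected cost of hashing, not a worst-case guarantee.

```python
from collections import Counter


def has_ratio(values, x):
    """True if some b / a == x with a and b at different positions and a != 0."""
    counts = Counter(values)          # step (1): build the hash map
    for a in values:                  # step (2): look up a * x for every a
        if a == 0:
            continue                  # 0 cannot be the denominator
        target = a * x
        if target == a:
            if counts[a] >= 2:        # x == 1 needs a second copy of a
                return True
        elif counts[target] > 0:
            return True
    return False


print(has_ratio([4, 9, 2, 1, 8, 6], 2))  # True (8 / 4, among others)
```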

Prove that the running time of quick sort after modification = O(Nk)

This is a homework question, and I'm not that good at finding the complexity, but I'm trying my best!
Three-way partitioning is a modification of quicksort that partitions elements into groups smaller than, equal to, and larger than the pivot. Only the groups of smaller and larger elements need to be recursively sorted. Show that if there are N items but only k unique values (in other words there are many duplicates), then the running time of this modification to quicksort is O(Nk).
my try:
on the average case:
the three subroutines will be at these indices:
I assume that the subroutine that have duplicated items will equal (n-k)
first: from 0 - to(i-1)
Second: i - (i+(n-k-1))
third: (i+n-k) - (n-1)
number of comparisons = (n-k)-1
So,
T(n) = (n-k)-1 + Sigma from 0 until (n-k-1) [ T(i) + T (i-k)]
then I'm not sure how I'm gonna continue :S
It might be a very bad start though :$
Hope to find a help
First of all, you shouldn't look at the average case since the upper bound of O(nk) can be proved for the worst case, which is a stronger statement.
You should look at the maximum possible depth of recursion. In normal quicksort, the maximum depth is n. For each level, the total number of operations done is O(n), which gives O(n^2) total in the worst case.
Here, it's not hard to prove that the maximum possible depth is k (since one unique value will be removed at each level), which leads to O(nk) total.
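The depth argument can be checked empirically with a small sketch. This is a list-based three-way quicksort, not the in-place partitioning the homework probably has in mind, and `quicksort3` and the `stats` dictionary are illustrative names. Every non-empty call removes one unique value (the pivot and all its copies), so the deepest chain of non-empty calls cannot exceed k.

```python
def quicksort3(items, depth=1, stats=None):
    """Three-way quicksort; records the deepest non-empty call in stats['max_depth']."""
    if not items:
        return []
    if stats is not None:
        stats["max_depth"] = max(stats.get("max_depth", 0), depth)

    pivot = items[0]
    smaller = [v for v in items if v < pivot]   # O(len(items)) work per call
    equal = [v for v in items if v == pivot]    # never recursed into again
    larger = [v for v in items if v > pivot]
    return (quicksort3(smaller, depth + 1, stats)
            + equal
            + quicksort3(larger, depth + 1, stats))


data = [3, 1, 2, 3, 1, 2, 3, 3, 1] * 10   # N = 90 items, k = 3 unique values
stats = {}
result = quicksort3(data, stats=stats)
print(result == sorted(data), stats["max_depth"])  # True 3
```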
I don't have a formal education in complexity. But if you think about it as a mathematical problem, you can prove it as a mathematical proof.
For all sorting algorithms, the best case scenario will always be O(n) for n elements, because to sort n elements you have to consider each one at least once. Now, for your particular optimisation of quicksort, what you have done is simplify the issue, because now you are only sorting unique values: all the values that are the same as the pivot are already considered sorted, and by virtue of its nature, quicksort will guarantee that every unique value will feature as the pivot at some point in the operation, so this eliminates duplicates.
This means for an N size list, quicksort must perform some operation N times (once for every position in the list), and because it is trying to sort the list, that operation is trying to find the position of that value in the list, but because you are effectively dealing with just unique values, and there are k of those, the quicksort algorithm must perform k comparisons for each element. So it performs Nk operations for an N sized list with k unique elements.
To summarise:
This algorithm eliminates checking against duplicate values.
But all sorting algorithms must look at every value in the list at least once. N operations
For every value in the list the operation is to find its position relative to other values in the list.
Because duplicates get removed, this leaves only k values to check against.
O(Nk)

Finding number of pairs of integers differing by a value

If we have an array of integers, then is there any efficient way other than O(n^2) by which one can find the number of pairs of integers which differ by a given value?
E.g for the array 4,2,6,7 the number of pairs of integers differing by 2 is 2 {(2,4),(4,6)}.
Thanks.
Create a set from your list. Create another set which has all the elements incremented by the delta. Intersect the two sets. These are the upper values of your pairs.
In Python:
>>> s = [4,2,6,7]
>>> d = 2
>>> s0 = set(s)
>>> sd = set(x+d for x in s0)
>>> sorted((x-d, x) for x in (s0 & sd))
[(2, 4), (4, 6)]
Creating the sets is O(n). Intersecting the sets is also O(n), so this is a linear-time algorithm.
Store the elements in a multiset, implemented by a hash table. Then for each element n, check the number of occurrences of n-2 in the multiset and sum them up. There is no need to check n+2 because that would cause you to count each pair twice.
The time efficiency is O(n) in the average case, and O(n*logn) or O(n^2) in the worst case (depending on the hash table implementation). It will be O(n*logn) if the multiset is implemented by a balanced tree.
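A sketch of the multiset approach, using `collections.Counter` as the hash-based multiset. The function name `count_pairs` and the separate handling of a zero difference are additions for this example, not part of the answer above. Pairs are counted by position, so duplicates contribute multiple pairs.

```python
from collections import Counter


def count_pairs(values, d):
    """Number of index pairs whose values differ by exactly d (d >= 0)."""
    counts = Counter(values)
    if d == 0:
        # Pairs of equal values: choose 2 out of each group of copies.
        return sum(c * (c - 1) // 2 for c in counts.values())
    # For each element v, every copy of v - d forms one pair with it.
    return sum(counts[v - d] for v in values)


print(count_pairs([4, 2, 6, 7], 2))  # 2
```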
Sort the array, then scan through with two pointers. Supposing the first one points to a, then step the second one forward until you've found where a+2 would be if it was present. Increment the total if it's there. Then increment the first pointer and repeat. At each step, the second pointer starts from the place it ended up on the previous step.
If duplicates are allowed in the array, then you need to remember how many duplicates the second one stepped over, so that you can add this number to the total if incrementing the first pointer yields the same integer again.
This is O(n log n) worst case (for the sort), since the scan is linear time.
It's O(n) worst case on the same basis that hashtable-based solutions for fixed-width integers can say that they're expected O(n) time, since sorting fixed-width integers can be done using radix sort in O(n). Which is actually faster is another matter -- hashtables are fast but might involve a lot of memory allocation (for nodes) and/or badly-localized memory access, depending on implementation.
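One possible way to write the sort-and-scan idea is below. It is a variant, not a literal transcription: instead of remembering how many duplicates the second pointer stepped over, it keeps two forward-only pointers that bracket the run of values equal to a + d. It assumes d > 0, and `count_pairs_sorted` is a hypothetical name.

```python
def count_pairs_sorted(values, d):
    """Count index pairs differing by d > 0 using a sort plus a linear scan."""
    a = sorted(values)
    n = len(a)
    total = 0
    lo = 0   # first index with a[lo] >= a[i] + d
    hi = 0   # first index with a[hi] >  a[i] + d
    for i in range(n):
        target = a[i] + d
        while lo < n and a[lo] < target:
            lo += 1
        while hi < n and a[hi] <= target:
            hi += 1
        total += hi - lo      # number of copies of target in the array
    return total


print(count_pairs_sorted([4, 2, 6, 7], 2))  # 2
```

Because the array is sorted, the targets never decrease, so both pointers only move forward and the scan after sorting is linear.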
Note that if the desired difference is 0 and all the elements in the array are identical, then the size of the output is O(n²), so the worst-case of any algorithm is necessarily O(n²). (On the other hand, average-case or expected-case behavior can be significantly better, as others have noted.)
Just hash the numbers into an array as you do in counting sort. Then take two variables, the first pointing to index 0 and the other pointing to index 2 (or index d in the general case) initially.
Now check whether the values at both indices are non-zero. If yes, increment the counter by the larger of the two values; otherwise leave the counter unchanged, as the pair does not exist. Now increment both indices and continue until the second index reaches the end of the array. The total value of the counter is the number of pairs with difference d.
Time complexity: O(n)
Space complexity: O(n)
