Fastest unconditional sort algorithm

I have a function which takes two elements and returns them in ascending order:
void Sort2(int &a, int &b) {
    if (a < b) return;
    int t = a;
    a = b;
    b = t;
}
What is the fastest way to sort an array with N entries using this function, if I am not allowed to use any extra conditional operators?
That means that my whole program should look like this:
int main() {
    int a[N];
    // fill the array a
    const int NS = ...;                          // number of comparisons, depending on N
    const int c[NS][2] = { {0,1}, {0,2}, ... };  // sequence of index pairs, generated depending on N
    for (int i = 0; i < NS; i++) {
        Sort2(a[c[i][0]], a[c[i][1]]);
    }
    // sort is finished
    return 1;
}
Most of the fast sort algorithms use conditions to decide what to do. There is bubble sort, of course, but it takes M = N(N-1)/2 comparisons. That is not optimal; for instance, with N = 4 it takes M = 6 comparisons, whereas 4 entries can be sorted with 5:
Sort2(a[0],a[1]);
Sort2(a[2],a[3]);
Sort2(a[1],a[3]);
Sort2(a[0],a[2]);
Sort2(a[1],a[2]);

The standard approach is known as bitonic mergesort. It is highly efficient when parallelized, and only slightly less efficient than conventional algorithms when not parallelized. Bitonic mergesort is a special case of a wider class of algorithms known as "sorting networks"; it is unusual among sorting networks in that some of its reorderings are in the reverse order of the desired sort (though everything is in the correct order once the algorithm completes). You can do that with your Sort2 by passing a higher array slot as the first argument and a lower one as the second.
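To make this concrete, here is a small Python sketch (my own illustration, not part of the original answer) that generates the comparator pairs of a bitonic sorting network for a power-of-two N. A "descending" comparator is expressed simply by emitting the higher index first, so that Sort2 puts the larger value into the lower slot:
def bitonic_pairs(n):
    # Comparator pairs (i, j): applying Sort2(a[i], a[j]) to them in order
    # sorts any array of length n ascending; n must be a power of two.
    pairs = []
    def compare(i, j, ascending):
        # For a descending comparator, emit the pair with the higher index first.
        pairs.append((i, j) if ascending else (j, i))
    def merge(lo, cnt, ascending):
        if cnt > 1:
            k = cnt // 2
            for i in range(lo, lo + k):
                compare(i, i + k, ascending)
            merge(lo, k, ascending)
            merge(lo + k, k, ascending)
    def sort(lo, cnt, ascending):
        if cnt > 1:
            k = cnt // 2
            sort(lo, k, True)            # first half ascending
            sort(lo + k, k, False)       # second half descending
            merge(lo, cnt, ascending)    # merge the resulting bitonic sequence
    sort(0, n, True)
    return pairs
def sort2(a, i, j):                      # same contract as Sort2 above
    if a[j] < a[i]:
        a[i], a[j] = a[j], a[i]
a = [5, 3, 8, 1, 9, 2, 7, 4]
for i, j in bitonic_pairs(len(a)):
    sort2(a, i, j)
print(a)                                 # [1, 2, 3, 4, 5, 7, 8, 9]
Feeding the generated pairs into Sort2 one after another is exactly the kind of straight-line, condition-free program asked for in the question.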

For N a power of 2 you can generalize the approach you used, by using a "merge-sortish" kind of approach: you sort the first half and the last half separately, and then merge these using a few comparisons.
For instance, consider an array of size 8. And assume that the first half is sorted and the last half is sorted (by applying this same approach recursively):
A B C D P Q R S
In the first round, you do a comparison of 1 vs 1, 2 vs 2, etc:
---------
| |
| ---------
| | | |
A B C D P Q R S
| | | |
| ---------
| |
---------
After this round, the first and the last element are in the right place, so you need to repeat the process for the inner 6 elements (I keep the names of the elements the same, because it is unknown where they end up):
-------
| |
| -------
| | | |
A B C D P Q R S
| |
-------
In the next round, the inner 4 elements are compared, and in the last round the inner 2.
Let f(n) be the number of comparisons needed to sort an array of length n (where n is a power of 2, for the moment). Clearly, an array consisting of 1 element is sorted already:
f(1) = 0
For a longer array, you first need to sort both halves, and then perform the procedure described above. For n=8, that took 4+3+2+1 = (n/2)(n/2+1)/2 comparisons. Hence in general:
f(n) = 2 f(n/2) + (n/2)(n/2+1)/2
Note that for n=4, this indeed gives:
f(4) = 2 f(2) + 2*3/2
= 2 * (2 f(1) + 1*2/2) + 3
= 5
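As a quick check of this recurrence, here is a Python sketch (my own, restricted to the power-of-two case described so far) that builds the network, verifies it on every permutation of 8 elements, and confirms the comparison count f(8) = 2*f(4) + 10 = 20:
from itertools import permutations
def sort2(a, i, j, count):
    count.append((i, j))                 # record the comparison
    if a[j] < a[i]:
        a[i], a[j] = a[j], a[i]
def merge(a, start, length, count):
    # Precondition: both halves of a[start:start+length] are sorted.
    if length > 0:
        half = length // 2
        for i in range(half):
            sort2(a, start + i, start + i + half, count)
        merge(a, start + 1, length - 2, count)   # first/last are now final
def net_sort(a, start, length, count):
    if length > 1:
        half = length // 2
        net_sort(a, start, half, count)
        net_sort(a, start + half, half, count)
        merge(a, start, length, count)
for p in permutations(range(8)):
    a, count = list(p), []
    net_sort(a, 0, 8, count)
    assert a == sorted(p) and len(count) == 20   # f(8) = 2*f(4) + 10 = 20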
To handle values of n that are not a power of 2, the important thing is to do the merging step on an odd-length array. The simplest strategy seems to be to compare the smallest element of both subarrays (which yields the smallest overall element) and then just continue on the rest of the array (which now has even length).
If we write g(k) = k(k+1)/2, we can now have a short way of writing the recursive formula (I use 2k and 2k+1 to distinguish even and odd):
f(1) = 0
f(2k) = 2 f(k) + g(k)
f(2k+1) = f(k+1) + f(k) + 1 + g(k)
Some pseudocode on how to approach this:
function sort(A, start, length) {
    if (length == 1) {
        // do nothing
    } else if (length is even) {
        sort(A, start, length/2)
        sort(A, start + length/2, length/2)
        merge(A, start, length)
    } else if (length is odd) {
        sort(A, start, length/2 + 1)
        sort(A, start + length/2 + 1, length/2)
        Sort2(A[start], A[start + length/2 + 1])
        merge(A, start + 1, length - 1)
    }
}
function merge(A, start, length) {
    if (length > 0) {
        for (i = 0; i < length/2; i++)
            Sort2(A[start + i], A[start + i + length/2])
        merge(A, start + 1, length - 2)
    }
}
And you would run this on your array by
sort(A, 0, A.length)

Related

Could anyone tell me a better solution for this problem? I could only think of a brute-force way, which is O(n^2)

Recently I was attempting the following problem:
Given an array of integers, arr.
Find sum of floor of (arr[i]/arr[j]) for all pairs of indices (i,j).
e.g.
Input: arr[]={1,2,3,4,5}
Output: Sum=27.
Explanation:
(1/1)+(1/5)+(1/4)+(1/2)+(1/3) = 1+0+0+0+0 = 1
(5/1)+(5/5)+(5/4)+(5/2)+(5/3) = 5+1+1+2+1 = 10
(4/1)+(4/5)+(4/4)+(4/2)+(4/3) = 4+0+1+2+1 = 8
(2/1)+(2/5)+(2/4)+(2/2)+(2/3) = 2+0+0+1+0 = 3
(3/1)+(3/5)+(3/4)+(3/2)+(3/3) = 3+0+0+1+1 = 5
I could only think of a naive O(n^2) solution. Is there any better approach?
Thanks in advance.
A possibility resides in "quickly" skipping runs of elements that are the same integer multiple of a given element (after rounding down), i.e. that give the same quotient.
For the given example, the vertical bars below delimit runs of equal ratios (the lower triangle is all zeroes and ignored; I show the elements on the left and the ratios on the right):
1 -> 2 | 3 | 4 | 5 ≡ 2 | 3 | 4 | 5
2 -> 3 | 4 5 ≡ 1 | 2 2
3 -> 4 5 ≡ 1 1
4 -> 5 ≡ 1
For bigger arrays, the constant runs can be longer.
So the algorithm principle is
sort all elements increasingly;
process the elements from smallest to largest;
for a given element, find the index of the first double and count the number of skipped elements;
from there, find the index of the first triple and count twice the number of skipped elements;
continue with higher multiples until you exhaust the tail of the array.
A critical operation is to "find the next multiple". This should be done by an exponential search followed by a dichotomic search, so that the number of operations remains logarithmic in the number of elements to skip (a pure dichotomic search would be logarithmic in the total number of remaining elements). Hence the cost of a search will be proportional to the sum of the logarithms of the distances between the multiples.
Hopefully, this sum will be smaller than the sum of the distances themselves, though in the worst case the complexity remains O(N). In the best case, O(log(N)).
A global analysis is difficult and in theory the worst-case complexity remains O(N²); but in practice it could go down to O(N log N), because the worst case would require that the elements grow faster than a geometric progression of common ratio 2.
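As a rough Python sketch of the idea (my own illustration; it uses a plain binary search over the sorted array for each multiple rather than the exponential-then-dichotomic search described above, and it assumes all values are positive integers):
from bisect import bisect_left
from collections import Counter
def floor_pair_sum(arr):
    # Sum of floor(arr[i] / arr[j]) over all ordered pairs (i, j),
    # assuming all values are positive integers.
    a = sorted(arr)
    counts = Counter(a)                  # group repeated divisors together
    top = a[-1]
    total = 0
    for v, c in counts.items():
        k = 1
        while k * v <= top:
            # every element in [k*v, (k+1)*v) contributes a quotient of k
            in_band = bisect_left(a, (k + 1) * v) - bisect_left(a, k * v)
            total += k * in_band * c
            k += 1
    return total
print(floor_pair_sum([1, 2, 3, 4, 5]))   # 27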
Addendum:
If the array contains numerous repeated values, it can be beneficial to compress it by storing a repetition count and a single instance of every value. This can be done after sorting.
// Requires using System; and using System.Linq; (for arr.Max()).
int[] arr = { 1, 2, 3, 4, 5 };
int result = 0;
int BuffSize = arr.Max() * 2;
// b[v] = number of array elements that are >= v (suffix counts).
int[] b = new int[BuffSize + 1];
// count[v] = number of occurrences of the value v in the array.
int[] count = new int[BuffSize];
for (int i = 0; i < arr.Length; ++i)
    count[arr[i]]++;
for (int i = BuffSize - 1; i >= 1; i--)
{
    b[i] = b[i + 1] + count[i];
}
for (int i = 1; i < BuffSize; i++)
{
    if (count[i] == 0)
    {
        continue;
    }
    // Every element in the band [j, j + i) divided by i gives the same
    // quotient mul, so a whole band can be accounted for at once.
    for (int j = i, mul = 1; j < BuffSize; j += i, mul++)
    {
        result += (b[j] - b[Math.Min(BuffSize - 1, j + i)]) * mul * count[i];
    }
}
This code takes advantage of knowing the difference between each successive value ahead of time, and only processes the remaining portion of the array rather than redundantly processing the entire thing n^2 times.
I believe it has a worst-case runtime of O(n*sqrt(n)*log(n)).

Time complexity of an algorithm with two nested loops

Given this algorithm :
m = 1
while(a>m*b){
m = m*2
}
while(a>=b){
while(a>=m*b){
a = a-m*b
}
m=m/2
}
My question : What is the time complexity of this algorithm ?
What I have done: I have to find the number of instructions. I found that the first while performs approximately m = log_2(a/b) iterations. For the inner while of the second part of the algorithm, I found this pattern: a_i = a - i*m*b, where i is the number of iterations, so the inner while performs about a/(m*b) iterations.
But I don't know how to calculate the outer loop now, because its condition depends on what the inner while has done to a.
Let's begin by "normalizing" the function in the same way as in your previous question, noting that once again all changes in a and stopping conditions are proportional to b:
n = a/b
// 1)
m = 1
while(n>m){
m = m*2
}
// 2)
while(n>=1){
while(n>=m){
n = n-m
}
m=m/2
}
Unfortunately, this is where the similarity ends...
Snippet 1)
Note that m can be written as an integer power of 2, since it doubles every loop:
i = 0
while (n > pow(2, i)) {
i++
}
// m = pow(2, i)
From the stopping condition: the loop exits at the first i for which 2^i >= n, i.e. after i = ceil(log2(n)) iterations, so snippet 1) is O(log n).
Snippet 2)
Here m decreases in the exact opposite way to 1), so it can again be written as a power of 2:
// using i from the end of 1)
while (n>=1) {
k = pow(2, i)
while (n >= k) {
n = n - k
}
i--
}
The inner loop is simpler than the inner loop from your previous question, because m does not change inside it. It is easy to deduce the number of times c it executes, and the value of n at the end:
c = floor(n / k),   and afterwards   n = n - c*k = n mod k
This is the exact definition of the modulus operator % in the "C-family" of languages:
while (n>=1) {
k = pow(2, i)
n = n % k // time complexity O(n / k) here instead of O(1)
i--
}
Note that, because consecutive values of k only differ by a factor of 2, at no point will the value of n be greater than or equal to 2k; this means that the inner loop executes at most once per outer loop. Therefore the outer loop executes at most i times.
Both the first and second loops are O(log n), which means the total time complexity is O(log n) = O(log [a/b]).
Update: numerical tests in Javascript as before.
function T(n)
{
let t = 0;
let m = 1;
while (n > m) {
m *= 2; t++;
}
while (n >= 1) {
while (n >= m) {
n -= m; t++;
}
m/=2;
}
return t;
}
Plotting T(n) against log(n) shows a nice straight line:
Edit: a more thorough explanation of snippet 2).
At the end of snippet 1), the value of i = ceil(log2(n)) represents the number of significant bits in the binary representation of the integer ceil(n).
Computing the modulus of an integer with a positive power-of-2 2^i is equivalent to discarding all but the first i bits. For example:
n     = ...00011111111 (binary)
m     = ...00000100000 (= 2^5)
n % m = ...00000011111
                 ----- (5 least significant bits)
The operation of snippet 2) is therefore equivalent to removing the most significant bit of n, one at a time, until only zero is left. For example:
outer loop no | n
----------------------------
1 | ...110101101
| ^
2 | ...010101101
| ^
3 | ...000101101
| ^
4 | ...000001101
| ^
: | :
: | :
i (=9) | ...000000001
| ^
----------------------------
final | 000000000
When the current most significant bit (pointed to by ^) is:
0: the inner loop does not execute, because the value of n is already smaller than k = 2^i (the place value of the bit marked ^).
1: the inner loop executes exactly once, because n is greater than or equal to k but less than 2k (which corresponds to the bit above the current position ^).
Hence the "worst" case occurs when all significant bits of n are 1, in which case the inner loop always executes once per outer iteration.
Regardless, the outer loop executes ceil(log2(n)) times for any value of n.
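As a small extra sanity check (my own, restricted to integer n): the total work splits into ceil(log2(n)) doublings in snippet 1) plus one subtraction per set bit of n in snippet 2), which is another way of seeing that the whole thing is O(log n). In Python:
def T(n):
    # mirrors the JavaScript test function above
    t, m = 0, 1
    while n > m:
        m *= 2
        t += 1
    while n >= 1:
        while n >= m:
            n -= m
            t += 1
        m /= 2
    return t
for n in range(1, 1000):
    doublings = (n - 1).bit_length()     # equals ceil(log2(n)) for n >= 1
    assert T(n) == doublings + bin(n).count("1")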

nth smallest element in a union of an array of intervals with repetition

I want to know if there is a more efficient solution than what I came up with(not coded it yet but described the gist of it at the bottom).
Write a function calcNthSmallest(n, intervals) which takes as input a non-negative int n, and a list of intervals [[a_1; b_1]; ...; [a_m; b_m]] and calculates the nth smallest number (0-indexed) when taking the union of all the intervals with repetition. For example, if the intervals were [1; 5]; [2; 4]; [7; 9], their union with repetition would be [1; 2; 2; 3; 3; 4; 4; 5; 7; 8; 9] (note 2, 3, 4 each appear twice since they're in both the intervals [1; 5] and [2; 4]). For this list of intervals, the 0th smallest number would be 1, and the 3rd and 4th smallest would both be 3. Your implementation should run quickly even when the a_i, b_i can be very large (like, one trillion), and there are several intervals.
The way I thought to go about it is the straightforward solution which is to make the union array and traverse it.
This problem can be solved in O(N log N) where N is the number of intervals in the list, regardless of the actual values of the interval endpoints.
The key to solving this problem efficiently is to transform the list of possibly-overlapping intervals into a list of intervals which are either disjoint or identical. In the given example, only the first interval needs to be split:
{ [1,5], [2,4], [7,9] }  =>
  +-----------------+   +---+   +---+
 {[1,1], [2,4], [5,5],  [2,4],  [7,9]}
(This doesn't have to be done explicitly, though: see below.) Now, we can sort the new intervals, replacing duplicates with a count. From that, we can compute the number of values each (possibly-duplicated) interval represents. Now, we simply need to accumulate the values to figure out which interval the solution lies in:
interval   count   size   values in    cumulative
                          interval     values
[1,1]      1       1      1            [0, 1)
[2,4]      2       3      6            [1, 7)    (eg. from n=1 to n=6 will be here)
[5,5]      1       1      1            [7, 8)
[7,9]      1       3      3            [8, 11)
I wrote the cumulative values as a list of half-open intervals, but obviously we only need the end-points. We can then find which interval holds value n by, for example, binary-searching the cumulative values list, and we can figure out which value in the interval we want by subtracting the start of the interval from n and then integer-dividing by the count.
It should be clear that the maximum size of the above table is twice the number of original intervals, because every row must start and end at either the start or end of some interval in the original list. If we'd written the intervals as half-open instead of closed, this would be even clearer; in that case, we can assert that the precise size of the table will be the number of unique values in the collection of end-points. And from that insight, we can see that we don't really need the table at all; we just need the sorted list of end-points (although we need to know which endpoint each value represents). We can simply iterate through that list, maintaining the count of the number of active intervals, until we reach the value we're looking for.
Here's a quick python implementation. It could be improved.
def combineIntervals(intervals):
    # endpoints will map each endpoint to a count
    endpoints = {}
    # These two lists represent the start and (1+end) of each interval
    # Each start adds 1 to the count, and each limit subtracts 1
    for start in (i[0] for i in intervals):
        endpoints[start] = endpoints.setdefault(start, 0) + 1
    for limit in (i[1]+1 for i in intervals):
        endpoints[limit] = endpoints.setdefault(limit, 0) - 1
    # Filtering is a possibly premature optimization but it was easy
    return sorted(filter(lambda kv: kv[1] != 0,
                         endpoints.items()))
def nthSmallestInIntervalList(n, intervals):
    limits = combineIntervals(intervals)
    cumulative = 0
    count = 0
    index = 0
    here = limits[0][0]
    while index < len(limits):
        size = limits[index][0] - here
        if n < cumulative + count * size:
            # [here, next) contains the value we're searching for
            return here + (n - cumulative) // count
        # advance
        cumulative += count * size
        count += limits[index][1]
        here += size
        index += 1
    # We didn't find it. We could throw an error
So, as I said, the running time of this algorithm is independent of the actual values of the intervals; it only depends on the length of the interval list. This particular solution is O(N log N) because of the cost of the sort (in combineIntervals); if we used a priority queue instead of a full sort, we could construct the heap in O(N), but the scan would then cost O(log N) per scanned endpoint. Unless N is really big and the expected value of the argument n is relatively small, this would be counter-productive. There might be other ways to reduce complexity, though.
Edit2:
Here's yet another take on your question.
Let's consider the intervals graphically:
          1  1   1 2  2  2  3
0-2-4--7--0--3---7-0--4--7--0
  [-------]
    [-----------------]
       [---------]
             [--------------]
                   [-----]
When sorted in increasing order on the lower bound, we could get something that looks like the above for the interval list ([2;10];[4;24];[7;17];[13;30];[20;27]). Each lower bound indicates the start of a new interval, and also marks the beginning of one more "level" of duplication of the numbers. Conversely, upper bounds mark the end of that level, and decrease the duplication level by one.
We could therefore convert the above into the following list:
[2;+];[4;+];[7;+][10;-];[13;+];[17;-][20;+];[24;-];[27;-];[30;-]
Where the first value indicates the rank of the bound, and the second value whether the bound is a lower (+) or upper (-) one. The computation of the nth element is done by following the list, raising or lowering the duplication level when encountering a lower or upper bound, and using the duplication level as a counting factor.
Let's consider again the list graphically, but as an histogram:
       3333  44444 5555
    2222222333333344444555
  111111111222222222222444444
          1  1   1 2  2  2  3
0-2-4--7--0--3---7-0--4--7--0
The view above is the same as the first one, with all the intervals packed vertically:
1 being the elements of the 1st interval, 2 those of the second one, etc. In fact, what matters here
is the height at each index, corresponding to the number of times each index is duplicated in the union of all intervals.
       3333  55555 7777
    2223333445555567777888
  112223333445555567777888999
          1  1   1 2  2  2  3
0-2-4--7--0--3---7-0--4--7--0
  | |  |   | |    ||   |  | |
We can see that histogram blocks start at lower bounds of intervals, and end either on upper bounds, or one unit before lower bounds, so the new notation must be modified accordingly.
With a list containing n intervals, as a first step, we convert the list into the notation above (O(n)), and sort it in increasing bound order (O(nlog(n))). The second step of computing the number is then in O(n), for a total average time in O(nlog(n)).
Here's a simple implementation in OCaml, using 1 and -1 instead of '+' and '-'.
(* transform the list in the correct notation *)
let rec convert = function
[] -> []
| (l,u)::xs -> (l,1)::(u+1,-1)::convert xs;;
(* the counting function *)
let rec count r f = function
[] -> raise Not_found
| [a,x] -> (match f + x with
0 -> if r = 0 then a else raise Not_found
| _ -> a + (r / f))
| (a,x)::(b,y)::l ->
if a = b
then count r f ((b,x+y)::l)
else
let f = f + x in
if f > 0 then
let range = (b - a) * f in
if range > r
then a + (r / f)
else count (r - range) f ((b,y)::l)
else count r f ((b,y)::l);;
(* the compute function *)
let compute l =
let compare (x,_) (y,_) = compare x y in
let l = List.sort compare (convert l) in
fun m -> count m 0 l;;
Notes:
- the function above will raise an exception if the sought number is above the intervals. This corner case isn't taken into account by the other methods below.
- the list sorting function used in OCaml is merge sort, which effectively performs in O(nlog(n)).
Edit:
Seeing that you might have very large intervals, the solution I gave initially (see down below) is far from optimal.
Instead, we could make things much faster by transforming the list:
we try to compress the interval list by searching for overlapping intervals and replacing them by a prefixing interval, the overlapping part (repeated the appropriate number of times), and a suffixing interval. We can then directly compute the number of entries covered by each element of the list.
Looking at the splitting above (prefix, infix, suffix), we see that the optimal structure to do the processing is a binary tree. A node of that tree may optionally have a prefix and a suffix. So the node must contain :
an interval i in the node
an integer giving the number of repetition of i in the list,
a left subtree of all the intervals below i
a right subtree of all the intervals above i
with this structure in place, the tree is automatically sorted.
Here's an example of an ocaml type embodying that tree.
type tree = Empty | Node of int * interval * tree * tree
Now the transformation algorithm boils down to building the tree.
This function creates a tree out of its components:
let cons k r lt rt =
the tree made of count k, interval r, left tree lt and right tree rt
This function recursively inserts an interval into a tree.
let rec insert i it =
let r = root of it
let lt = the left subtree of it
let rt = the right subtree of it
let k = the count of r
let prf, inf, suf = the prefix, infix and suffix of i according to r
return cons (k+1) inf (insert prf lt) (insert suf rt)
Once the tree is built, we do a pre-order traversal of the tree, using the count of the node to accelerate the computation of the nth element.
Below is my previous answer.
Here are the steps of my solution:
you need to sort the interval list in increasing order on the lower bound of each interval
you need a deque dq (or a list which will be reversed at some point) to store the intervals
here's the code:
let lower i = lower bound of interval i
let upper i = upper bound of i
let il = sort of interval list
i <- 0
j <- lower (head of il)
loop on il:
i <- i + 1
let h = the head of il
let il = the tail of il
if upper h > j then push h to dq
if lower h > j then
il <- concat dq and il
j <- j + 1
dq <- empty
loop
if i = k then return j
loop
This algorithm works by simply iterating through the intervals, only taking in account the relevant intervals, and counting both the rank i of the element in the union, and the value j of that element. When the targeted rank k has been reached, the value is returned.
The complexity is roughly in O(k) + O(sort(l)).
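For what it's worth, the same walk can be sketched in Python (my own rendering of the pseudocode above; like it, the walk visits every value up to the answer, so it is only sensible when the target rank is small):
def nth_smallest_walk(n, intervals):
    ivs = sorted(intervals)              # increasing lower bound
    j = ivs[0][0]                        # current value
    rank = -1                            # rank of the last value counted
    active, idx = [], 0
    while True:
        while idx < len(ivs) and ivs[idx][0] <= j:
            active.append(ivs[idx])      # this interval becomes relevant
            idx += 1
        active = [(lo, hi) for (lo, hi) in active if hi >= j]
        rank += len(active)              # j occurs once per covering interval
        if rank >= n:
            return j
        if not active and idx == len(ivs):
            raise IndexError("n is past the end of the union")
        j += 1
print(nth_smallest_walk(0, [(1, 5), (2, 4), (7, 9)]))   # 1
print(nth_smallest_walk(3, [(1, 5), (2, 4), (7, 9)]))   # 3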
If I have understood your question correctly, you want to find the kth smallest element in the union of a list of intervals.
If we assume that the number of lists = 2, the question is:
Find the kth smallest element in the union of two sorted arrays (where an interval [2,5] is nothing but the elements from 2 to 5, i.e. {2,3,4,5}). This can be solved in (n+m)log(n+m) time, where n and m are the sizes of the lists and i and j are list iterators.
Maintaining the invariant
i + j = k – 1,
If Bj-1 < Ai < Bj, then Ai must be the k-th smallest,
or else if Ai-1 < Bj < Ai, then Bj must be the k-th smallest.
For details click here
Now the problem is: if you have number of lists = 3, then
Maintaining the invariant
i + j + x = k - 1,
i + j = k - x - 1
The value k - x - 1 can take y values (y being the size of the third list, because x iterates from the start of that list to its end).
So the problem for 3 lists can be reduced to y * (problem for 2 lists), and the complexity is y*((n+m)log(n+m)).
If Bj-1 < Ai < Bj, then Ai must be the k-th smallest,
or else if Ai-1 < Bj < Ai, then Bj must be the k-th smallest.
So for a problem of n lists the complexity is NP.
But yes, we can make a minor improvement if we know that k < sizeof(some lists): we can chop off the elements from the (k+1)th element to the end (of our search space) in those lists whose size is bigger than k (I think it doesn't help for large k). If there is any mistake please let me know.
Let me explain with an example:
Assume we are given these intervals [5,12],[3,9],[8,13].
The union of these intervals is:
number : 3 4 5 5 6 6 7 7 8 8 8 9 9 9 10 10 11 11 12 12 13.
indices: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
The lowest function will return 11 when 9 is passed as input.
The highest function will return 14 when 9 is passed as input.
The lowest and highest functions just check whether x is present in an interval; if it is, they add the offset of x within the interval (x - a, where a is the interval's lower bound) to the return value for that particular interval. If an interval lies completely below x, they add the total number of elements in that interval to the return value.
The find function will return 9 when 13 is passed.
The find function uses binary search to find the kth smallest element. In the given range [0,N] (if the range is not given we can find the high end in O(n)), find the mid and calculate lowest and highest for mid. If the given k falls between lowest and highest, return mid; else if k is less than or equal to lowest, search the lower half (0, mid-1); else search the upper half (mid+1, high).
If the number of intervals is n and the range is N, then the running time of this algorithm is n*log(N): we find lowest and highest (which run in O(n)) log(N) times.
//Function call will be `find(0,N,k,in)`
//Retrieves the no.of smaller elements than first x(excluding) in union
public static int lowest(List<List<Integer>> in, int x){
int sum = 0;
for(List<Integer> lst: in){
if(x > lst.get(1))
sum += lst.get(1) - lst.get(0)+1;
else if((x >= lst.get(0) && x<lst.get(1)) || (x > lst.get(0) && x<=lst.get(1))){
sum += x - lst.get(0);
}
}
return sum;
}
//Retrieve the no.of smaller elements than last x(including) in union.
public static int highest(List<List<Integer>> in, int x){
int sum = 0;
for(List<Integer> lst: in){
if(x > lst.get(1))
sum += lst.get(1) - lst.get(0)+1;
else if((x >= lst.get(0) && x<lst.get(1)) || (x > lst.get(0) && x<=lst.get(1))){
sum += x - lst.get(0)+1;
}
}
return sum;
}
//Do binary search on the range.
public static int find(int low, int high, int k,List<List<Integer>> in){
if(low > high)
return -1;
int mid = low + (high-low)/2;
int lowIdx = lowest(in,mid);
int highIdx = highest(in,mid);
//k lies between the current numbers high and low indices
if(k > lowIdx && k <= highIdx) return mid;
//k less than lower index. go on to left side
if(k <= lowIdx) return find(low,mid-1,k,in);
// k greater than higher index go to right
if(k > highIdx) return find(mid+1,high,k,in);
else
return -1; // catch statement
}
It's possible to count how many numbers in the list are less than some chosen number X (by iterating through all of the intervals). Now, if this number is greater than n, the solution is certainly smaller than X. Similarly, if this number is less than or equal to n, the solution is greater than or equal to X. Based on these observation we can use binary search.
Below is a Java implementation :
public int nthElement( int[] lowerBound, int[] upperBound, int n )
{
int lo = Integer.MIN_VALUE, hi = Integer.MAX_VALUE;
while ( lo < hi ) {
int X = (int)( ((long)lo+hi+1)/2 );
long count = 0;
for ( int i=0; i<lowerBound.length; ++i ) {
if ( X >= lowerBound[i] && X <= upperBound[i] ) {
// part of interval i is less than X
count += (long)X - lowerBound[i];
}
if ( X >= lowerBound[i] && X > upperBound[i] ) {
// all numbers in interval i are less than X
count += (long)upperBound[i] - lowerBound[i] + 1;
}
}
if ( count <= n ) lo = X;
else hi = X-1;
}
return lo;
}

What is the probability that the array will remain the same?

This question was asked in a Microsoft interview. I am very curious to know why these people ask such strange questions about probability.
Given a rand(N), a random generator which generates random number from 0 to N-1.
int A[N]; // An array of size N
for(i = 0; i < N; i++)
{
int m = rand(N);
int n = rand(N);
swap(A[m],A[n]);
}
EDIT: Note that the seed is not fixed.
what is the probability that array A remains the same?
Assume that the array contains unique elements.
Well I had a little fun with this one. The first thing I thought of when I first read the problem was group theory (the symmetric group Sn, in particular). The for loop simply builds a permutation σ in Sn by composing transpositions (i.e. swaps) on each iteration. My math is not all that spectacular and I'm a little rusty, so if my notation is off bear with me.
Overview
Let A be the event that our array is unchanged after permutation. We are ultimately asked to find the probability of event A, Pr(A).
My solution attempts to follow the following procedure:
Consider all possible permutations (i.e. reorderings of our array)
Partition these permutations into disjoint sets based on the number of so-called identity transpositions they contain. This helps reduce the problem to even permutations only.
Determine the probability of obtaining the identity permutation given that the permutation is even (and of a particular length).
Sum these probabilities to obtain the overall probability the array is unchanged.
1) Possible Outcomes
Notice that each iteration of the for loop creates a swap (or transposition) that results in one of two things (but never both):
Two elements are swapped.
An element is swapped with itself. For our intents and purposes, the array is unchanged.
We label the second case. Let's define an identity transposition as follows:
An identity transposition occurs when a number is swapped with itself.
That is, when n == m in the above for loop.
For any given run of the listed code, we compose N transpositions. There can be 0, 1, 2, ... , N of the identity transpositions appearing in this "chain".
For example, consider an N = 3 case:
Given our input [0, 1, 2].
Swap (0 1) and get [1, 0, 2].
Swap (1 1) and get [1, 0, 2]. ** Here is an identity **
Swap (2 2) and get [1, 0, 2]. ** And another **
Note that there is an odd number of non-identity transpositions (1) and the array is changed.
2) Partitioning Based On the Number of Identity Transpositions
Let K_i be the event that i identity transpositions appear in a given permutation. Note this forms an exhaustive partition of all possible outcomes:
No permutation can have two different quantities of identity transpositions simultaneously, and
All possible permutations must have between 0 and N identity transpositions.
Thus we can apply the Law of Total Probability:
Pr(A) = sum over i = 0, 1, ..., N of Pr(A | K_i) * Pr(K_i)
Now we can finally take advantage of the partition. Note that when the number of non-identity transpositions is odd, there is no way the array can go unchanged*. Thus:
Pr(A) = sum over those i for which N - i is even of Pr(A | K_i) * Pr(K_i)
*From group theory, a permutation is even or odd but never both. Therefore an odd permutation cannot be the identity permutation (since the identity permutation is even).
3) Determining Probabilities
So we now must determine two probabilities for N - i even: Pr(K_i) and Pr(A | K_i).
The First Term
The first term, Pr(K_i), represents the probability of obtaining a permutation with i identity transpositions. This turns out to be binomial since for each iteration of the for loop:
The outcome is independent of the results before it, and
The probability of creating an identity transposition is the same, namely 1/N.
Thus for N trials, the probability of obtaining i identity transpositions is:
Pr(K_i) = C(N, i) * (1/N)^i * (1 - 1/N)^(N - i)
The Second Term
So if you've made it this far, we have reduced the problem to finding Pr(A | K_i) for N - i even. This represents the probability of obtaining an identity permutation given that i of the transpositions are identities. I use a naive counting approach to determine the number of ways of achieving the identity permutation over the number of possible permutations.
First consider the transpositions (n, m) and (m, n) equivalent. Then, let M = N(N-1)/2 be the number of possible non-identity transpositions. We will use this quantity frequently.
The goal here is to determine the number of ways a collections of transpositions can be combined to form the identity permutation. I will try to construct the general solution along side an example of N = 4.
Let's consider the N = 4 case with all identity transpositions (i.e. i = N = 4). Let X represent an identity transposition. For each X, there are N possibilities (they are: n = m = 0, 1, 2, ... , N - 1). Thus there are N^i = 4^4 possibilities for achieving the identity permutation. For completeness, we add the binomial coefficient, C(N, i), to consider ordering of the identity transpositions (here it just equals 1). I've tried to depict this below with the physical layout of elements above and the number of possibilities below:
I = _X_ _X_ _X_ _X_
N * N * N * N * C(4, 4) => N^N * C(N, N) possibilities
Now without explicitly substituting N = 4 and i = 4, we can look at the general case. Combining the above with the denominator (the total number of ways of producing i = N identity transpositions), we find:
Pr(A | K_N) = [N^N * C(N, N)] / [N^N * C(N, N)] = 1
This is intuitive. In fact, any value other than 1 should probably alarm you. Think about it: we are given the situation in which all N transpositions are said to be identities. What's the probability that the array is unchanged in this situation? Clearly, 1.
Now, again for N = 4, let's consider 2 identity transpositions (i.e. i = N - 2 = 2). As a convention, we will place the two identities at the end (and account for ordering later). We know now that we need to pick two transpositions which, when composed, will become the identity permutation. Let's place any element in the first location, call it t1. As stated above, there are M possibilities supposing t1 is not an identity (it can't be as we have already placed two).
I = _t1_ ___ _X_ _X_
M * ? * N * N
The only element left that could possibly go in the second spot is the inverse of t1, which is in fact t1 (and this is the only one, by uniqueness of the inverse). We again include the binomial coefficient: in this case we have 4 open locations and we are looking to place 2 identity transpositions. How many ways can we do that? 4 choose 2.
I = _t1_ _t1_ _X_ _X_
M * 1 * N * N * C(4, 2) => C(N, N-2) * M * N^(N-2) possibilities
Again looking at the general case, this all corresponds to:
Pr(A | K_{N-2}) = [C(N, N-2) * M * N^(N-2)] / [C(N, N-2) * M^2 * N^(N-2)] = 1/M
Finally we do the N = 4 case with no identity transpositions (i.e. i = N - 4 = 0). Since there are a lot of possibilities, it starts to get tricky and we must be careful not to double count. We start similarly by placing a single element in the first spot and working out possible combinations. Take the easiest first: the same transposition 4 times.
I = _t1_ _t1_ _t1_ _t1_
M * 1 * 1 * 1 => M possibilities
Let's now consider two unique elements t1 and t2. There are M possibilities for t1 and only M-1 possibilities for t2 (since t2 cannot be equal to t1). If we exhaust all arrangements, we are left with the following patterns:
I = _t1_ _t1_ _t2_ _t2_
M * 1 * M-1 * 1 => M * (M - 1) possibilities (1)st
= _t1_ _t2_ _t1_ _t2_
M * M-1 * 1 * 1 => M * (M - 1) possibilities (2)nd
= _t1_ _t2_ _t2_ _t1_
M * M-1 * 1 * 1 => M * (M - 1) possibilities (3)rd
Now let's consider three unique elements, t1, t2, t3. Let's place t1 first and then t2. As usual, we have:
I = _t1_ _t2_ ___ ___
M * ? * ? * ?
We can't yet say how many possible t2s there can be yet, and we will see why in a minute.
We now place t1 in the third spot. Notice, t1 must go there, since if it were to go in the last spot we would just be recreating the (3)rd arrangement above. Double counting is bad! This leaves the third unique element t3 for the final position.
I = _t1_ _t2_ _t1_ _t3_
M * ? * 1 * ?
So why did we have to take a minute to consider the number of t2s more closely? The transpositions t1 and t2 cannot be disjoint permutations (i.e. they must share one (and only one since they also cannot be equal) of their n or m). The reason for this is because if they were disjoint, we could swap the order of permutations. This means we would be double counting the (1)st arrangement.
Say t1 = (n, m). t2 must be of the form (n, x) or (y, m) for some x and y in order to be non-disjoint. Note that x may not be n or m, and y may not be n or m. Thus, the number of possible transpositions that t2 could be is actually 2 * (N - 2).
So, coming back to our layout:
I = _t1_ _t2_ _t1_ _t3_
M * 2(N-2) * 1 * ?
Now t3 must be the inverse of the composition of t1 t2 t1. Let's do it out manually:
(n, m)(n, x)(n, m) = (m, x)
Thus t3 must be (m, x). Note this is not disjoint to t1 and not equal to either t1 or t2 so there is no double counting for this case.
I = _t1_ _t2_ _t1_ _t3_
M * 2(N-2) * 1 * 1 => M * 2(N - 2) possibilities
Finally, putting all of these together:
Pr(A | K_{N-4}) = [M + 3*M*(M-1) + 2*M*(N-2)] / M^4
(for N = 4 this is (6 + 90 + 24) / 1296 = 120 / 1296, the value used in the table below)
4) Putting it all together
So that's it. Work backwards, substituting what we found into the original summation given in step 2. I computed the answer to the N = 4 case below. It matches the empirical number found in another answer very closely!
N = 4
M = 6 _________ _____________ _________
| Pr(K_i) | Pr(A | K_i) | Product |
_________|_________|_____________|_________|
| | | | |
| i = 0 | 0.316 | 120 / 1296 | 0.029 |
|_________|_________|_____________|_________|
| | | | |
| i = 2 | 0.211 | 6 / 36 | 0.035 |
|_________|_________|_____________|_________|
| | | | |
| i = 4 | 0.004 | 1 / 1 | 0.004 |
|_________|_________|_____________|_________|
| | |
| Sum: | 0.068 |
|_____________|_________|
Correctness
It would be cool if there was a result in group theory to apply here-- and maybe there is! It would certainly help make all this tedious counting go away completely (and shorten the problem to something much more elegant). I stopped working at N = 4. For N > 5, what is given only gives an approximation (how good, I'm not sure). It is pretty clear why that is if you think about it: for example, given N = 8 transpositions, there are clearly ways of creating the identity with four unique elements which are not accounted for above. The number of ways becomes seemingly more difficult to count as the permutation gets longer (as far as I can tell...).
Anyway, I definitely couldn't do something like this within the scope of an interview. I would get as far as the denominator step if I was lucky. Beyond that, it seems pretty nasty.
Very much curious to know why these people ask so strange questions on probability?
Questions like this are asked because they allow the interviewer to gain insight into the interviewee's
ability read code (very simple code but at least something)
ability to analyse an algorithm to identify execution path
skills at applying logic to find possible outcomes and edge case
reasoning and problem solving skills as they work through the problem
communication and work skills - do they ask questions, or work in isolation based on information at hand
... and so on. The key to having a question that exposes these attributes of the interviewee is to have a piece of code that is deceptively simple. This shakes out the imposters: the non-coder is stuck; the arrogant jump to the wrong conclusion; the lazy or sub-par computer scientist finds a simple solution and stops looking. Often, as they say, it's not whether you get the right answer but whether you impress with your thought process.
I'll attempt to answer the question, too. In an interview I'd explain myself rather than provide a one-line written answer - this is because even if my 'answer' is wrong, I am able to demonstrate logical thinking.
A will remain the same - i.e. elements in the same positions - when
m == n in every iteration (so that every element only swaps with itself); or
any element that is swapped is swapped back to its original position
The first case is the 'simple' case that duedl0r gives, the case that the array isn't altered. This might be the answer, because
what is the probability that array A remains the same?
if the array changes at i = 1 and then reverts back at i = 2, it's in the original state but it didn't 'remain the same' - it was changed, and then changed back. That might be a smartass technicality.
Then considering the chance of elements being swapped and swapped back - I think that calculation is above my head in an interview. The obvious consideration is that it does not need to be a single swap-and-swap-back pair: there could just as easily be a swap between three elements, swapping 1 and 2, then 2 and 3, 1 and 3 and finally 2 and 3. And continuing, there could be swaps between 4, 5 or more items that are 'circular' like this.
In fact, rather than considering the cases where the array is unchanged, it may be simpler to consider the cases where it is changed. Consider whether this problem can be mapped onto a known structure like Pascal's triangle.
This is a hard problem. I agree that it's too hard to solve in an interview, but that doesn't mean it is too hard to ask in an interview. The poor candidate won't have an answer, the average candidate will guess the obvious answer, and the good candidate will explain why the problem is too hard to answer.
I consider this an 'open-ended' question that gives the interviewer insight into the candidate. For this reason, even though it's too hard to solve during an interview, it is a good question to ask during an interview. There's more to asking a question than just checking whether the answer is right or wrong.
Below is C code to count the number of values of the 2N-tuple of indices that rand can produce and calculate the probability. Starting with N = 0, it shows counts of 1, 1, 8, 135, 4480, 189125, and 12450816, with probabilities of 1, 1, .5, .185185, .0683594, .0193664, and .00571983. The counts do not appear in the Encyclopedia of Integer Sequences, so either my program has a bug or this is a very obscure problem. If so, the problem is not intended to be solved by a job applicant but to expose some of their thought processes and how they deal with frustration. I would not regard it as a good interview problem.
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#define swap(a, b) do { int t = (a); (a) = (b); (b) = t; } while (0)
static uint64_t count(int n)
{
// Initialize count of how many times the original order is the result.
uint64_t c = 0;
// Allocate space for selectors and initialize them to zero.
int *r = calloc(2*n, sizeof *r);
// Allocate space for array to be swapped.
int *A = malloc(n * sizeof *A);
if (!A || !r)
{
fprintf(stderr, "Out of memory.\n");
exit(EXIT_FAILURE);
}
// Iterate through all values of selectors.
while (1)
{
// Initialize A to show original order.
for (int i = 0; i < n; ++i)
A[i] = i;
// Test current selector values by executing the swap sequence.
for (int i = 0; i < 2*n; i += 2)
{
int m = r[i+0];
int j = r[i+1];
swap(A[m], A[j]);
}
// If array is in original order, increment counter.
++c; // Assume all elements are in place.
for (int i = 0; i < n; ++i)
if (A[i] != i)
{
// If any element is out of place, cancel assumption and exit.
--c;
break;
}
// Increment the selectors, odometer style.
int i;
for (i = 0; i < 2*n; ++i)
// Stop when a selector increases without wrapping.
if (++r[i] < n)
break;
else
// Wrap this selector to zero and continue.
r[i] = 0;
// Exit the routine when the last selector wraps.
if (2*n <= i)
{
free(A);
free(r);
return c;
}
}
}
int main(void)
{
for (int n = 0; n < 7; ++n)
{
uint64_t c = count(n);
printf("N = %d: %" PRId64 " times, %g probabilty.\n",
n, c, c/pow(n, 2*n));
}
return 0;
}
The behaviour of the algorithm can be modelled as a Markov chain over the symmetric group S_N.
Basics
The N elements of the array A can be arranged in N! different permutations. Let us number these permutations from 1 to N!, e.g. by lexicographic ordering. So the state of the array A at any time in the algorithm can be fully characterized by the permutation number.
For example, for N = 3, one possible numbering of all 3! = 6 permutations might be:
a b c
a c b
b a c
b c a
c a b
c b a
State transition probabilities
In each step of the algorithm, the state of A either stays the same or transitions from one of these permutations to another. We are now interested in the probabilities of these state changes. Let us call Pr(i → j) the probability that the state changes from permutation i to permutation j in a single loop iteration.
As we pick m and n uniformly and independently from the range [0, N-1], there are N² possible outcomes, each of which is equally likely.
Identity
For N of these outcomes m = n holds, so there is no change in the permutation. Therefore,
Pr(i → i) = N / N² = 1 / N.
Transpositions
The remaining N² - N cases are transpositions, i.e. two elements exchange their positions and therefore the permutation changes. Suppose one of these transpositions exchanges the elements at positions x and y. There are two cases how this transposition can be generated by the algorithm: either m = x, n = y or m = y, n = x. Thus, the probability for each transposition is 2 / N².
How does this relate to our permutations? Let us call two permutations i and j neighbors if and only if there is a transposition which transforms i into j (and vice versa). We then can conclude: Pr(i → j) = 2 / N² if i and j are neighbors, and Pr(i → j) = 0 for every other j ≠ i.
Transition matrix
We can arrange the probabilities Pr(i → j) in a transition matrix P ∈ [0,1]^(N!×N!). We define
p_ij = Pr(i → j),
where p_ij is the entry in the i-th row and j-th column of P. Note that
Pr(i → j) = Pr(j → i),
which means P is symmetric.
The key point now is the observation of what happens when we multiply P by itself. Take any element p^(2)_ij of P²:
p^(2)_ij = sum over k of p_ik * p_kj = sum over k of Pr(i → k) * Pr(k → j)
The product Pr(i → k) · Pr(k → j) is the probability that starting at permutation i we transition into permutation k in one step, and transition into permutation j after another subsequent step. Summing over all in-between permutations k therefore gives us the total probability of transitioning from i to j in 2 steps.
This argument can be extended to higher powers of P. A special consequence is the following:
p^(N)_ii is the probability of returning back to permutation i after N steps, assuming we started at permutation i.
Example
Let's play this through with N = 3. We already have a numbering for the permutations. With that numbering, the corresponding transition matrix (all entries over a common denominator of 9) is:
        abc acb bac bca cab cba
  abc (  3   2   2   0   0   2 )
  acb (  2   3   0   2   2   0 )
  bac (  2   0   3   2   2   0 )   / 9
  bca (  0   2   2   3   0   2 )
  cab (  0   2   2   0   3   2 )
  cba (  2   0   0   2   2   3 )
Multiplying P with itself gives P², which has 21/81 on the diagonal and 12/81 in every other entry.
Another multiplication yields P³, whose diagonal entries are all 15/81.
Any element of the main diagonal gives us the wanted probability, which is 15/81 or 5/27.
Discussion
While this approach is mathematically sound and can be applied to any value of N, it is not very practical in this form. The transition matrix P has N!² entries, which becomes huge very fast. Even for N = 10 the size of the matrix already exceeds 13 trillion entries. A naive implementation of this algorithm therefore appears to be infeasible.
However, in comparison to other proposals, this approach is very structured and doesn't require complex derivations beyond figuring out which permutations are neighbors. My hope is that this structuredness can be exploited to find a much simpler computation.
For example, one could exploit the fact that all diagonal elements of any power of P are equal. Assuming we can easily calculate the trace of PN, the solution is then simply tr(PN) / N!. The trace of PN is equal to the sum of the N-th powers of its eigenvalues. So if we had an efficient algorithm to compute the eigenvalues of P, we would be set. I haven't explored this further than calculating the eigenvalues up to N = 5, however.
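For small N the whole approach is easy to check numerically. Here is a brief Python sketch (my own, using exact fractions) that builds P by brute force over S_N and evaluates tr(P^N) / N!:
from itertools import permutations
from fractions import Fraction
def unchanged_probability(N):
    perms = list(permutations(range(N)))
    index = {p: k for k, p in enumerate(perms)}
    size = len(perms)                    # N!
    # One step of the chain applies a uniformly random swap (m, n).
    P = [[Fraction(0)] * size for _ in range(size)]
    for k, p in enumerate(perms):
        for m in range(N):
            for n in range(N):
                q = list(p)
                q[m], q[n] = q[n], q[m]
                P[k][index[tuple(q)]] += Fraction(1, N * N)
    M = P
    for _ in range(N - 1):               # raise P to the N-th power
        M = [[sum(M[i][k] * P[k][j] for k in range(size)) for j in range(size)]
             for i in range(size)]
    return sum(M[i][i] for i in range(size)) / size   # tr(P^N) / N!
print(unchanged_probability(3))          # 5/27
print(unchanged_probability(4))          # 35/512 = 0.068359375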
It's easy to observe the bounds 1/n^n <= p <= 1/n.
Here is an incomplete idea of showing an inverse-exponential upper bound.
You're drawing numbers from {1,2,..,n} 2n times. If any of them is unique (occurs exactly once), the array will definitely be changed, as the element has gone away and cannot return at its original place.
The probability that a fixed number is unique is 2n * 1/n * (1-1/n)^(2n-1) = 2 * (1-1/n)^(2n-1), which is asymptotically 2/e^2, bounded away from 0. [2n because you choose on which try you get it, 1/n that you got it on that try, (1-1/n)^(2n-1) that you did not get it on the other tries]
If the events were independent, you'd get that the chance that all numbers are non-unique is (1 - 2/e^2)^n, which would mean p <= O((1 - 2/e^2)^n). Unfortunately, they are not independent. I feel that the bound can be shown with more sophisticated analysis. The keyword is the "balls and bins" problem.
One simplistic solution is
p >= 1 / NN
Since one possible way the array stays the same is if m = n for every iteration. And m equals n with possibility 1 / N.
It's certainly higher than that. The question is by how much..
Second thought: One could also argue, that if you shuffle an array randomly, every permutation has equal probability. Since there are n! permutations the probability of getting just one (the one we have at the beginning) is
p = 1 / N!
which is a bit better than the previous result.
As discussed, the algorithm is biased. This means not every permutation has the same probability. So 1 / N! is not quite exact. You have to find out what the distribution of the permutations is.
FYI, not sure the bound above (1/n^2) holds:
N=5 -> 0.019648 < 1/25
N=6 -> 0.005716 < 1/36
Sampling code:
import random
def sample(times, n):
    count = 0
    for i in range(times):
        count += p(n)
    return count * 1.0 / times
def p(n):
    perm = list(range(n))
    for i in range(n):
        a = random.randrange(n)
        b = random.randrange(n)
        perm[a], perm[b] = perm[b], perm[a]
    return perm == list(range(n))
print(sample(500000, 5))
Everyone assumes that A[i] == i, which was not explicitly
stated. I'm going to make this assumption too, but note that the probability
depends on the contents. For example if A[i]=0, then the probability = 1 for
all N.
Here's how to do it. Let P(n,i) be the probability that the resulting array
differs by exactly i transpositions from the original array.
We want to know P(n,0). It's true that:
P(n,0) =
1/n * P(n-1,0) + 1/n^2 * P(n-1,1) =
1/n * P(n-1,0) + 1/n^2 * (1-1/(n-1)) * P(n-2,0)
Explanation: we can get the original array in two ways, either by making a "neutral" transposition in an array that's already good, or by reverting the only "bad" transposition. To get an array with only one "bad" transposition, we can take an array with 0 bad transpositions and make one transposition that is not neutral.
EDIT: -2 instead of -1 in P(n-1,0)
It's not a full solution, but it's something at least.
Take a particular set of swaps that have no effect. We know that it must have been the case that its swaps ended up forming a bunch of loops of different sizes, using a total of n swaps. (For the purposes of this, a swap with no effect can be considered a loop of size 1)
Perhaps we can
1) Break them down into groups based on what the sizes of the loops are
2) Calculate the number of ways to get each group.
(The main problem is that there are a ton of different groups, but I'm not sure how you'd actually calculate this if you don't take into account the different groupings.)
Interesting question.
I think the answer is 1/N, but I don't have any proof. When I find a proof, I will edit my answer.
What I got until now:
If m == n, You won't change the array.
The probability to get m == n is 1/N, because there are N^2 options and only N of them are suitable ((i,i) for every 0 <= i <= N-1).
Thus, we get N/N^2 = 1/N.
Denote Pk the probability that after k iterations of swaps, the array of size N will remain the same.
P1 = 1/N. (As we saw below)
P2 = (1/N)*P1 + ((N-1)/N)*(2/N^2) = 1/N^2 + 2(N-1)/N^3.
Explanation for P2:
We want to calculate the probability that after 2 iterations, the array with
N elements won't change. We have 2 options:
- in the 2nd iteration we got m == n (probability 1/N)
- in the 2nd iteration we got m != n (probability (N-1)/N)
If m == n, we need the array to have remained unchanged after the 1st iteration, which is P1.
If m != n, we need the 1st iteration to have chosen the same m and n
(order is not important). So we get 2/N^2.
Because those events are independent we get: P2 = (1/N)*P1 + ((N-1)/N)*(2/N^2).
Pk = (1/N)*P(k-1) + ((N-1)/N)*X (the first term for m == n, the second for m != n).
I have to think more about what X equals. (X is just a replacement for the real formula, not a constant or anything else)
Example for N = 2.
All possible swaps:
(1 1 | 1 1),(1 1 | 1 2),(1 1 | 2 1),(1 1 | 2 2),(1 2 | 1 1),(1 2 | 1 2)
(1 2 | 2 1),(1 2 | 2 2),(2 1 | 1 1),(2 1 | 1 2),(2 1 | 2 1),(2 1 | 2 2)
(2 2 | 1 1),(2 2 | 1 2),(2 2 | 2 1),(2 2 | 2 2).
Total = 16. Exactly 8 of them leave the array the same.
Thus, for N = 2, the answer is 1/2.
EDIT :
I want to introduce another approach:
We can classify swaps to three groups: constructive swaps, destructive swaps and harmless swaps.
A constructive swap is defined to be a swap that causes at least one element to move to its correct position.
A destructive swap is defined to be a swap that causes at least one element to move away from its correct position.
Harmless swap is defined to be a swap that does not belong to the other groups.
It is easy to see that this is a partition of all possible swaps. (intersection = empty set).
Now the claim I want to prove:
The array will remain the same if and only if
the number of destructive swaps equals the number of constructive swaps over the course of the iterations.
If someone has a counter-example, please write it down as a comment.
If this claim is correct, we can take all combinations and sum them -
0 harmless swaps, 1 harmless swap, ..., N harmless swaps.
And for each possible number k of harmless swaps, we check whether N-k is even; if not, we skip it. If yes, we take (N-k)/2 destructive and (N-k)/2 constructive swaps, and just look at all the possibilities.
I would model the problem as a multigraph where nodes are elements of the array and swaps is adding an un-directed(!) connection between them. Then look for loops somehow (all nodes is a part of a loop => original)
Really need to get back to work! :(
well, from mathematical perspective:
to have the array elements swapped at the same place every time, then the Rand(N) function must generate the same number twice for int m, and int n. so the probability that the Rand(N) function generates the same number twice is 1/N.
and we have Rand(N) called N times inside the for loop, so we have probability of 1/(N^2)
Naive implementation in C#.
The idea is to create all the possible permutations of the initial array and enumerate them.
Then we build a matrix of possible changes of state. Raising the matrix to the N-th power, we get a matrix showing how many ways lead from permutation #i to permutation #j in N steps. Element [0,0] shows how many ways lead back to the initial state. The sum of the elements of row #0 is the total number of different ways. Dividing the former by the latter, we get the probability.
In fact the total number of outcomes is N^(2N).
Output:
For N=1 probability is 1 (1 / 1)
For N=2 probability is 0.5 (8 / 16)
For N=3 probability is 0.1851851851851851851851851852 (135 / 729)
For N=4 probability is 0.068359375 (4480 / 65536)
For N=5 probability is 0.0193664 (189125 / 9765625)
For N=6 probability is 0.0057198259072973293366526105 (12450816 / 2176782336)
class Program
{
static void Main(string[] args)
{
for (int i = 1; i < 7; i++)
{
MainClass mc = new MainClass(i);
mc.Run();
}
}
}
class MainClass
{
int N;
int M;
List<int> comb;
List<int> lastItemIdx;
public List<List<int>> combinations;
int[,] matrix;
public MainClass(int n)
{
N = n;
comb = new List<int>();
lastItemIdx = new List<int>();
for (int i = 0; i < n; i++)
{
comb.Add(-1);
lastItemIdx.Add(-1);
}
combinations = new List<List<int>>();
}
public void Run()
{
GenerateAllCombinations();
GenerateMatrix();
int[,] m2 = matrix;
for (int i = 0; i < N - 1; i++)
{
m2 = Multiply(m2, matrix);
}
decimal same = m2[0, 0];
decimal total = 0;
for (int i = 0; i < M; i++)
{
total += m2[0, i];
}
Console.WriteLine("For N={0} probability is {1} ({2} / {3})", N, same / total, same, total);
}
private int[,] Multiply(int[,] m2, int[,] m1)
{
int[,] ret = new int[M, M];
for (int ii = 0; ii < M; ii++)
{
for (int jj = 0; jj < M; jj++)
{
int sum = 0;
for (int k = 0; k < M; k++)
{
sum += m2[ii, k] * m1[k, jj];
}
ret[ii, jj] = sum;
}
}
return ret;
}
private void GenerateMatrix()
{
M = combinations.Count;
matrix = new int[M, M];
for (int i = 0; i < M; i++)
{
matrix[i, i] = N;
for (int j = i + 1; j < M; j++)
{
if (2 == Difference(i, j))
{
matrix[i, j] = 2;
matrix[j, i] = 2;
}
else
{
matrix[i, j] = 0;
}
}
}
}
private int Difference(int x, int y)
{
int ret = 0;
for (int i = 0; i < N; i++)
{
if (combinations[x][i] != combinations[y][i])
{
ret++;
}
if (ret > 2)
{
return int.MaxValue;
}
}
return ret;
}
private void GenerateAllCombinations()
{
int placeAt = 0;
bool doRun = true;
while (doRun)
{
doRun = false;
bool created = false;
for (int i = placeAt; i < N; i++)
{
for (int j = lastItemIdx[i] + 1; j < N; j++)
{
lastItemIdx[i] = j; // remember the test
if (comb.Contains(j))
{
continue; // tail items should be nulled && their lastItemIdx set to -1
}
// success
placeAt = i;
comb[i] = j;
created = true;
break;
}
if (comb[i] == -1)
{
created = false;
break;
}
}
if (created)
{
combinations.Add(new List<int>(comb));
}
// rollback
bool canGenerate = false;
for (int k = placeAt + 1; k < N; k++)
{
lastItemIdx[k] = -1;
}
for (int k = placeAt; k >= 0; k--)
{
placeAt = k;
comb[k] = -1;
if (lastItemIdx[k] == N - 1)
{
lastItemIdx[k] = -1;
continue;
}
canGenerate = true;
break;
}
doRun = canGenerate;
}
}
}
The probability that m==n on each iteration, then do that N times. P(m==n) = 1/N. So I think P=1/(n^2) for that case. But then you have to consider the values getting swapped back. So I think the answer is (text editor got me) 1/N^N.
Question: what is the probability that array A remains the same?
Condition: Assume that the array contains unique elements.
Tried the solution in Java.
Random swapping happens on a primitive int array. In Java, method parameters are always passed by value, so what happens inside the swap method does not matter: the values of a[m] and a[n] (see swap(a[m], a[n]) in the code below) are passed to it, not the array itself.
So the answer is that the array will remain the same, regardless of the condition mentioned above. See the Java code sample below:
import java.util.Random;
public class ArrayTrick {
int a[] = new int[10];
Random random = new Random();
public void swap(int i, int j) {
int temp = i;
i = j;
j = temp;
}
public void fillArray() {
System.out.println("Filling array: ");
for (int index = 0; index < a.length; index++) {
a[index] = random.nextInt(a.length);
}
}
public void swapArray() {
System.out.println("Swapping array: ");
for (int index = 0; index < a.length; index++) {
int m = random.nextInt(a.length);
int n = random.nextInt(a.length);
swap(a[m], a[n]);
}
}
public void printArray() {
System.out.println("Printing array: ");
for (int index = 0; index < a.length; index++) {
System.out.print(" " + a[index]);
}
System.out.println();
}
public static void main(String[] args) {
ArrayTrick at = new ArrayTrick();
at.fillArray();
at.printArray();
at.swapArray();
at.printArray();
}
}
Sample output:
Filling array:
Printing array:
3 1 1 4 9 7 9 5 9 5
Swapping array:
Printing array:
3 1 1 4 9 7 9 5 9 5

There is an array having 1 to 100 numbers randomly placed. But two numbers are missing from the list. What are those two numbers? [duplicate]

I had an interesting job interview experience a while back. The question started really easy:
Q1: We have a bag containing numbers 1, 2, 3, …, 100. Each number appears exactly once, so there are 100 numbers. Now one number is randomly picked out of the bag. Find the missing number.
I've heard this interview question before, of course, so I very quickly answered along the lines of:
A1: Well, the sum of the numbers 1 + 2 + 3 + … + N is (N+1)(N/2) (see Wikipedia: sum of arithmetic series). For N = 100, the sum is 5050.
Thus, if all numbers are present in the bag, the sum will be exactly 5050. Since one number is missing, the sum will be less than this, and the difference is that number. So we can find that missing number in O(N) time and O(1) space.
At this point I thought I had done well, but all of a sudden the question took an unexpected turn:
Q2: That is correct, but now how would you do this if TWO numbers are missing?
I had never seen/heard/considered this variation before, so I panicked and couldn't answer the question. The interviewer insisted on knowing my thought process, so I mentioned that perhaps we can get more information by comparing against the expected product, or perhaps doing a second pass after having gathered some information from the first pass, etc, but I really was just shooting in the dark rather than actually having a clear path to the solution.
The interviewer did try to encourage me by saying that having a second equation is indeed one way to solve the problem. At this point I was kind of upset (for not knowing the answer beforehand), and asked if this is a general (read: "useful") programming technique, or if it's just a trick/gotcha answer.
The interviewer's answer surprised me: you can generalize the technique to find 3 missing numbers. In fact, you can generalize it to find k missing numbers.
Qk: If exactly k numbers are missing from the bag, how would you find it efficiently?
This was a few months ago, and I still couldn't figure out what this technique is. Obviously there's a Ω(N) time lower bound since we must scan all the numbers at least once, but the interviewer insisted that the TIME and SPACE complexity of the solving technique (minus the O(N) time input scan) is defined in k not N.
So the question here is simple:
How would you solve Q2?
How would you solve Q3?
How would you solve Qk?
Clarifications
Generally there are N numbers from 1..N, not just 1..100.
I'm not looking for the obvious set-based solution, e.g. using a bit set, encoding the presence/absence of each number by the value of a designated bit, therefore using O(N) bits of additional space. We can't afford any additional space proportional to N.
I'm also not looking for the obvious sort-first approach. This and the set-based approach are worth mentioning in an interview (they are easy to implement, and depending on N, can be very practical). I'm looking for the Holy Grail solution (which may or may not be practical to implement, but has the desired asymptotic characteristics nevertheless).
So again, of course you must scan the input in O(N), but you can only capture a small amount of information (defined in terms of k, not N), and must then find the k missing numbers somehow.
Here's a summary of Dimitris Andreou's link.
Remember the sums of the i-th powers, where i = 1, 2, ..., k. This reduces the problem to solving the system of equations
a1 + a2 + ... + ak = b1
a1^2 + a2^2 + ... + ak^2 = b2
...
a1^k + a2^k + ... + ak^k = bk
Using Newton's identities, knowing the bi allows us to compute
c1 = a1 + a2 + ... + ak
c2 = a1*a2 + a1*a3 + ... + a(k-1)*ak
...
ck = a1*a2*...*ak
If you expand the polynomial (x-a1)...(x-ak) the coefficients will be exactly c1, ..., ck - see Viète's formulas. Since every polynomial factors uniquely (the ring of polynomials is a Euclidean domain), this means the ai are uniquely determined, up to permutation.
This ends a proof that remembering powers is enough to recover the numbers. For constant k, this is a good approach.
However, when k is varying, the direct approach of computing c1,...,ck is prohibitively expensive, since e.g. ck is the product of all missing numbers, of magnitude n!/(n-k)!. To overcome this, perform computations in the field Zq, where q is a prime such that n <= q < 2n - it exists by Bertrand's postulate. The proof doesn't need to be changed, since the formulas still hold, and factorization of polynomials is still unique. You also need an algorithm for factorization over finite fields, for example the one by Berlekamp or Cantor-Zassenhaus.
High level pseudocode for constant k:
Compute i-th powers of given numbers
Subtract to get sums of i-th powers of unknown numbers. Call the sums bi.
Use Newton's identities to compute coefficients from bi; call them ci. Basically, c1 = b1; c2 = (c1b1 - b2)/2; see Wikipedia for exact formulas
Factor the polynomial x^k - c1*x^(k-1) + c2*x^(k-2) - ... + (-1)^k*ck.
The roots of the polynomial are the needed numbers a1, ..., ak.
For varying k, find a prime n <= q < 2n using e.g. Miller-Rabin, and perform the steps with all numbers reduced modulo q.
EDIT: The previous version of this answer stated that instead of Zq, where q is prime, it is possible to use a finite field of characteristic 2 (q=2^(log n)). This is not the case, since Newton's formulas require division by numbers up to k.
You will find it by reading the couple of pages of Muthukrishnan - Data Stream Algorithms: Puzzle 1: Finding Missing Numbers. It shows exactly the generalization you are looking for. Probably this is what your interviewer read and why he posed these questions.
Also see sdcvvc's directly related answer, which also includes pseudocode (hurray! no need to read those tricky math formulations :)) (thanks, great work!).
We can solve Q2 by summing both the numbers themselves, and the squares of the numbers.
We can then reduce the problem to
k1 + k2 = x
k1^2 + k2^2 = y
Where x and y are how far the sums are below the expected values.
Substituting gives us:
(x-k2)^2 + k2^2 = y
Which we can then solve to determine our missing numbers.
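A minimal Python sketch of that substitution (the helper name two_missing and the example values are mine, not from the answer above):
import math

def two_missing(arr, n):
    # x and y are how far the sum and the sum of squares fall short
    x = n * (n + 1) // 2 - sum(arr)
    y = n * (n + 1) * (2 * n + 1) // 6 - sum(v * v for v in arr)
    # k1 + k2 = x and k1^2 + k2^2 = y, so k1*k2 = (x*x - y) / 2;
    # k1 and k2 are the roots of t^2 - x*t + (x*x - y)/2 = 0
    prod = (x * x - y) // 2
    d = math.isqrt(x * x - 4 * prod)
    return (x - d) // 2, (x + d) // 2

print(two_missing([1, 2, 4, 6], 6))  # -> (3, 5)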
As @j_random_hacker pointed out, this is quite similar to Finding duplicates in O(n) time and O(1) space, and an adaptation of my answer there works here too.
Assuming that the "bag" is represented by a 1-based array A[] of size N - k, we can solve Qk in O(N) time and O(k) additional space.
First, we extend our array A[] by k elements, so that it is now of size N. This is the O(k) additional space. We then run the following pseudo-code algorithm:
for i := n - k + 1 to n
A[i] := A[1]
end for
for i := 1 to n - k
while A[A[i]] != A[i]
swap(A[i], A[A[i]])
end while
end for
for i := 1 to n
if A[i] != i then
print i
end if
end for
The first loop initialises the k extra entries to the same as the first entry in the array (this is just a convenient value that we know is already present in the array - after this step, any entries that were missing in the initial array of size N-k are still missing in the extended array).
The second loop permutes the extended array so that if element x is present at least once, then one of those entries will be at position A[x].
Note that although it has a nested loop, it still runs in O(N) time - a swap only occurs if there is an i such that A[i] != i, and each swap sets at least one element such that A[i] == i, where that wasn't true before. This means that the total number of swaps (and thus the total number of executions of the while loop body) is at most N-1.
The third loop prints those indexes of the array i that are not occupied by the value i - this means that i must have been missing.
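For illustration, here is a rough Python rendering of the pseudo-code above; the padding with copies of the first element and the 0-based indexing are my adaptation:
def missing_by_permutation(a, n, k):
    # a has length n - k with values from 1..n; pad it to length n
    a = a + [a[0]] * k
    # put every value v that is present somewhere at index v - 1
    for i in range(n):
        while a[a[i] - 1] != a[i]:
            j = a[i] - 1
            a[i], a[j] = a[j], a[i]
    # indexes not holding their own value are the missing numbers
    return [i + 1 for i in range(n) if a[i] != i + 1]

print(missing_by_permutation([7, 1, 4, 9, 6, 2], 10, 4))  # -> [3, 5, 8, 10]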
I asked a 4-year-old to solve this problem. He sorted the numbers and then counted along. This has a space requirement of O(kitchen floor), and it works just as easy however many balls are missing.
Not sure if it's the most efficient solution, but I would loop over all entries and use a bitset to remember which numbers are present, and then test for 0 bits.
I like simple solutions - and I even believe that it might be faster than calculating the sum, or the sum of squares, etc.
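Something like this minimal Python sketch of the bitset idea (the example bag is mine):
bag = [v for v in range(1, 101) if v not in (17, 42)]   # example: 17 and 42 missing
present = [False] * 101                                 # index 0 unused; numbers are 1..100
for v in bag:
    present[v] = True
print([i for i in range(1, 101) if not present[i]])     # -> [17, 42]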
I haven't checked the maths, but I suspect that computing Σ(n^2) in the same pass as we compute Σ(n) would provide enough info to get two missing numbers; do Σ(n^3) as well if there are three, and so on.
The problem with solutions based on sums of numbers is that they don't take into account the cost of storing and working with numbers with large exponents... in practice, for it to work for very large n, a big-number library would be used. We can analyse the space utilisation for these algorithms.
We can analyse the time and space complexity of sdcvvc and Dimitris Andreou's algorithms.
Storage:
l_j = ceil (log_2 (sum_{i=1}^n i^j))
l_j > log_2 n^j (assuming n >= 0, k >= 0)
l_j > j log_2 n \in \Omega(j log n)
l_j < log_2 ((sum_{i=1}^n i)^j) + 1
l_j < j log_2 (n) + j log_2 (n + 1) - j log_2 (2) + 1
l_j < j log_2 n + j + c \in O(j log n)
So l_j \in \Theta(j log n)
Total storage used: \sum_{j=1}^k l_j \in \Theta(k^2 log n)
Time used: assuming that computing a^j takes ceil(log_2 j) time, the total time is:
t = k ceil(\sum_i=1^n log_2 (i)) = k ceil(log_2 (\prod_i=1^n (i)))
t > k log_2 (n^n + O(n^(n-1)))
t > k log_2 (n^n) = kn log_2 (n) \in \Omega(kn log n)
t < k log_2 (\prod_i=1^n i) + 1
t < kn log_2 (n) + 1 \in O(kn log n)
Total time used: \Theta(kn log n)
If this time and space is satisfactory, you can use a simple recursive
algorithm. Let b!i be the ith entry in the bag, n the number of numbers before
removals, and k the number of removals. In Haskell syntax...
let
  -- O(1)
  isInRange low high v = (v >= low) && (v <= high)
  -- O(n - k)
  countInRange low high = sum $ map (fromEnum . isInRange low high . (b !)) [1..(n-k)]
  findMissing l low high krange
    -- O(1) if there is nothing to find.
    | krange == 0 = l
    -- O(1) if there is only one possibility.
    | low == high = low:l
    -- Otherwise a total of O(k n log(n)) time
    | otherwise =
        let
          mid = (low + high) `div` 2
          -- missing numbers in the lower half = half size minus numbers present there
          klow = (mid - low + 1) - countInRange low mid
          khigh = krange - klow
        in
          findMissing (findMissing l low mid klow) (mid + 1) high khigh
in
  findMissing [] 1 n k
Storage used: O(k) for list, O(log(n)) for stack: O(k + log(n))
This algorithm is more intuitive, has the same time complexity, and uses less space.
A very simple solution to Q2 which I'm surprised nobody answered already. Use the method from Q1 to find the sum of the two missing numbers. Let's denote it by S, then one of the missing numbers is smaller than S/2 and the other is bigger than S/2 (duh). Sum all the numbers from 1 to S/2 and compare it to the formula's result (similarly to the method in Q1) to find the lower between the missing numbers. Subtract it from S to find the bigger missing number.
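A small Python sketch of this idea (function name and example are mine):
def two_missing_simple(arr, n):
    s = n * (n + 1) // 2 - sum(arr)   # sum of the two missing numbers
    half = (s - 1) // 2               # only the smaller missing number can be <= half
    lower = half * (half + 1) // 2 - sum(v for v in arr if v <= half)
    return lower, s - lower

print(two_missing_simple([1, 2, 4, 6], 6))  # -> (3, 5)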
Wait a minute. As the question is stated, there are 100 numbers in the bag. No matter how big k is, the problem can be solved in constant time because you can use a set and remove numbers from the set in at most 100 - k iterations of a loop. 100 is constant. The set of remaining numbers is your answer.
If we generalise the solution to the numbers from 1 to N, nothing changes except N is not a constant, so we are in O(N - k) = O(N) time. For instance, if we use a bit set, we set the bits to 1 in O(N) time, iterate through the numbers, setting the bits to 0 as we go (O(N-k) = O(N)) and then we have the answer.
It seems to me that the interviewer was asking you how to print out the contents of the final set in O(k) time rather than O(N) time. Clearly, with a bit set, you have to iterate through all N bits to determine whether you should print the number or not. However, if you change the way the set is implemented you can print out the numbers in k iterations. This is done by putting the numbers into an object to be stored in both a hash set and a doubly linked list. When you remove an object from the hash set, you also remove it from the list. The answers will be left in the list which is now of length k.
To solve the 2 (and 3) missing numbers question, you can modify quickselect, which on average runs in O(n) and uses constant memory if partitioning is done in-place.
Partition the set with respect to a random pivot p into partitions l, which contain numbers smaller than the pivot, and r, which contain numbers greater than the pivot.
Determine which partitions the 2 missing numbers are in by comparing the pivot value to the size of each partition (p - 1 - count(l) = count of missing numbers in l and
n - count(r) - p = count of missing numbers in r)
a) If each partition is missing one number, then use the difference of sums approach to find each missing number.
(1 + 2 + ... + (p-1)) - sum(l) = missing #1 and
((p+1) + (p+2) ... + n) - sum(r) = missing #2
b) If one partition is missing both numbers and the partition is empty, then the missing numbers are either (p-1,p-2) or (p+1,p+2)
depending on which partition is missing the numbers.
If one partition is missing 2 numbers but is not empty, then recurse onto that partition.
With only 2 missing numbers, this algorithm always discards at least one partition, so it retains O(n) average time complexity of quickselect. Similarly, with 3 missing numbers this algorithm also discards at least one partition with each pass (because as with 2 missing numbers, at most only 1 partition will contain multiple missing numbers). However, I'm not sure how much the performance decreases when more missing numbers are added.
Here's an implementation that does not use in-place partitioning, so this example does not meet the space requirement but it does illustrate the steps of the algorithm:
<?php
$list = range(1,100);
unset($list[3]);
unset($list[31]);
findMissing($list,1,100);
function findMissing($list, $min, $max) {
if(empty($list)) {
print_r(range($min, $max));
return;
}
$l = $r = [];
$pivot = array_pop($list);
foreach($list as $number) {
if($number < $pivot) {
$l[] = $number;
}
else {
$r[] = $number;
}
}
if(count($l) == $pivot - $min - 1) {
// only 1 missing number use difference of sums
print array_sum(range($min, $pivot-1)) - array_sum($l) . "\n";
}
else if(count($l) < $pivot - $min) {
// more than 1 missing number, recurse
findMissing($l, $min, $pivot-1);
}
if(count($r) == $max - $pivot - 1) {
// only 1 missing number use difference of sums
print array_sum(range($pivot + 1, $max)) - array_sum($r) . "\n";
} else if(count($r) < $max - $pivot) {
// more than 1 missing number, recurse
findMissing($r, $pivot+1, $max);
}
}
Demo
For Q2 this is a solution that is a bit more inefficient than the others, but still has O(N) runtime and takes O(k) space.
The idea is to run the original algorithm two times. In the first pass you get the sum of the two missing numbers, which also gives an upper bound on them. Let's call this sum S. You know that the missing two numbers sum to S, so the smaller one can only be in the interval [1, floor((S-1)/2)] while the larger one is in [floor(S/2)+1, S-1].
Thus you loop on all numbers once again, discarding all numbers that are not included in the first interval. The ones that are, you keep track of their sum. Finally, you'll know one of the missing two numbers, and by extension the second.
I have a feeling that this method could be generalized and maybe multiple searches run in "parallel" during a single pass over the input, but I haven't yet figured out how.
Here's a solution that uses k bits of extra storage, without any clever tricks and just straightforward. Execution time O (n), extra space O (k). Just to prove that this can be solved without reading up on the solution first or being a genius:
void puzzle (int* data, int n, bool* extra, int k)
{
// data contains n distinct numbers from 1 to n + k, extra provides
// space for k extra bits.
// Rearrange the array so there are (even) even numbers at the start
// and (odd) odd numbers at the end.
int even = 0, odd = 0;
while (even + odd < n)
{
if (data [even] % 2 == 0) ++even;
else if (data [n - 1 - odd] % 2 == 1) ++odd;
else { int tmp = data [even]; data [even] = data [n - 1 - odd];
data [n - 1 - odd] = tmp; ++even; ++odd; }
}
// Erase the lowest bits of all numbers and set the extra bits to 0.
for (int i = even; i < n; ++i) data [i] -= 1;
for (int i = 0; i < k; ++i) extra [i] = false;
// Set a bit for every number that is present
for (int i = 0; i < n; ++i)
{
int tmp = data [i];
tmp -= (tmp % 2);
if (i >= even) ++tmp;
if (tmp <= n) data [tmp - 1] += 1; else extra [tmp - n - 1] = true;
}
// Print out the missing ones
for (int i = 1; i <= n; ++i)
if (data [i - 1] % 2 == 0) printf ("Number %d is missing\n", i);
for (int i = n + 1; i <= n + k; ++i)
if (! extra [i - n - 1]) printf ("Number %d is missing\n", i);
// Restore the lowest bits again.
for (int i = 0; i < n; ++i) {
if (i < even) { if (data [i] % 2 != 0) data [i] -= 1; }
else { if (data [i] % 2 == 0) data [i] += 1; }
}
}
Motivation
If you want to solve the general-case problem, and you can store and edit the array, then Caf's solution is by far the most efficient. If you can't store the array (streaming version), then sdcvvc's answer is the only type of solution currently suggested.
The solution I propose is the most efficient answer (so far on this thread) if you can store the array but can't edit it, and I got the idea from Svalorzen's solution, which solves for 1 or 2 missing items. This solution takes Θ(k*n) time, and between Ω(log(k)) and O(min(k, log(n))) space. It also works well with parallelism.
Concept
The idea is that if you use the original approach of comparing sums:
sum = SumOf(1,n) - SumOf(array)
... then you take the average of the missing numbers:
average = sum/n_missing_numbers
... which provides a boundary: Of the missing numbers, there's guaranteed to be at least one number less-or-equal to average, and at least one number greater than average. This means that we can split into sub problems that each scan the array [O(n)] and are only concerned with their respective sub-arrays.
Code
C-style solution (don't judge me for the global variables, I'm just trying to make the code readable for non-c folks):
#include "stdio.h"
// Example problem:
const int array [] = {0, 7, 3, 1, 5};
const int N = 8; // size of original array
const int array_size = 5;
int SumOneTo (int n)
{
return n*(n-1)/2; // non-inclusive
}
int MissingItems (const int begin, const int end, int & average)
{
// We consider only sub-array elements with values, v:
// begin <= v < end
// Initialise info about missing elements.
// First assume all are missing:
int n = end - begin;
int sum = SumOneTo(end) - SumOneTo(begin);
// Minus everything that we see (ie not missing):
for (int i = 0; i < array_size; ++i)
{
if ((begin <= array[i]) && (array[i] < end))
{
--n;
sum -= array[i];
}
}
// used by caller:
average = sum/n;
return n;
}
void Find (const int begin, const int end)
{
int average;
if (MissingItems(begin, end, average) == 1)
{
printf(" %d", average); // average(n) is same as n
return;
}
Find(begin, average + 1); // at least one missing here
Find(average + 1, end); // at least one here also
}
int main ()
{
printf("Missing items:");
Find(0, N);
printf("\n");
}
Analysis
Ignoring recursion for a moment, each function call clearly takes O(n) time and O(1) space. Note that sum can equal as much as n(n-1)/2, so it requires double the number of bits needed to store n-1. At most this means that we effectively need two extra elements' worth of space, regardless of the size of the array or k, hence it's still O(1) space under the normal conventions.
It's not so obvious how many function calls there are for k missing elements, so I'll provide a visual. Your original sub-array (connected array) is the full array, which has all k missing elements in it. We'll imagine them in increasing order, where -- represent connections (part of same sub-array):
m1 -- m2 -- m3 -- m4 -- (...) -- m(k-1) -- mk
The effect of the Find function is to disconnect the missing elements into different non-overlapping sub-arrays. It guarantees that there's at least one missing element in each sub-array, which means breaking exactly one connection.
What this means is that regardless of how the splits occur, it will always take k-1 Find function calls to do the work of finding the sub-arrays that have only one missing element in it.
So the time complexity is Θ((k-1 + k) * n) = Θ(k*n).
For the space complexity, if we divide proportionally each time then we get O(log(k)) space complexity, but if we only separate one at a time it gives us O(k).
See here for a proof as to why the space complexity is O(log(n)). Given that above we've shown that it's also O(k), then we know that it's O(min(k,log(n))).
Maybe this algorithm can work for question 1:
Precompute xor of first 100 integers(val=1^2^3^4....100)
xor the elements as they keep coming from input stream ( val1=val1^next_input)
final answer=val^val1
Or even better:
def GetValue(A):
    val = 0
    for i in range(1, 101):
        val ^= i
    for value in A:
        val ^= value
    return val
This algorithm can in fact be expanded to two missing numbers. The first step remains the same. When we call GetValue with two missing numbers, the result will be a1^a2, where a1 and a2 are the two missing numbers. Let's say
val = a1^a2
Now, to sieve out a1 and a2 from val, we take any set bit in val. Let's say the ith bit is set in val. That means that a1 and a2 differ at the ith bit position.
Now we do another iteration on the original array and keep two xor values: one for the numbers which have the ith bit set and another for those which don't. We now have two buckets of numbers, and it's guaranteed that a1 and a2 will lie in different buckets. Now repeat what we did for finding one missing element on each of the buckets.
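Here's a small Python sketch of that two-missing-number extension (names and the example are mine):
def two_missing_xor(arr, n):
    x = 0
    for i in range(1, n + 1):
        x ^= i
    for v in arr:
        x ^= v                    # x is now a1 ^ a2
    bit = x & -x                  # any set bit; a1 and a2 differ in it
    a = b = 0
    for i in range(1, n + 1):     # split 1..n into two buckets by that bit
        if i & bit:
            a ^= i
        else:
            b ^= i
    for v in arr:                 # do the same with the array values
        if v & bit:
            a ^= v
        else:
            b ^= v
    return a, b

print(two_missing_xor([1, 2, 4, 6], 6))  # -> (3, 5) in some order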
There is a general way to solve streaming problems like this.
The idea is to use a bit of randomization to hopefully 'spread' the k elements into independent sub problems, where our original algorithm solves the problem for us. This technique is used in sparse signal reconstruction, among other things.
Make an array, a, of size u = k^2.
Pick any universal hash function, h : {1,...,n} -> {1,...,u}. (Like multiply-shift)
For each i in 1, ..., n increase a[h(i)] += i
For each number x in the input stream, decrement a[h(x)] -= x.
If all of the missing numbers have been hashed to different buckets, the non-zero elements of the array will now contain the missing numbers.
The probability that a particular pair is sent to the same bucket, is less than 1/u by definition of a universal hash function. Since there are about k^2/2 pairs, we have that the error probability is at most k^2/2/u=1/2. That is, we succeed with probability at least 50%, and if we increase u we increase our chances.
Notice that this algorithm takes k^2 logn bits of space (we need logn bits per array bucket.) This matches the space required by @Dimitris Andreou's answer (in particular the space requirement of polynomial factorization, which happens to also be randomized.)
This algorithm also has constant time per update, rather than time k in the case of power-sums.
In fact, we can be even more efficient than the power sum method by using the trick described in the comments.
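A toy Python sketch of those steps; the multiply-mod hash below just stands in for a proper universal hash function, and the output is only correct when no two missing numbers collide in a bucket:
import random

def missing_by_hashing(stream, n, k):
    u = max(1, k * k)                       # number of buckets
    p = (1 << 61) - 1                       # a large prime for the hash
    coef = random.randrange(1, p)
    h = lambda x: (coef * x % p) % u
    buckets = [0] * u
    for i in range(1, n + 1):
        buckets[h(i)] += i
    for x in stream:
        buckets[h(x)] -= x                  # present numbers cancel out
    # each non-empty bucket now holds one missing number (if no collisions)
    return sorted(v for v in buckets if v != 0)

print(missing_by_hashing([1, 2, 4, 6], 6, 2))  # usually -> [3, 5]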
Can you check if every number exists? If yes you may try this:
S = sum of all numbers in the bag (S < 5050)
Z = sum of the missing numbers = 5050 - S
if the missing numbers are x and y then:
x = Z - y and
max(x) = Z - 1
So you check the range from 1 to max(x) and find the number
You can solve Q2 if you have the sum of both lists and the product of both lists.
(l1 is the original, l2 is the modified list)
s = sum(l1) - sum(l2)
m = mul(l1) / mul(l2)
We can optimise this since the sum of an arithmetic series is n times the average of the first and last terms:
n = len(l1)
s = (n/2)*(n+1) - sum(l2)
Now we know that (if a and b are the removed numbers):
a + b = s
a * b = m
So we can rearrange to:
a = s - b
b * (s - b) = m
And multiply out:
-b^2 + s*b = m
And rearrange so the right side is zero:
-b^2 + s*b - m = 0
Then we can solve with the quadratic formula:
b = (-s + sqrt(s^2 - (4*-1*-m)))/-2
a = s - b
Sample Python 3 code:
from functools import reduce
import operator
import math
x = list(range(1,21))
sx = (len(x)/2)*(len(x)+1)
x.remove(15)
x.remove(5)
mul = lambda l: reduce(operator.mul,l)
s = sx - sum(x)
m = mul(range(1,21)) / mul(x)
b = (-s + math.sqrt(s**2 - (-4*(-m))))/-2
a = s - b
print(a,b) #15,5
I do not know the complexity of the sqrt, reduce and sum functions so I cannot work out the complexity of this solution (if anyone does know please comment below.)
Here is a solution that doesn't rely on complex math as sdcvvc's/Dimitris Andreou's answers do, doesn't change the input array as caf and Colonel Panic did, and doesn't use the bitset of enormous size as Chris Lercher, JeremyP and many others did. Basically, I began with Svalorzen's/Gilad Deutch's idea for Q2, generalized it to the common case Qk and implemented in Java to prove that the algorithm works.
The idea
Suppose we have an arbitrary interval I of which we only know that it contains at least one of the missing numbers. After one pass through the input array, looking only at the numbers from I, we can obtain both the sum S and the quantity Q of missing numbers from I. We do this by simply decrementing I's length each time we encounter a number from I (for obtaining Q) and by decreasing pre-calculated sum of all numbers in I by that encountered number each time (for obtaining S).
Now we look at S and Q. If Q = 1, it means that I contains only one of the missing numbers, and this number is clearly S. We mark I as finished (it is called "unambiguous" in the program) and leave it out from further consideration. On the other hand, if Q > 1, we can calculate the average A = S / Q of missing numbers contained in I. As all numbers are distinct, at least one such number is strictly less than A and at least one is strictly greater than A. Now we split I at A into two smaller intervals each of which contains at least one missing number. Note that it doesn't matter to which of the intervals we assign A in case it is an integer.
We make the next array pass calculating S and Q for each of the intervals separately (but in the same pass) and after that mark intervals with Q = 1 and split intervals with Q > 1. We continue this process until there are no new "ambiguous" intervals, i.e. we have nothing to split because each interval contains exactly one missing number (and we always know this number because we know S). We start out from the sole "whole range" interval containing all possible numbers (like [1..N] in the question).
Time and space complexity analysis
The total number of passes p we need to make until the process stops is never greater than the missing numbers count k. The inequality p <= k can be proved rigorously. On the other hand, there is also an empirical upper bound p < log2N + 3 that is useful for large values of k. We need to make a binary search for each number of the input array to determine the interval to which it belongs. This adds the log k multiplier to the time complexity.
In total, the time complexity is O(N · min(k, log N) · log k). Note that for large k, this is significantly better than that of sdcvvc/Dimitris Andreou's method, which is O(N · k).
For its work, the algorithm requires O(k) additional space for storing at most k intervals, which is significantly better than O(N) in "bitset" solutions.
Java implementation
Here's a Java class that implements the above algorithm. It always returns a sorted array of missing numbers. Besides that, it doesn't require the missing numbers count k because it calculates it in the first pass. The whole range of numbers is given by the minNumber and maxNumber parameters (e.g. 1 and 100 for the first example in the question).
public class MissingNumbers {
private static class Interval {
boolean ambiguous = true;
final int begin;
int quantity;
long sum;
Interval(int begin, int end) { // begin inclusive, end exclusive
this.begin = begin;
quantity = end - begin;
sum = quantity * ((long)end - 1 + begin) / 2;
}
void exclude(int x) {
quantity--;
sum -= x;
}
}
public static int[] find(int minNumber, int maxNumber, NumberBag inputBag) {
Interval full = new Interval(minNumber, ++maxNumber);
for (inputBag.startOver(); inputBag.hasNext();)
full.exclude(inputBag.next());
int missingCount = full.quantity;
if (missingCount == 0)
return new int[0];
Interval[] intervals = new Interval[missingCount];
intervals[0] = full;
int[] dividers = new int[missingCount];
dividers[0] = minNumber;
int intervalCount = 1;
while (true) {
int oldCount = intervalCount;
for (int i = 0; i < oldCount; i++) {
Interval itv = intervals[i];
if (itv.ambiguous)
if (itv.quantity == 1) // number inside itv uniquely identified
itv.ambiguous = false;
else
intervalCount++; // itv will be split into two intervals
}
if (oldCount == intervalCount)
break;
int newIndex = intervalCount - 1;
int end = maxNumber;
for (int oldIndex = oldCount - 1; oldIndex >= 0; oldIndex--) {
// newIndex always >= oldIndex
Interval itv = intervals[oldIndex];
int begin = itv.begin;
if (itv.ambiguous) {
// split interval itv
// use floorDiv instead of / because input numbers can be negative
int mean = (int)Math.floorDiv(itv.sum, itv.quantity) + 1;
intervals[newIndex--] = new Interval(mean, end);
intervals[newIndex--] = new Interval(begin, mean);
} else
intervals[newIndex--] = itv;
end = begin;
}
for (int i = 0; i < intervalCount; i++)
dividers[i] = intervals[i].begin;
for (inputBag.startOver(); inputBag.hasNext();) {
int x = inputBag.next();
// find the interval to which x belongs
int i = java.util.Arrays.binarySearch(dividers, 0, intervalCount, x);
if (i < 0)
i = -i - 2;
Interval itv = intervals[i];
if (itv.ambiguous)
itv.exclude(x);
}
}
assert intervalCount == missingCount;
for (int i = 0; i < intervalCount; i++)
dividers[i] = (int)intervals[i].sum;
return dividers;
}
}
For fairness, this class receives input in form of NumberBag objects. NumberBag doesn't allow array modification and random access and also counts how many times the array was requested for sequential traversing. It is also more suitable for large array testing than Iterable<Integer> because it avoids boxing of primitive int values and allows wrapping a part of a large int[] for a convenient test preparation. It is not hard to replace, if desired, NumberBag by int[] or Iterable<Integer> type in the find signature, by changing two for-loops in it into foreach ones.
import java.util.*;
public abstract class NumberBag {
private int passCount;
public void startOver() {
passCount++;
}
public final int getPassCount() {
return passCount;
}
public abstract boolean hasNext();
public abstract int next();
// A lightweight version of Iterable<Integer> to avoid boxing of int
public static NumberBag fromArray(int[] base, int fromIndex, int toIndex) {
return new NumberBag() {
int index = toIndex;
public void startOver() {
super.startOver();
index = fromIndex;
}
public boolean hasNext() {
return index < toIndex;
}
public int next() {
if (index >= toIndex)
throw new NoSuchElementException();
return base[index++];
}
};
}
public static NumberBag fromArray(int[] base) {
return fromArray(base, 0, base.length);
}
public static NumberBag fromIterable(Iterable<Integer> base) {
return new NumberBag() {
Iterator<Integer> it;
public void startOver() {
super.startOver();
it = base.iterator();
}
public boolean hasNext() {
return it.hasNext();
}
public int next() {
return it.next();
}
};
}
}
Tests
Simple examples demonstrating the usage of these classes are given below.
import java.util.*;
public class SimpleTest {
public static void main(String[] args) {
int[] input = { 7, 1, 4, 9, 6, 2 };
NumberBag bag = NumberBag.fromArray(input);
int[] output = MissingNumbers.find(1, 10, bag);
System.out.format("Input: %s%nMissing numbers: %s%nPass count: %d%n",
Arrays.toString(input), Arrays.toString(output), bag.getPassCount());
List<Integer> inputList = new ArrayList<>();
for (int i = 0; i < 10; i++)
inputList.add(2 * i);
Collections.shuffle(inputList);
bag = NumberBag.fromIterable(inputList);
output = MissingNumbers.find(0, 19, bag);
System.out.format("%nInput: %s%nMissing numbers: %s%nPass count: %d%n",
inputList, Arrays.toString(output), bag.getPassCount());
// Sieve of Eratosthenes
final int MAXN = 1_000;
List<Integer> nonPrimes = new ArrayList<>();
nonPrimes.add(1);
int[] primes;
int lastPrimeIndex = 0;
while (true) {
primes = MissingNumbers.find(1, MAXN, NumberBag.fromIterable(nonPrimes));
int p = primes[lastPrimeIndex]; // guaranteed to be prime
int q = p;
for (int i = lastPrimeIndex++; i < primes.length; i++) {
q = primes[i]; // not necessarily prime
int pq = p * q;
if (pq > MAXN)
break;
nonPrimes.add(pq);
}
if (q == p)
break;
}
System.out.format("%nSieve of Eratosthenes. %d primes up to %d found:%n",
primes.length, MAXN);
for (int i = 0; i < primes.length; i++)
System.out.format(" %4d%s", primes[i], (i % 10) < 9 ? "" : "\n");
}
}
Large array testing can be performed this way:
import java.util.*;
public class BatchTest {
private static final Random rand = new Random();
public static int MIN_NUMBER = 1;
private final int minNumber = MIN_NUMBER;
private final int numberCount;
private final int[] numbers;
private int missingCount;
public long finderTime;
public BatchTest(int numberCount) {
this.numberCount = numberCount;
numbers = new int[numberCount];
for (int i = 0; i < numberCount; i++)
numbers[i] = minNumber + i;
}
private int passBound() {
int mBound = missingCount > 0 ? missingCount : 1;
int nBound = 34 - Integer.numberOfLeadingZeros(numberCount - 1); // ceil(log_2(numberCount)) + 2
return Math.min(mBound, nBound);
}
private void error(String cause) {
throw new RuntimeException("Error on '" + missingCount + " from " + numberCount + "' test, " + cause);
}
// returns the number of times the input array was traversed in this test
public int makeTest(int missingCount) {
this.missingCount = missingCount;
// numbers array is reused when numberCount stays the same,
// just Fisher–Yates shuffle it for each test
for (int i = numberCount - 1; i > 0; i--) {
int j = rand.nextInt(i + 1);
if (i != j) {
int t = numbers[i];
numbers[i] = numbers[j];
numbers[j] = t;
}
}
final int bagSize = numberCount - missingCount;
NumberBag inputBag = NumberBag.fromArray(numbers, 0, bagSize);
finderTime -= System.nanoTime();
int[] found = MissingNumbers.find(minNumber, minNumber + numberCount - 1, inputBag);
finderTime += System.nanoTime();
if (inputBag.getPassCount() > passBound())
error("too many passes (" + inputBag.getPassCount() + " while only " + passBound() + " allowed)");
if (found.length != missingCount)
error("wrong result length");
int j = bagSize; // "missing" part beginning in numbers
Arrays.sort(numbers, bagSize, numberCount);
for (int i = 0; i < missingCount; i++)
if (found[i] != numbers[j++])
error("wrong result array, " + i + "-th element differs");
return inputBag.getPassCount();
}
public static void strideCheck(int numberCount, int minMissing, int maxMissing, int step, int repeats) {
BatchTest t = new BatchTest(numberCount);
System.out.println("╠═══════════════════════╬═════════════════╬═════════════════╣");
for (int missingCount = minMissing; missingCount <= maxMissing; missingCount += step) {
int minPass = Integer.MAX_VALUE;
int passSum = 0;
int maxPass = 0;
t.finderTime = 0;
for (int j = 1; j <= repeats; j++) {
int pCount = t.makeTest(missingCount);
if (pCount < minPass)
minPass = pCount;
passSum += pCount;
if (pCount > maxPass)
maxPass = pCount;
}
System.out.format("║ %9d %9d ║ %2d %5.2f %2d ║ %11.3f ║%n", missingCount, numberCount, minPass,
(double)passSum / repeats, maxPass, t.finderTime * 1e-6 / repeats);
}
}
public static void main(String[] args) {
System.out.println("╔═══════════════════════╦═════════════════╦═════════════════╗");
System.out.println("║ Number count ║ Passes ║ Average time ║");
System.out.println("║ missimg total ║ min avg max ║ per search (ms) ║");
long time = System.nanoTime();
strideCheck(100, 0, 100, 1, 20_000);
strideCheck(100_000, 2, 99_998, 1_282, 15);
MIN_NUMBER = -2_000_000_000;
strideCheck(300_000_000, 1, 10, 1, 1);
time = System.nanoTime() - time;
System.out.println("╚═══════════════════════╩═════════════════╩═════════════════╝");
System.out.format("%nSuccess. Total time: %.2f s.%n", time * 1e-9);
}
}
Try them out on Ideone
I think this can be done without any complex mathematical equations and theories. Below is a proposal for an in-place solution with O(2n), i.e. O(n), time complexity:
Input form assumptions :
# of numbers in bag = n
# of missing numbers = k
The numbers in the bag are represented by an array of length n
Length of input array for the algo = n
Missing entries in the array (numbers taken out of the bag) are replaced by the value of the first element in the array.
Eg. Initially bag looks like [2,9,3,7,8,6,4,5,1,10].
If 4 is taken out, its slot will take the value 2 (the first element of the array).
Therefore after taking 4 out the bag will look like [2,9,3,7,8,6,2,5,1,10]
The key to this solution is to tag the INDEX of a visited number by negating the value at that INDEX as the array is traversed.
IEnumerable<int> GetMissingNumbers(int[] arrayOfNumbers)
{
List<int> missingNumbers = new List<int>();
int arrayLength = arrayOfNumbers.Length;
//First Pass
for (int i = 0; i < arrayLength; i++)
{
int index = Math.Abs(arrayOfNumbers[i]) - 1;
if (index > -1)
{
arrayOfNumbers[index] = Math.Abs(arrayOfNumbers[index]) * -1; //Marking the visited indexes
}
}
//Second Pass to get missing numbers
for (int i = 0; i < arrayLength; i++)
{
//If this index is unvisited, means this is a missing number
if (arrayOfNumbers[i] > 0)
{
missingNumbers.Add(i + 1);
}
}
return missingNumbers;
}
Thanks for this very interesting question:
It's because you reminded me of Newton's work, which really can solve this problem.
Please refer to Newton's Identities.
Since the number of variables to find must equal the number of equations (a must for consistency),
I believe we should raise the bag numbers to successive powers so as to create that number of different equations.
In other words, there should be a function, say f, for which we accumulate f(xi):
x1 + x2 + ... + xk = z1
x1^2 + x2^2 + ... + xk^2 = z2
...
x1^k + x2^k + ... + xk^k = zk
The rest is mathematical work - I'm not sure about the time and space complexity, but Newton's identities will surely play an important role.
Can't we use set theory, e.g. .difference_update(), for this question? Or is there any chance of applying linear algebra here?
You'd probably need clarification on what O(k) means.
Here's a trivial solution for arbitrary k: for each v in your set of numbers, accumulate the sum of 2^v. At the end, loop i from 1 to N. If the sum bitwise ANDed with 2^i is zero, then i is missing. (Or numerically, if the floor of the sum divided by 2^i is even. Or (sum modulo 2^(i+1)) < 2^i.)
Easy, right? O(N) time, O(1) storage, and it supports arbitrary k.
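A throwaway Python sketch of that trick, relying on Python's arbitrary-precision integers (which is exactly the hidden O(N) storage pointed out just below):
def missing_by_big_sum(bag, n):
    s = sum(1 << v for v in bag)     # one bit per present number
    return [i for i in range(1, n + 1) if not (s >> i) & 1]

print(missing_by_big_sum([1, 2, 4, 6], 6))  # -> [3, 5]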
Except that you're computing enormous numbers that on a real computer would each require O(N) space. In fact, this solution is identical to a bit vector.
So you could be clever and compute the sum and the sum of squares and the sum of cubes... up to the sum of v^k, and do the fancy math to extract the result. But those are big numbers too, which begs the question: what abstract model of operation are we talking about? How much fits in O(1) space, and how long does it take to sum up numbers of whatever size you need?
I have read all thirty answers and found the simplest one, i.e. using a bit array of size 100, to be the best. But as the question says we can't use an array of size N, I would use O(1) space complexity and k iterations, i.e. O(NK) time complexity, to solve this.
To make the explanation simpler, consider that I have been given numbers from 1 to 15 and two of them, 9 and 14, are missing (but I don't know which). Let the bag look like this:
[8,1,2,12,4,7,5,10,11,13,15,3,6].
We know that each number is represented internally in the form of bits.
For numbers up to 15 we only need 4 bits. For numbers up to 10^9, we will need 32 bits. But let's focus on 4 bits, and later we can generalize it.
Now, assume if we had all the numbers from 1 to 15, then internally, we would have numbers like this (if we had them ordered):
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
But now we have two numbers missing. So our representation will look something like this (shown ordered for understanding but can be in any order):
(2MSD|2LSD)
00|01
00|10
00|11
-----
01|00
01|01
01|10
01|11
-----
10|00
missing=(10|01)
10|10
10|11
-----
11|00
11|01
missing=(11|10)
11|11
Now let's make a small array of 4 counters that holds the count of numbers for each value of the 2 most significant bits, i.e.
= [__,__,__,__]
00,01,10,11
Scan the bag from left to right and fill the above array such that each bin contains the count of numbers with that prefix. The result will be as under:
= [ 3, 4, 3, 3]
00,01,10,11
If all the numbers would have been present, it would have looked like this:
= [ 3, 4, 4, 4]
00,01,10,11
Thus we know that there are two numbers missing: one whose 2 most significant bits (MSD) are 10 and one whose 2 MSD are 11. Now scan the list again and fill out a 4-counter array for the 2 least significant bits (LSD). This time, only consider elements whose 2 MSD are 10. We will have the counter array as:
= [ 1, 0, 1, 1]
00,01,10,11
If all numbers with MSD=10 were present, we would have 1 in all the bins, but now we see that one is missing. Thus the number whose MSD=10 and LSD=01 is missing, which is 1001, i.e. 9.
Similarly, if we scan again but consider only elements whose MSD=11, we get MSD=11 and LSD=10 missing, which is 1110, i.e. 14.
= [ 1, 1, 0, 1]
00,01,10,11
Thus, we can find the missing numbers in a constant amount of space. We can generalize this for 100, 1000 or 10^9 or any set of numbers.
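Here's a small Python sketch of this two-level counting for the 1..15 example above (structure and names are mine):
bag = [8, 1, 2, 12, 4, 7, 5, 10, 11, 13, 15, 3, 6]   # 1..15 with 9 and 14 missing

hi = [0, 0, 0, 0]
for v in bag:
    hi[v >> 2] += 1                  # count by the 2 most significant bits
expected = [3, 4, 4, 4]              # counts if nothing were missing

missing = []
for g in range(4):
    if hi[g] == expected[g]:
        continue                     # no number with this prefix is missing
    lo = [0, 0, 0, 0]
    for v in bag:                    # one extra pass per deficient group
        if v >> 2 == g:
            lo[v & 3] += 1
    for low in range(4):
        v = (g << 2) | low
        if 1 <= v <= 15 and lo[low] == 0:
            missing.append(v)

print(missing)                       # -> [9, 14]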
References: Problem 1.6 in http://users.ece.utexas.edu/~adnan/afi-samples-new.pdf
Very nice problem. I'd go for using a set difference for Qk. A lot of programming languages even have support for it, like in Ruby:
missing = (1..100).to_a - bag
It's probably not the most efficient solution, but it's one I would use in real life if I was faced with such a task in this case (known boundaries, low boundaries). If the set of numbers were very large then I would consider a more efficient algorithm, of course, but until then the simple solution would be enough for me.
You could try using a Bloom filter. Insert each number in the bag into the filter, then iterate over the complete 1..N set, reporting each number not found. This may not find the answer in all scenarios, but might be a good enough solution.
I'd take a different approach to that question and probe the interviewer for more details about the larger problem he's trying to solve. Depending on the problem and the requirements surrounding it, the obvious set-based solution might be the right thing and the generate-a-list-and-pick-through-it-afterward approach might not.
For example, it might be that the interviewer is going to dispatch n messages and needs to know the k that didn't result in a reply and needs to know it in as little wall clock time as possible after the n-kth reply arrives. Let's also say that the message channel's nature is such that even running at full bore, there's enough time to do some processing between messages without having any impact on how long it takes to produce the end result after the last reply arrives. That time can be put to use inserting some identifying facet of each sent message into a set and deleting it as each corresponding reply arrives. Once the last reply has arrived, the only thing to be done is to remove its identifier from the set, which in typical implementations takes O(log k+1). After that, the set contains the list of k missing elements and there's no additional processing to be done.
This certainly isn't the fastest approach for batch processing pre-generated bags of numbers, because the whole thing runs in O((log 1 + log 2 + ... + log n) + (log n + log n-1 + ... + log k)). But it does work for any value of k (even if it's not known ahead of time), and in the example above it was applied in a way that minimizes the most critical interval.
This might sound stupid, but, in the first problem presented to you, you would have to see all the remaining numbers in the bag to actually add them up to find the missing number using that equation.
So, since you get to see all the numbers, just look for the number that's missing. The same goes for when two numbers are missing. Pretty simple I think. No point in using an equation when you get to see the numbers remaining in the bag.
You can motivate the solution by thinking about it in terms of symmetries (groups, in math language). No matter the order of the set of numbers, the answer should be the same. If you're going to use k functions to help determine the missing elements, you should be thinking about what functions have that property: symmetric. The function s_1(x) = x_1 + x_2 + ... + x_n is an example of a symmetric function, but there are others of higher degree. In particular, consider the elementary symmetric functions. The elementary symmetric function of degree 2 is s_2(x) = x_1 x_2 + x_1 x_3 + ... + x_1 x_n + x_2 x_3 + ... + x_(n-1) x_n, the sum of all products of two elements. Similarly for the elementary symmetric functions of degree 3 and higher. They are obviously symmetric. Furthermore, it turns out they are the building blocks for all symmetric functions.
You can build the elementary symmetric functions as you go by noting that s_2(x,x_(n+1)) = s_2(x) + s_1(x)(x_(n+1)). Further thought should convince you that s_3(x,x_(n+1)) = s_3(x) + s_2(x)(x_(n+1)) and so on, so they can be computed in one pass.
How do we tell which items were missing from the array? Think about the polynomial (z-x_1)(z-x_2)...(z-x_n). It evaluates to 0 if you put in any of the numbers x_i. Expanding the polynomial, you get z^n-s_1(x)z^(n-1)+ ... + (-1)^n s_n. The elementary symmetric functions appear here too, which is really no surprise, since the polynomial should stay the same if we apply any permutation to the roots.
So we can build the polynomial and try to factor it to figure out which numbers are not in the set, as others have mentioned.
Finally, if we are concerned about overflowing memory with large numbers (the nth symmetric polynomial will be of the order 100!), we can do these calculations mod p where p is a prime bigger than 100. In that case we evaluate the polynomial mod p and find that it again evaluates to 0 when the input is a number in the set, and it evaluates to a non-zero value when the input is a number not in the set. However, as others have pointed out, to get the values out of the polynomial in time that depends on k, not N, we have to factor the polynomial mod p.
I believe I have a O(k) time and O(log(k)) space algorithm, given that you have the floor(x) and log2(x) functions for arbitrarily big integers available:
You have a k-bit-long integer (hence the log8(k) space) to which you add x^2, where x is the next number you find in the bag: s = 1^2 + 2^2 + ... This takes O(N) time (which is not a problem for the interviewer). At the end you get j = floor(log2(s)), which is the biggest number you're looking for. Then s = s - j and you do the above again:
for (i = 0 ; i < k ; i++)
{
j = floor(log2(s));
missing[i] = j;
s -= j;
}
Now, you usually don't have floor and log2 functions for 2756-bit integers but instead for doubles. So? Simply, for each 2 bytes (or 1, or 3, or 4) you can use these functions to get the desired numbers, but this adds an O(N) factor to time complexity
Try to find the product of numbers from 1 to 50:
Let product, P1 = 1 x 2 x 3 x ............. 50
When you take out numbers one by one, multiply them so that you get the product P2. But two numbers are missing here, hence P2 < P1.
The product of the two missing terms: a x b = P1 / P2.
You already know the sum, a + b = S1.
From the above two equations, solve for a and b through a quadratic equation. a and b are your missing numbers.

Resources