Find the most common entry in an array - algorithm

You are given a 32-bit unsigned integer array with length up to 2^32, with the property that more than half of the entries in the array are equal to N, for some 32-bit unsigned integer N. Find N, looking at each number in the array only once and using at most 2 kB of memory.
Your solution must be deterministic, and guaranteed to find N.

Keep one integer for each bit, and increment this collection appropriately for each integer in the array.
At the end, some of the bits will have a count higher than half the length of the array - those bits determine N. Of course, the count may be higher than the number of times N occurred, but that doesn't matter. The important thing is that any bit which isn't part of N cannot be set in more than half of the entries (because N accounts for over half of them), and any bit which is part of N must be set in more than half of the entries (because it is set every time N occurs, plus possibly in other values).
(No code at the moment - about to lose net access. Hopefully the above is clear enough though.)
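A minimal Python sketch of the bit-counting idea (added for illustration, not part of the original answer; it assumes the values are 32-bit unsigned integers and that a single pass over the input is available):

def majority_by_bit_counts(numbers):
    counts = [0] * 32                  # one counter per bit: far below the 2 kB budget
    total = 0
    for x in numbers:                  # single pass over the input
        total += 1
        for bit in range(32):
            if x & (1 << bit):
                counts[bit] += 1
    # a bit belongs to N exactly when it is set in more than half of the entries
    return sum(1 << bit for bit in range(32) if counts[bit] > total // 2)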

Boyer and Moore's "Linear Time Majority Vote Algorithm" - go down the array maintaining your current guess at the answer.

You can do this with only two variables.
public uint MostCommon(UInt32[] numberList)
{
    uint suspect = 0;
    int suspicionStrength = -1;
    foreach (uint number in numberList)
    {
        if (number == suspect)
        {
            suspicionStrength++;
        }
        else
        {
            suspicionStrength--;
        }
        if (suspicionStrength <= 0)
        {
            suspect = number;
            // Reset the strength when adopting a new suspect; without this reset the
            // deficit inherited from earlier suspects never clears and the final
            // suspect can be wrong (try 5,5,5,2,2).
            suspicionStrength = 1;
        }
    }
    return suspect;
}
Make the first number the suspect number, and continue looping through the list. If the number matches, increase the suspicion strength by one; if it doesn't match, lower the suspicion strength by one. If the suspicion strength drops to 0, the current number becomes the suspect number and the strength resets to 1. This will not find the most common number in general, only a number that makes up more than 50% of the group. Resist the urge to add a check on whether suspicionStrength is greater than half the list length - it will always result in more total comparisons.
P.S. I have not tested this code - use it at your own peril.

Pseudo code (notepad C++ :-)) for Jon's algorithm:
size_t lNumbers = sizeof(arrNumbers) / sizeof(arrNumbers[0]);
uint64_t arrBits[32] = {0};  // 64-bit counters, since the length may exceed 2^31
for (size_t i = 0; i < lNumbers; i++)
    for (int bi = 0; bi < 32; bi++)
        arrBits[bi] += (arrNumbers[i] & (1u << bi)) ? 1 : 0;
uint32_t N = 0;
for (int bc = 0; bc < 32; bc++)
    if (arrBits[bc] > lNumbers / 2)
        N = N | (1u << bc);

Notice that if the sequence a0, a1, ..., an−1 contains a leader, then after removing a pair of elements of different values, the remaining sequence still has the same leader. Indeed, if we remove two different elements, then at most one of them could be the leader. The leader of the original sequence occurs more than n/2 − 1 = (n−2)/2 times in the new sequence, so it is still the leader of the new sequence of n − 2 elements.
Here is a Python implementation, with O(n) time complexity:
def goldenLeader(A):
    n = len(A)
    size = 0
    for k in range(n):
        if size == 0:
            size += 1
            value = A[k]
        else:
            if value != A[k]:
                size -= 1
            else:
                size += 1
    candidate = -1
    if size > 0:
        candidate = value
    leader = -1
    count = 0
    for k in range(n):
        if A[k] == candidate:
            count += 1
    if count > n // 2:
        leader = candidate
    return leader

This is a standard problem in streaming algorithms (where you have a huge, potentially infinite stream of data) and you have to calculate some statistics from this stream, passing through it only once.
Clearly you can approach it with hashing or sorting, but with a potentially infinite stream you will run out of memory. So you have to do something smart here.
The majority element is the element that occurs more than half the size of the array. This means that the majority element occurs more often than all other elements combined; in other words, if you count the number of times the majority element appears and subtract the count of all other elements, you get a positive number.
So if you count the occurrences of some element, subtract the count of all other elements, and end up with 0, then your original element can't be a majority element. This is the basis for a correct algorithm:
Have two variables, a counter and a possible element. Iterate over the stream: if the counter is 0, overwrite the possible element and set the counter to 1; if the number is the same as the possible element, increase the counter; otherwise decrease it. Python code:
def majority_element(arr):
    counter, possible_element = 0, None
    for i in arr:
        if counter == 0:
            possible_element, counter = i, 1
        elif i == possible_element:
            counter += 1
        else:
            counter -= 1
    return possible_element
It is clear that the algorithm is O(n), with a very small constant in front (around 3). It also looks like the space complexity is O(1), because only three variables are initialized. The catch is that one of these variables is a counter which can potentially grow up to n (when the array consists of the same number repeated), and storing the number n takes O(log n) space. So from a theoretical point of view it is O(n) time and O(log n) space. In practice, a 128-bit counter can count up to 2^128 elements, and that many elements in an array is unimaginably huge.
Also note that the algorithm works only if a majority element exists. If no such element exists, it will still return some number, which will surely be wrong. (It is easy to modify the algorithm to tell whether a majority element exists: make a second pass and count how often the returned candidate actually occurs.)
History channel: this algorithm was invented around 1982 by Boyer and Moore and is called the Boyer–Moore majority vote algorithm.

I have recollections of an algorithm for this, which might or might not fit the 2 kB rule. It might need to be rewritten with explicit stacks and the like to avoid breaking the memory limit through function calls, though this may be unnecessary since it only ever makes a logarithmic number of such calls. Anyhow, I have vague recollections from college of a recursive divide-and-conquer solution, the secret being that when you divide the group in half, at least one of the halves still has more than half of its values equal to the majority value. The basic rule when dividing is that you return two candidate top values, one of which is the top value and one of which is some other value (that may or may not be second place). I forget the algorithm itself.

Proof of correctness for buti-oxa / Jason Hernandez's answer, assuming Jason's answer is the same as buti-oxa's answer and both work the way the algorithm described should work:
We define the adjusted suspicion strength as being equal to the suspicion strength if the top value is currently selected, or minus the suspicion strength if it is not. Every time you read the top value, the adjusted suspicion strength increases by 1. Every time you read a wrong value, it drops by 1 or increases by 1, depending on whether the wrong value is currently selected. So the minimum possible ending adjusted suspicion strength is number-of[top values] - number-of[other values], which is positive because the top value fills more than half of the entries, and a positive adjusted strength is only possible if the top value is the one selected at the end.

Related

Sample number with equal probability which is not part of a set

I have a number n and a set of numbers S ⊆ [1..n] with size s (which is substantially smaller than n). I want to sample a number k ∈ [1..n] with equal probability, but the number is not allowed to be in the set S.
I am trying to solve the problem in at worst O(log n + s). I am not sure whether it's possible.
A naive approach is creating an array of numbers from 1 to n excluding all numbers in S and then pick one array element. This will run in O(n) and is not an option.
Another approach may be just generating random numbers in [1..n] and rejecting them if they are contained in S. This has no worst-case bound on the number of draws, since a forbidden number can be drawn again and again. But on average this might be a practical solution if s is substantially smaller than n.
Say s is sorted. Generate a random number between 1 and n-s, call it k. We've chosen the k'th element of {1,...,n} - s. Now we need to find it.
Use binary search on s to find the count of the elements of s <= k. This takes O(log |s|). Add this to k. In doing so, we may have passed or arrived at additional elements of s. We can adjust for this by incrementing our answer for each such element that we pass, which we find by checking the next larger element of s from the point we found in our binary search.
E.g., n = 100, s = {1,4,5,22}, and our random number is 3. So our approach should return the third element of [2,3,6,7,...,21,23,24,...,100], which is 6. Binary search finds that 1 element of s is at most 3, so we increment to 4. Now we compare to the next larger element of s, which is 4, so we increment to 5. Repeating this finds 5 in s, so we increment to 6. We check s once more, see that 6 isn't in it, so we stop.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100], which is 7. Binary search finds that 2 elements of s are at most 4, so we increment to 6. Now we compare to the next larger element of s, which is 5, so we increment to 7. We check s once more, see that the next element is greater than 7, so we stop.
If we assume that "s is substantially smaller than n" means |s| <= log(n), then we will increment at most log(n) times, and in any case at most s times.
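Here is a short Python sketch of this sorted-s approach (my own illustration, not the answerer's code; it assumes s is a sorted list of distinct values in 1..n):

import bisect
import random

def sample_not_in_sorted(n, s):
    k = random.randint(1, n - len(s))   # choose the k'th element of {1..n} minus s
    i = bisect.bisect_right(s, k)       # how many elements of s are <= k
    k += i
    # stepping over those elements may have moved us past further elements of s
    while i < len(s) and s[i] <= k:
        k += 1
        i += 1
    return k

For n = 100 and s = [1, 4, 5, 22], a draw of 3 walks 3 -> 4 -> 5 -> 6 and a draw of 4 walks 4 -> 6 -> 7, matching the examples above.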
If s is not sorted then we can do the following. Create an array of bits of size s. Generate k. Parse s and do two things: 1) count the number of elements < k, call this r. At the same time, set the i'th bit to 1 if k+i is in s (0 indexed so if k is in s then the first bit is set).
Now, increment k a number of times equal to r plus the number of set bits in the array whose index is at most the number of times incremented.
E.g., n = 100, s = {1,4,5,22}, and our random number is 4. So our approach should return the fourth element of [2,3,6,7,...,21,23,24,...,100] which is 7. We parse s and 1) note that 1 element is below 4 (r=1), and 2) set our array to [1, 1, 0, 0]. We increment once for r=1 and an additional two times for the two set bits, ending up at 7.
This is O(s) time, O(s) space.
This is an O(1) solution with O(s) initial setup that works by mapping each non-allowed number > s to an allowed number <= s.
Let S be the set of non-allowed values, S(i), where i = [1 .. s] and s = |S|.
Here's a two part algorithm. The first part constructs a hash table based only on S in O(s) time, the second part finds the random value k ∈ {1..n}, k ∉ S in O(1) time, assuming we can generate a uniform random number in a contiguous range in constant time. The hash table can be reused for new random values and also for new n (assuming S ⊂ { 1 .. n } still holds of course).
To construct the hash, H. First set j = 1. Then iterate over S(i), the elements of S. They do not need to be sorted. If S(i) > s, add the key-value pair (S(i), j) to the hash table, unless j ∈ S, in which case increment j until it is not. Finally, increment j.
To find a random value k, first generate a uniform random value in the range s + 1 to n, inclusive. If k is a key in H, then k = H(k). I.e., we do at most one hash lookup to ensure k is not in S.
Python code to generate the hash:
def substitute(S):
    H = dict()
    j = 1
    for s in S:
        if s > len(S):
            while j in S: j += 1
            H[s] = j
            j += 1
    return H
For the actual implementation to be O(s), one might need to convert S into something like a frozenset to ensure the test for membership is O(1), and also hoist the len(S) loop invariant out of the loop. Assuming the j in S test and the insertion into the hash (H[s] = j) are constant time, this should have complexity O(s).
The generation of a random value is simply:
import random

def myrand(n, s, H):
    k = random.randint(s + 1, n)
    return (H[k] if k in H else k)
If one is only interested in a single random value per S, then the algorithm can be optimized to improve the common case, while the worst case remains the same. This still requires S be in a hash table that allows for a constant time "element of" test.
def rand_not_in(n, S):
    k = random.randint(len(S) + 1, n)
    if k not in S: return k
    j = 1
    for s in S:
        if s > len(S):
            while j in S: j += 1
            if s == k: return j
            j += 1
Optimizations are: Only generate the mapping if the random value is in S. Don't save the mapping to a hash table. Short-circuit the mapping generation when the random value is found.
Actually, the rejection method seems like the practical approach.
Generate a number in 1...n and check whether it is forbidden; regenerate until the generated number is not forbidden.
The probability of a single rejection is p = s/n.
Thus the expected number of random number generations is 1 + p + p^2 + p^3 + ... which is 1/(1-p), which in turn is equal to n/(n-s).
Now, if s is much less than n, or even as large as n/2, this expected number is at most 2.
It would take s almost equal to n to make it infeasible in practice.
Multiply the expected time by log s if you use a tree set to check whether the number is in the set, or by just 1 (expected value again) if it is a hash set. So the average time is O(1) or O(log s), depending on the set implementation. There is also O(s) memory for storing the set, but unless the set is given in some special, implicit and concise way, I don't see how that can be avoided.
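A minimal sketch of the rejection method in Python (assuming the forbidden set supports O(1) membership tests, e.g. a set or frozenset):

import random

def sample_by_rejection(n, forbidden):
    while True:
        k = random.randint(1, n)
        if k not in forbidden:
            return k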
(Edit: As per comments, you do this only once for a given set.
If, additionally, we are out of luck, and the set is given as a plain array or list, not some fancier data structure, we get O(s) expected time with this approach, which still fits into the O(log n + s) requirement.)
If attacks against the unbounded algorithm are a concern (and only if they truly are), the method can include a fall-back algorithm for the cases when a certain fixed number of iterations didn't provide the answer.
Similarly to how IntroSort is QuickSort but falls back to HeapSort if the recursion depth gets too high (which is almost certainly a result of an attack resulting in quadratic QuickSort behavior).
Find all numbers that are in the forbidden set and less than or equal to n-s. Call it array A.
Find all numbers that are not in the forbidden set and greater than n-s. Call it array B. This can be done in O(s) if the set is sorted.
Note that the lengths of A and B are equal, and create the mapping map[A[i]] = B[i].
Generate a number t up to n-s. If map[t] exists, return it; otherwise return t.
This works in O(s) insertions into a map plus one lookup, which is either O(s) on average (hash map) or O(s log s) (tree map).
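A sketch of this mapping idea in Python (my own illustration; forbidden is assumed to be a set of s distinct values in 1..n with O(1) membership tests):

import random

def sample_by_remapping(n, forbidden):
    s = len(forbidden)
    A = [x for x in forbidden if x <= n - s]                        # forbidden and <= n-s
    B = [x for x in range(n - s + 1, n + 1) if x not in forbidden]  # allowed and > n-s
    remap = dict(zip(A, B))         # len(A) == len(B), so this pairs them up
    t = random.randint(1, n - s)
    return remap.get(t, t)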

Given a permutation's lexicographic number, is it possible to get any item in it in O(1)

I want to know whether the task explained below is even theoretically possible, and if so how I could do it.
You are given a space of N elements (i.e. all numbers between 0 and N-1.) Let's look at the space of all permutations on that space, and call it S. The ith member of S, which can be marked S[i], is the permutation with the lexicographic number i.
For example, if N is 3, then S is this list of permutations:
S[0]: 0, 1, 2
S[1]: 0, 2, 1
S[2]: 1, 0, 2
S[3]: 1, 2, 0
S[4]: 2, 0, 1
S[5]: 2, 1, 0
(Of course, when looking at a big N, this space becomes very large, N! to be exact.)
Now, I already know how to get the permutation by its index number i, and I already know how to do the reverse (get the lexicographic number of a given permutation.) But I want something better.
Some permutations can be huge by themselves. For example, if you're looking at N=10^20. (The size of S would be (10^20)! which I believe is the biggest number I ever mentioned in a Stack Overflow question :)
If you're looking at just a random permutation on that space, it would be so big that you wouldn't be able to store the whole thing on your harddrive, let alone calculate each one of the items by lexicographic number. What I want is to be able to do item access on that permutation, and also get the index of each item. That is, given N and i to specify a permutation, have one function that takes an index number and find the number that resides in that index, and another function that takes a number and finds in which index it resides. I want to do that in O(1), so I don't need to store or iterate over each member in the permutation.
Crazy, you say? Impossible? That may be. But consider this: A block cipher, like AES, is essentially a permutation, and it almost accomplishes the tasks I outlined above. AES has a block size of 16 bytes, meaning that N is 256^16 which is around 10^38. (The size of S, not that it matters, is a staggering (256^16)!, or around 10^85070591730234615865843651857942052838, which beats my recent record for "biggest number mentioned on Stack Overflow" :)
Each AES encryption key specifies a single permutation on N=256^16. That permutation couldn't be stored whole on your computer, because it has more members than there are atoms in the solar system. But it allows item access. By encrypting data using AES, you're looking at the data block by block, and for each block (member of range(N)) you output the encrypted block, which is the member of range(N) that sits at the index of the original block in the permutation. And when you're decrypting, you're doing the reverse (finding the index number of a block). I believe this is done in O(1); I'm not sure, but in any case it's very fast.
The problem with using AES or any other block cipher is that it limits you to very specific N, and it probably only captures a tiny fraction of the possible permutations, while I want to be able to use any N I like, and do item access on any permutation S[i] that I like.
Is it possible to get O(1) item access on a permutation, given size N and permutation number i? If so, how?
(If I'm lucky enough to get code answers here, I'd appreciate if they'll be in Python.)
UPDATE:
Some people pointed out the sad fact that the permutation number itself would be so huge, that just reading the number would make the task non-feasible. Then, I'd like to revise my question: Given access to the factoradic representation of a permutation's lexicographic number, is it possible to get any item in the permutation in O(as small as possible)?
The secret to doing this is to "count in base factorial".
In the same way that 134 = 1*10^2 + 3*10 + 4, we have 134 = 1*5! + 0*4! + 2*3! + 1*2! + 0*1! => 10210 in factorial notation (include 1!, exclude 0!). To represent all numbers up to N! you then need N factorial-base digits (the digit in position i can be as large as i). Up to a bit of confusion about what you call 0, this factorial representation is exactly the lexicographic number of a permutation.
You can use this insight to solve Euler Problem 24 by hand. So I will do that here, and you will see how to solve your problem. We want the millionth permutation of 0-9. In factorial representation we take 1000000 => 266251220. Now to convert that to the permutation, I take my digits 0,1,2,3,4,5,6,7,8,9; the first digit is 2, which picks the third item (it could have been 0), so I select 2 as the first element. Then I have a new list 0,1,3,4,5,6,7,8,9 and I take the seventh number, which is 7, etc., and I get 2783915604.
However, this assumes that you start your lexicographic ordering at 0, if you actually start it at one, you have to subtract 1 from it, which gives 2783915460. Which is indeed the millionth permutation of the numbers 0-9.
You can obviously reverse this procedure, and hence convert backwards and forwards easily between the lexiographic number and the permutation that it represents.
I am not entirely clear what it is that you want to do here, but understanding the above procedure should help. For example, it is clear that the lexicographic number represents an ordering which could be used as the key in a hash table. And you can order numbers by comparing digits left to right, so once you have inserted a number you never have to work out its factorial.
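As an illustration of the procedure described above (my own sketch, not the answerer's code), here is the index-to-permutation direction in Python:

from math import factorial

def index_to_permutation(i, elements):
    pool = sorted(elements)                    # remaining items, in order
    perm = []
    for pos in range(len(pool) - 1, -1, -1):
        digit, i = divmod(i, factorial(pos))   # next factorial-base digit
        perm.append(pool.pop(digit))           # pick the digit'th remaining item
    return perm

# Project Euler 24: the millionth permutation of 0-9 sits at 0-based index 999999
print(''.join(map(str, index_to_permutation(999999, range(10)))))  # 2783915460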
Your question is a bit moot, because your input size for an arbitrary permutation index has size log(N!) (assuming you want to represent all possible permutations) which is Theta(N log N), so if N is really large then just reading the input of the permutation index would take too long, certainly much longer than O(1). It may be possible to store the permutation index in such a way that if you already had it stored, then you could access elements in O(1) time. But probably any such method would be equivalent to just storing the permutation in contiguous memory (which also has Theta(N log N) size), and if you store the permutation directly in memory then the question becomes trivial assuming you can do O(1) memory access. (However you still need to account for the size of the bit encoding of the element, which is O(log N)).
In the spirit of your encryption analogy, perhaps you should specify a small SUBSET of permutations according to some property, and ask if O(1) or O(log N) element access is possible for that small subset.
Edit:
I misunderstood the question, but it was not in vain. My algorithms let me understand: the factoradic representation of a permutation's lexicographic number is almost the same as the permutation itself. In fact, the first digit of the factoradic representation is the same as the first element of the corresponding permutation (assuming your space consists of the numbers from 0 to N-1). Knowing this, there is not really a point in storing the index rather than the permutation itself. To see how to convert the lexicographic number into a permutation, read below.
See also this wikipedia link about Lehmer code.
Original post:
In the S space there are N elements that can fill the first slot, and for each of them there are (N-1)! permutations that start with it. So i / (N-1)! gives the first element (let's call it 'a'). The subset of S that starts with 'a' consists of (N-1)! elements: the possible permutations of the set N \ {a}. Now you can get the second element: it is (i % (N-1)!) / (N-2)!. Repeat the process and you have the permutation.
The reverse is just as simple. Start with i = 0. Take the second-to-last element of the permutation, make a set of the last two elements, and find that element's position in it (it is either the 0th or the 1st element); call this position j. Then i += j * 1!. Repeat the process, moving one position toward the front each time and multiplying by the next factorial (you could start with the last element too, but it is always the 0th element of its possibilities and contributes nothing).
Java-ish pseudo code:
find_by_index(List N, int i){
    String str = "";
    for(int l = N.length-1; l >= 0; l--){
        int pos = i / fact(l);
        str += N.get(pos);
        N.remove(pos);
        i %= fact(l);
    }
    return str;
}
find_index(String str){
    OrderedList N;
    int i = 0;
    for(int l = str.length-1; l >= 0; l--){
        String item = str.charAt(l);
        int pos = N.add(item);                 // insert in order, return the insertion position
        i += pos * fact(str.length - 1 - l);
    }
    return i;
}
find_by_index should run in O(n) assuming that N is pre-ordered and supports constant-time get/remove, while find_index is O(n log(n)) (where n is the size of the N space)
After some research on Wikipedia, I designed this algorithm:
def getPick(fact_num_list):
    """fact_num_list should be a list with the factorial number representation,
    getPick will return a tuple"""
    result = []  # Desired pick
    # This will hold all the numbers pickable; not actually a set, but a list instead
    inputset = list(range(len(fact_num_list)))
    for fnl in fact_num_list:
        result.append(inputset[fnl])
        del inputset[fnl]  # Make sure we can't pick the number again
    return tuple(result)
Obviously, this won't reach O(1), because we need to "pick" every number. Since it is a single for loop, and assuming all operations inside it are O(1), getPick will run in O(n).
If we need to convert from base 10 to factorial base, this is an aux function:
import math
def base10_baseFactorial(number):
    """Converts a base10 number into a factorial base number. Output is a list
    for better handling of units over 36! (after using all 0-9 and A-Z)"""
    loop = 1
    # Make sure loop! <= number
    while math.factorial(loop) <= number:
        loop += 1
    result = []
    if not math.factorial(loop) == number:
        loop -= 1  # Prevent dividing by a factorial larger than the number
    while loop > 0:
        denominator = math.factorial(loop)
        digit, number = divmod(number, denominator)  # the digit is the quotient; keep dividing the remainder
        result.append(digit)
        loop -= 1
    result.append(0)  # Don't forget the 0! digit as well!
    return result
Again, this will run in O(n) due to the while loops.
Summing up, the best time we can achieve is O(n).
PS: I'm not a native English speaker, so spelling and phrasing errors may appear. Apologies in advance, and let me know if you can't get around something.
All correct algorithms for accessing the kth item of a permutation stored in factoradic form must read the first k digits. This is because, regardless of the values of the other digits among the first k, it makes a difference whether an unread digit is 0 or takes its maximum value. This can be seen by tracing the canonical decoding program in two parallel executions.
For example, if we want to decode the third element of the three-digit factoradic 1?0, then for 100 that element is 2, while for 110 it is 0.

Write a program to find 100 largest numbers out of an array of 1 billion numbers

I recently attended an interview where I was asked "write a program to find 100 largest numbers out of an array of 1 billion numbers."
I was only able to give a brute force solution which was to sort the array in O(nlogn) time complexity and take the last 100 numbers.
Arrays.sort(array);
The interviewer was looking for a better time complexity, I tried a couple of other solutions but failed to answer him. Is there a better time complexity solution?
You can keep a priority queue of the 100 biggest numbers, iterate through the 1 billion numbers. Whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head of the queue and add the new number to the queue.
A priority queue implemented with a heap has insert + delete complexity of O(log K). (Where K = 100, the number of elements to find. N = 1 billion, the number of total elements in the array).
In the worst case you get billion*log2(100), which is better than billion*log2(billion) for an O(N log N) comparison-based sort (see footnote 1 below).
In general, if you need the largest K numbers from a set of N numbers, the complexity is O(N log K) rather than O(N log N); this can be very significant when K is small compared to N.
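A hedged Python sketch of this priority-queue approach (the function name is mine; heapq provides a binary min-heap whose smallest element sits at index 0):

import heapq

def top_k(numbers, k=100):
    heap = []                          # min-heap holding the k largest seen so far
    for x in numbers:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:              # bigger than the smallest of the current top k
            heapq.heapreplace(heap, x) # pop the smallest and push x in one O(log k) step
    return sorted(heap, reverse=True)

In practice the standard library already wraps this pattern as heapq.nlargest(100, numbers).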
The expected time of this priority queue algorithm is pretty interesting, since in each iteration an insertion may or may not occur.
The probability of the i'th number being inserted into the queue is the probability of a random variable being larger than at least i-K random variables from the same distribution (the first K numbers are automatically added to the queue). We can use order statistics (see link) to calculate this probability.
For example, assume the numbers were selected uniformly at random from [0, 1]. The expected value of the (i-K)th smallest number (out of i numbers) is roughly (i-K)/i, and the chance of a new random variable being larger than this value is 1 - (i-K)/i = K/i.
Thus, the expected number of insertions is:
sum for i = K+1 to N of K/i ≈ K * ln(N/K)
And the expected running time can be expressed as:
K + (N - K) + K * ln(N/K) * log(K)/2
(K time to generate the queue with the first K elements, then N-K comparisons, and the expected number of insertions as described above, each taking an average of log(K)/2 time.)
Note that when N is very large compared to K, this expression is much closer to N than to N log K. This is somewhat intuitive: as in the case of the question, even after 10,000 iterations (which is very small compared to a billion), the chance of a number being inserted into the queue is very small.
But we don't know that the array values are uniformly distributed. They might trend towards increasing, in which case most or all numbers will be new candidates for the set of 100 largest numbers seen. The worst case for this algorithm is O(N log K).
Or if they trend towards decreasing, most of the largest 100 numbers will be very early, and our best-case run time is essentially O(N + K log K), which is just O(N) for K much smaller than N.
Footnote 1: O(N) integer sorting / histogramming
Counting Sort or Radix Sort are both O(N), but often have larger constant factors that make them worse than comparison sorts in practice. In some special cases they're actually quite fast, primarily for narrow integer types.
For example, Counting Sort does well if the numbers are small. 16-bit numbers would only need an array of 2^16 counters. And instead of actually expanding back into a sorted array, you could just scan the histogram you build as part of Counting Sort.
After histogramming an array, you can quickly answer queries for any order statistic (e.g. the 99 largest numbers, or the 200th to 100th largest numbers). 32-bit numbers would scatter the counts over a much larger array or hash table of counters, potentially needing 16 GiB of memory (4 bytes for each of 2^32 counters), and on real CPUs probably incurring lots of TLB and cache misses, unlike an array of 2^16 elements where L2 cache would typically hit.
Similarly, Radix Sort could look at only the top buckets after a first pass. But the constant factors may still be larger than log K, depending on K.
Note that the size of each counter is large enough to not overflow even if all N integers are duplicates. 1 billion is somewhat below 2^30, so a 30-bit unsigned counter would be sufficient. And a 32-bit signed or unsigned integer is just fine.
If you had many more, you might need 64-bit counters, taking twice the memory footprint to initialize to zero and to randomly access. Or a sentinel value for the few counters that overflow a 16 or 32-bit integer, to indicate that the rest of the count is elsewhere (in a small dictionary such as a hash table mapping to 64-bit counters).
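A small Python sketch of the histogram idea for narrow integer types (my own illustration; it assumes the values lie in 0 .. max_value-1):

def top_k_by_histogram(numbers, k=100, max_value=1 << 16):
    counts = [0] * max_value                 # one counter per possible value
    for x in numbers:
        counts[x] += 1
    result = []
    for v in range(max_value - 1, -1, -1):   # scan the histogram from the top down
        take = min(counts[v], k - len(result))
        result.extend([v] * take)
        if len(result) == k:
            break
    return result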
If this is asked in an interview, the interviewer probably wants to see your problem solving process, not just your knowledge of algorithms.
The description is quite general, so maybe you can ask him about the range or meaning of these numbers to make the problem clear. Doing this may impress an interviewer. If, for example, these numbers stand for people's ages, then it's a much easier problem. With the reasonable assumption that nobody alive is older than 200, you can use an integer array of size 200 (maybe 201) to count the number of people of each age in just one iteration, where the index means the age. After this it's a piece of cake to find the 100 largest numbers. By the way, this algorithm is called counting sort.
Anyway, making the question more specific and clearer is good for you in an interview.
You can iterate over the numbers, which takes O(n).
Whenever you find a value greater than the current minimum, add the new value to a circular queue of size 100.
The minimum of that circular queue is your new comparison value. Keep adding to that queue; if it is full, extract the minimum from the queue.
I realized that this is tagged with 'algorithm', but will toss out some other options, since it probably should also be tagged 'interview'.
What is the source of the 1 billion numbers? If it is a database then 'select value from table order by value desc limit 100' would do the job quite nicely - there might be dialect differences.
Is this a one-off, or something that will be repeated? If repeated, how frequently? If it is a one-off and the data are in a file, then 'cat srcfile | sort (options as needed) | head -100' will have you quickly doing productive work that you are getting paid to do while the computer handles this trivial chore.
If it is repeated, I would advise picking any decent approach to get the initial answer and storing / caching the results so that you could continuously report the top 100.
Finally, there is this consideration. Are you looking for an entry-level job and interviewing with a geeky manager or future co-worker? If so, then you can toss out all manner of approaches describing the relative technical pros and cons. If you are looking for a more managerial job, then approach it like a manager would, concerned with the development and maintenance costs of the solution, say "thank you very much", and leave if all the interviewer wants to focus on is CS trivia. He and you would be unlikely to have much advancement potential there.
Better luck on the next interview.
My immediate reaction for this would be to use a heap, but there is way to use QuickSelect without keeping all of the input values on hand at any one time.
Create an array of size 200 and fill it up with the first 200 input values. Run QuickSelect and discard the low 100, leaving you with 100 free places. Read in the next 100 input values and run QuickSelect again. Continue until you have run though the entire input in batches of 100.
At the end you have the top 100 values. For N values you have run QuickSelect roughly N/100 times. Each Quickselect cost about 200 times some constant, so the total cost is 2N times some constant. This looks linear in the size of the input to me, regardless of the parameter size that I am hardwiring to be 100 in this explanation.
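A rough Python sketch of this batching idea (my own code, written under the assumption that a Hoare-style quickselect is acceptable; it keeps at most 2*k values in memory at a time):

import random

def _move_k_largest_to_tail(buf, k):
    # partially reorder buf in place so that its k largest values occupy buf[-k:]
    lo, hi = 0, len(buf) - 1
    target = len(buf) - k
    while lo < hi:
        pivot = buf[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:
            while buf[i] < pivot: i += 1
            while buf[j] > pivot: j -= 1
            if i <= j:
                buf[i], buf[j] = buf[j], buf[i]
                i += 1
                j -= 1
        if target <= j:
            hi = j
        elif target >= i:
            lo = i
        else:
            break

def top_k_batched(numbers, k=100):
    buf = []
    for x in numbers:
        buf.append(x)
        if len(buf) == 2 * k:
            _move_k_largest_to_tail(buf, k)
            buf = buf[-k:]             # discard the lower half
    return sorted(buf, reverse=True)[:k]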
You can use the Quickselect algorithm to find the number at the (in sorted order) index [billion-101],
and then iterate over the numbers to find the numbers that are bigger than that number.
array = {...the billion numbers...}
result[100];
pivot = QuickSelect(array, billion-101);  // O(N)
for (i = 0; i < billion; i++)             // O(N)
    if (array[i] >= pivot)
        result.add(array[i]);
This algorithm's time is: 2 * O(N) = O(N) (average case performance)
The second option, as Thomas Jungblut suggests, is:
Use a heap: building the max heap will take O(N); then the 100 largest numbers can be popped off the top of the heap, 100 * O(log(N)) in total.
This algorithm's time is: O(N) + 100 * O(log(N)) = O(N)
Although the other quickselect solution has been downvoted, the fact remains that quickselect will find the solution faster than using a queue of size 100. Quickselect has an expected running time of 2n + o(n), in terms of comparisons. A very simple implementation would be
array = input array of length n
r = Quickselect(array, n-100)
result = array of length 100
for (i = 1 to n)
    if (array[i] > r)
        add array[i] to result
This will take 3n + o(n) comparisons on average. Moreover, it can be made more efficient using the fact that quickselect will leave the largest 100 items in the array in the 100 right-most locations. So in fact, the running time can be improved to 2n+o(n).
There is the issue that this is expected running time, and not worst case, but by using a decent pivot selection strategy (e.g. pick 21 elements at random, and choose the median of those 21 as pivot), then the number of comparisons can be guaranteed with high probability to be at most (2+c)n for an arbitrarily small constant c.
In fact, by using an optimized sampling strategy (e.g. sample sqrt(n) elements at random, and choose the 99th percentile), the running time can be gotten down to (1+c)n + o(n) for arbitrarily small c (assuming that K, the number of elements to be selected is o(n)).
On the other hand, using a queue of size 100 will require O(log(100)n) comparisons, and log base 2 of 100 is approximately equal to 6.6.
If we think of this problem in the more abstract sense of choosing the largest K elements from an array of size N, where K=o(N) but both K and N go to infinity, then the running time of the quickselect version will be O(N) and the queue version will be O(N log K), so in this sense quickselect is also asymptotically superior.
In comments, it was mentioned that the queue solution will run in expected time N + K log N on a random input. Of course, the random input assumption is never valid unless the question states it explicitly. The queue solution could be made to traverse the array in a random order, but this will incur the additional cost of N calls to a random number generator as well as either permuting the entire input array or else allocating a new array of length N containing the random indices.
If the problem doesn't allow you to move around the elements in the original array, and the cost of allocating memory is high so duplicating the array is not an option, that is a different matter. But strictly in terms of running time, this is the best solution.
Take the first 100 numbers of the billion and sort them. Now just iterate through the billion; if a source number is higher than the smallest of the 100, insert it in sort order. What you end up with is something much closer to O(n) over the size of the set.
Two options:
(1) Heap (priorityQueue)
Maintain a min-heap of size 100. Traverse the array. Whenever an element is larger than the first (smallest) element in the heap, replace it.
Insert an element into the heap: O(log 100)
Compare against the first element: O(1)
There are n elements in the array, so the total is O(n log 100), which is O(n).
(2) Map-reduce model.
This is very similar to word count example in hadoop.
Map job: count every element's frequency (the number of times it appears).
Reduce: get the top K elements.
Usually, I would give the recruiter both answers. Give them whatever they like. Of course, the map-reduce coding would be laborious, because you have to know all the exact parameters. No harm in practicing it.
Good Luck.
A very easy solution would be to iterate through the array 100 times, which is O(100 * n) = O(n).
Each time, pull out the largest number (and change its value to the minimum value so that you don't see it in the next iteration, or keep track of the indexes of previous answers; by keeping track of indexes, the original array can contain multiple copies of the same number). After 100 iterations, you have the 100 largest numbers.
The simple solution would be using a priority queue, adding the first 100 numbers to the queue and keeping track of the smallest number in the queue, then iterating through the other billion numbers, and each time we find one that is larger than the largest number in the priority queue, we remove the smallest number, add the new number, and again keep track of the smallest number in the queue.
If the numbers were in random order, this would work beautifully, because as we iterate through a billion random numbers, it would be very rare that the next number is among the 100 largest so far. But the numbers might not be random: if the array were already sorted in ascending order, we would insert every element into the priority queue.
So we pick, say, 100,000 random numbers from the array first. To avoid random access, which might be slow, we add, say, 400 random groups of 250 consecutive numbers. With that random selection, we can be quite sure that very few of the remaining numbers are in the top hundred, so the execution time will be very close to that of a simple loop comparing a billion numbers to some maximum value.
This question would be answered with N log(100) complexity (instead of N log N) with just a couple of lines of C++.
std::vector<int> myvector = ...; // Define your 1 billion numbers.
                                 // Assumed integer just for concreteness.
std::partial_sort(myvector.begin(), myvector.begin() + 100, myvector.end(),
                  std::greater<int>()); // sort descending, so the biggest land at the front
The final answer would be a vector where the first 100 elements are guaranteed to be the 100 biggest numbers of your array, while the remaining elements are unordered. (Note that the std::greater<int>() comparator is needed; the default ascending partial_sort would put the 100 smallest numbers first.)
C++ STL (standard library) is quite handy for this kind of problem.
Note: I am not saying that this is the optimal solution, but it would have saved your interview.
I see a lot of O(N) discussions, so I propose something different just for the thought exercise.
Is there any known information about the nature of these numbers? If it's random in nature, then go no further and look at the other answers. You won't get any better results than they do.
However! See if whatever list-populating mechanism populated that list in a particular order. Are the values in a well-defined pattern where you can know with certainty that the largest magnitude of numbers will be found in a certain region of the list or on a certain interval? There may be a pattern to it. If so, the number of records you have to check can be reduced significantly: for example, if they are guaranteed to be in some sort of normal distribution with the characteristic hump in the middle, always have repeating upward trends among defined subsets, have a prolonged spike at some time T in the middle of the data set (like an incidence of insider trading or equipment failure), or maybe just have a "spike" every Nth number as in analysis of forces after a catastrophe.
There's some food for thought anyway. Maybe this will help you give future interviewers a thoughtful answer. I know I would be impressed if someone asked me such a question in response to a problem like this - it would tell me that they are thinking of optimization. Just recognize that there may not always be a possibility to optimize.
Inspired by #ron teller's answer, here is a barebones C program to do what you want.
#include <stdlib.h>
#include <stdio.h>
#define TOTAL_NUMBERS 1000000000
#define N_TOP_NUMBERS 100
int
compare_function(const void *first, const void *second)
{
    int a = *((int *) first);
    int b = *((int *) second);
    if (a > b){
        return 1;
    }
    if (a < b){
        return -1;
    }
    return 0;
}

int
main(int argc, char ** argv)
{
    if(argc != 2){
        printf("please supply a path to a binary file containing 1000000000"
               "integers of this machine's wordlength and endianness\n");
        exit(1);
    }
    FILE * f = fopen(argv[1], "r");
    if(!f){
        exit(1);
    }
    int top100[N_TOP_NUMBERS] = {0};
    int sorts = 0;
    for (int i = 0; i < TOTAL_NUMBERS; i++){
        int number;
        int ok;
        ok = fread(&number, sizeof(int), 1, f);
        if(!ok){
            printf("not enough numbers!\n");
            break;
        }
        if(number > top100[0]){
            sorts++;
            top100[0] = number;
            qsort(top100, N_TOP_NUMBERS, sizeof(int), compare_function);
        }
    }
    printf("%d sorts made\n"
           "the top 100 integers in %s are:\n",
           sorts, argv[1]);
    for (int i = 0; i < N_TOP_NUMBERS; i++){
        printf("%d\n", top100[i]);
    }
    fclose(f);
    exit(0);
}
On my machine (core i3 with a fast SSD) it takes 25 seconds, and 1724 sorts.
I generated a binary file with dd if=/dev/urandom count=1000000000 bs=1 for this run.
Obviously, there are performance issues with reading only 4 bytes at a time from disk, but this is for example's sake. On the plus side, very little memory is needed.
You can do it in O(n) time. Just iterate through the list and keep track of the 100 biggest numbers you've seen at any given point and the minimum value in that group. When you find a new number bigger than the smallest of your 100, replace it and update your new minimum of the 100 (this may take a constant time of 100 to determine each time you do it, but it does not affect the overall analysis).
The simplest solution is to scan the billion-number array and hold the 100 largest values found so far in a small array buffer without any sorting, remembering the smallest value of this buffer. First I thought this method was proposed by fordprefect, but in a comment he said that he assumed the 100-number data structure was implemented as a heap. Whenever a new number is found that is larger than the minimum in the buffer, the minimum is overwritten by the new value and the buffer is searched for its current minimum again. If the numbers in the billion-number array are randomly distributed, most of the time the value from the large array is compared to the minimum of the small array and discarded. Only for a very, very small fraction of numbers must the value be inserted into the small array. So the difference in the cost of manipulating the data structure holding the small numbers can be neglected. For a small number of elements it is hard to determine whether the usage of a priority queue is actually faster than using my naive approach.
I want to estimate the number of inserts into the small 100-element array buffer when the 10^9-element array is scanned. The program scans the first 1000 elements of this large array and has to insert at most 1000 elements into the buffer. The buffer then contains 100 of the 1000 elements scanned, that is 0.1 of the elements scanned. So we assume that the probability that a value from the large array is larger than the current minimum of the buffer is about 0.1; such an element has to be inserted into the buffer. Now the program scans the next 10^4 elements from the large array. The minimum of the buffer will increase every time a new element is inserted. We estimated that the ratio of elements larger than our current minimum is about 0.1, so there are at most 0.1*10^4 = 1000 elements to insert (actually the expected number of elements inserted into the buffer will be smaller). After the scan of these 10^4 elements, the fraction of the numbers in the buffer will be about 0.01 of the elements scanned so far. So when scanning the next 10^5 numbers, we assume that not more than 0.01*10^5 = 1000 will be inserted into the buffer. Continuing this argument, we have inserted about 7000 values after scanning 1000 + 10^4 + 10^5 + ... + 10^9 ~ 10^9 elements of the large array.
So when scanning an array with 10^9 elements of random size, we expect not more than 10^4 (7000 rounded up) insertions into the buffer. After each insertion into the buffer the new minimum must be found. If the buffer is a simple array, we need 100 comparisons to find the new minimum. If the buffer is another data structure (like a heap), we need at least 1 comparison to find the minimum. To compare the elements of the large array we need 10^9 comparisons. So all in all we need about 10^9 + 100*10^4 = 1.001 * 10^9 comparisons when using an array as buffer, and at least 1.000 * 10^9 comparisons when using another type of data structure (like a heap). So using a heap brings only a gain of 0.1% if performance is determined by the number of comparisons.
But what is the difference in execution time between inserting an element in a 100 element heap and replacing an element in an 100 element array and finding its new minimum?
At the theoretical level: how many comparisons are needed for inserting into a heap? I know it is O(log(n)), but how large is the constant factor?
At the machine level: what is the impact of caching and branch prediction on the execution time of a heap insert versus a linear search in an array?
At the implementation level: what additional costs are hidden in a heap data structure supplied by a library or a compiler?
I think these are some of the questions that have to be answered before one can try to estimate the real difference between the performance of a 100 element heap or a 100 element array. So it would make sense to make an experiment and measure the real performance.
Although in this question we should search for the top 100 numbers, I will generalize things and write x. Still, I will treat x as a constant value.
Algorithm Biggest x elements from n:
I will call the return value LIST. It is a set of x elements (in my opinion it should be a linked list).
The first x elements are taken from the pool "as they come" and sorted in LIST (this is done in constant time since x is treated as constant: O(x log(x)) time).
For every element that comes next, we check whether it is bigger than the smallest element in LIST, and if it is, we pop out the smallest and insert the current element into LIST. Since LIST is an ordered list, every element finds its place in logarithmic time (binary search), and since it is an ordered list, insertion is not a problem. Every step is done in constant time (O(log(x)) time).
So, what is the worst case scenario?
x log(x) + (n-x)(log(x)+1) = nlog(x) + n - x
So that is O(n) time for the worst case. The +1 is the check whether the number is greater than the smallest one in LIST. The expected time for the average case will depend on the mathematical distribution of those n elements.
Possible improvements
This algorithm can be slightly improved for the worst-case scenario, but IMHO (I cannot prove this claim) that will degrade the average behavior. The asymptotic behavior will be the same.
The improvement is that we will not check whether the element is greater than the smallest. For each element we will try to insert it, and if it is smaller than the smallest we will disregard it. Although that sounds preposterous, if we regard only the worst-case scenario, we will have
x log(x) + (n-x)log(x) = nlog(x)
operations.
For this use case I don't see any further improvements. Yet you must ask yourself: what if I have to do this more than log(n) times, and for different values of x? Obviously we would then sort the array in O(n log(n)) and take our x elements whenever we need them.
Finding the top 100 out of a billion numbers is best done using min-heap of 100 elements.
First prime the min-heap with the first 100 numbers encountered. The min-heap will store the smallest of those first 100 numbers at the root (top).
Now as you go along the rest of the numbers, only compare them with the root (the smallest of the 100).
If the new number encountered is larger than the root of the min-heap, replace the root with that number; otherwise ignore it.
As part of inserting the new number into the min-heap, the smallest number in the heap will come to the top (root).
Once we have gone through all the numbers we will have the largest 100 numbers in the min-heap.
I have written up a simple solution in Python in case anyone is interested. It uses the bisect module and a temporary return list which it keeps sorted. This is similar to a priority queue implementation.
import bisect
def kLargest(A, k):
    '''returns list of k largest integers in A'''
    ret = []
    for i, a in enumerate(A):
        # For first k elements, simply construct sorted temp list
        # It is treated similarly to a priority queue
        if i < k:
            bisect.insort(ret, a)  # properly inserts a into sorted list ret
        # Iterate over rest of array
        # Replace and update return array when more optimal element is found
        else:
            if a > ret[0]:
                del ret[0]             # pop min element off queue
                bisect.insort(ret, a)  # properly inserts a into sorted list ret
    return ret
Usage with 100,000,000 elements and worst-case input which is a sorted list:
>>> from so import kLargest
>>> kLargest(range(100000000), 100)
[99999900, 99999901, 99999902, 99999903, 99999904, 99999905, 99999906, 99999907,
99999908, 99999909, 99999910, 99999911, 99999912, 99999913, 99999914, 99999915,
99999916, 99999917, 99999918, 99999919, 99999920, 99999921, 99999922, 99999923,
99999924, 99999925, 99999926, 99999927, 99999928, 99999929, 99999930, 99999931,
99999932, 99999933, 99999934, 99999935, 99999936, 99999937, 99999938, 99999939,
99999940, 99999941, 99999942, 99999943, 99999944, 99999945, 99999946, 99999947,
99999948, 99999949, 99999950, 99999951, 99999952, 99999953, 99999954, 99999955,
99999956, 99999957, 99999958, 99999959, 99999960, 99999961, 99999962, 99999963,
99999964, 99999965, 99999966, 99999967, 99999968, 99999969, 99999970, 99999971,
99999972, 99999973, 99999974, 99999975, 99999976, 99999977, 99999978, 99999979,
99999980, 99999981, 99999982, 99999983, 99999984, 99999985, 99999986, 99999987,
99999988, 99999989, 99999990, 99999991, 99999992, 99999993, 99999994, 99999995,
99999996, 99999997, 99999998, 99999999]
It took about 40 seconds to calculate this for 100,000,000 elements so I'm scared to do it for 1 billion. To be fair though, I was feeding it the worst-case input (ironically an array that is already sorted).
Time ~ O(100 * N)
Space ~ O(100 + N)
Create an empty list of 100 empty slots
For every number in the input list:
    If the number is smaller than the first one (the current minimum), skip
    Otherwise replace the first one with this number
    Then push the number up through adjacent swaps, until it is smaller than the next one
Return the list
Note: if log(input-list.size) + c < 100, then the optimal way is to sort the input list and take its top 100 items.
Another O(n) algorithm -
The algorithm finds the largest 100 by elimination
Consider all the numbers in their binary representation, starting from the most significant bit. Finding whether the MSB is 1 can be done by ANDing with an appropriate mask. If more than 100 of the numbers have a 1 in this bit, eliminate the numbers that have a 0 there. Now, among the remaining numbers, proceed with the next most significant bit. Keep a count of the number of remaining numbers after each elimination and proceed as long as this count is greater than 100.
The masking operations can be done in parallel on GPUs.
I would find out who had the time to put a billion numbers into an array and fire him. Must work for government. At least if you had a linked list you could insert a number into the middle without moving half a billion to make room. Even better a Btree allows for a binary search. Each comparison eliminates half of your total. A hash algorithm would allow you to populate the data structure like a checkerboard but not so good for sparse data. As it is your best bet is to have a solution array of 100 integers and keep track of the lowest number in your solution array so you can replace it when you come across a higher number in the original array. You would have to look at every element in the original array assuming it is not sorted to begin with.
I know this might get buried, but here is my idea for a variation on a radix MSD.
pseudo-code:
//billion is the array of 1 billion numbers
int[] billion = getMyBillionNumbers();
//this assumes these are 32-bit integers and we are using hex digits
int[][] mynums = int[8][16];

for number in billion
    putInTop100Array(number)

function putInTop100Array(number){
    //basically if we got past all the digits successfully
    if(number == null)
        return true;
    msdIdx = getMsdIdx(number);
    msd = getMsd(number);
    //check if the digit above where we are is already full
    if(mynums[msdIdx][msd+1] > 99) {
        return false;
    } else if(putInTop100Array(removeMSD(number))){
        mynums[msdIdx][msd]++;
        //we've found 100 digits here, no need to keep looking below where we are
        if(mynums[msdIdx][msd] > 99){
            for(int i = 0; i < msd; i++){
                //making it 101 just so we can tell the difference
                //between numbers where we actually found 101, and
                //where we just set it
                mynums[msdIdx][i] = 101;
            }
        }
        return true;
    }
    return false;
}
The function getMsdIdx(int num) would return the index of the most significant (non-zero) digit. The function getMsd(int num) would return the most significant digit. The function removeMSD(int num) would remove the most significant digit from a number and return the number (or return null if there was nothing left after removing the most significant digit).
Once this is done, all that is left is traversing mynums to grab the top 100 digits. This would be something like:
int[] nums = int[100];
int idx = 0;
for(int i = 7; i >= 0; i--){
    int timesAdded = 0;
    for(int j = 15; j >= 0 && timesAdded < 100; j--){
        for(int k = mynums[i][j]; k > 0; k--){
            nums[idx] += j;
            timesAdded++;
            idx++;
        }
    }
}
I should note that although the above looks like it has high time complexity, it will really only be around O(7*100).
A quick explanation of what this is trying to do:
Essentially this system is trying to use every digit in a 2d-array based upon the index of the digit in the number, and the digit's value. It uses these as indexes to keep track of how many numbers of that value have been inserted in the array. When 100 has been reached, it closes off all "lower branches".
The time of this algorithm is something like O(billion*log(16)*7)+O(100). I could be wrong about that. Also it is very likely this needs debugging as it is kinda complex and I just wrote it off the top of my head.
EDIT: Downvotes without explanation are not helpful. If you think this answer is incorrect, please leave a comment why. Pretty sure that StackOverflow even tells you to do so when you downvote.
Managing a separate list is extra work and you have to move things around the whole list every time you find another replacement. Just qsort it and take the top 100.
Use nth_element to get the 100th largest element in O(n).
Iterate a second time (just once more) and output every element that is greater than this specific element.
Please note especially that the second step is easy to compute in parallel! And it will also be efficient when you need the million biggest elements.
It's a question from Google or some other industry giant. Maybe the following code is the right answer expected by your interviewer.
The time cost and space cost depend on the maximum number in the input array. For a 32-bit int array input, the maximum space cost is 4 * 125M bytes, and the time cost is 5 * billion.
public class TopNumber {
    public static void main(String[] args) {
        final int input[] = {2389,8922,3382,6982,5231,8934
                ,4322,7922,6892,5224,4829,3829
                ,6892,6872,4682,6723,8923,3492};
        //One int(4 bytes) hold 32 = 2^5 value,
        //About 4 * 125M Bytes
        //int sort[] = new int[1 << (32 - 5)];
        //Allocate small array for local test
        int sort[] = new int[1000];
        //Set all bit to 0
        for(int index = 0; index < sort.length; index++){
            sort[index] = 0;
        }
        for(int number : input){
            sort[number >>> 5] |= (1 << (number % 32));
        }
        int topNum = 0;
        outer:
        for(int index = sort.length - 1; index >= 0; index--){
            if(0 != sort[index]){
                for(int bit = 31; bit >= 0; bit--){
                    if(0 != (sort[index] & (1 << bit))){
                        System.out.println((index << 5) + bit);
                        topNum++;
                        if(topNum >= 3){
                            break outer;
                        }
                    }
                }
            }
        }
    }
}
I did my own code; not sure if it's what the "interviewer" is looking for:
private static final int MAX = 100;

PriorityQueue<Integer> queue = new PriorityQueue<>(MAX);
queue.add(array[0]);
for (int i = 1; i < array.length; i++)
{
    if (queue.size() < MAX)            // still filling the first 100 slots
    {
        queue.add(array[i]);
    }
    else if (queue.peek() < array[i])  // larger than the smallest of the current top 100
    {
        queue.poll();
        queue.add(array[i]);
    }
}
Possible improvements.
If the file contains 1 billion numbers, reading it could take a really long time...
To improve on this you can:
Split the file into n parts, create n threads, make each thread look for the 100 biggest numbers in its part of the file (using the priority queue), and finally merge the outputs of all threads to get the overall 100 biggest numbers; a sketch of this threaded variant follows below.
Use a cluster for such a task, with a solution like Hadoop. Here you can split the file even further and get the output quicker for a file of 1 billion (or 10^12) numbers.
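A minimal sketch of the threaded variant, assuming the numbers are already in an in-memory int[] rather than a file; the class name ParallelTop100, the helper topK and the random test data are just for illustration:
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelTop100 {
    static final int K = 100;

    // Top K of one slice, kept in a min-heap of size K (smallest kept value on top).
    static PriorityQueue<Integer> topK(int[] a, int from, int to) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(K);
        for (int i = from; i < to; i++) {
            if (heap.size() < K) heap.add(a[i]);
            else if (heap.peek() < a[i]) { heap.poll(); heap.add(a[i]); }
        }
        return heap;
    }

    public static void main(String[] args) throws Exception {
        int[] data = new int[10_000_000];
        Random rnd = new Random(42);
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextInt();

        int parts = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        List<Future<PriorityQueue<Integer>>> futures = new ArrayList<>();
        int chunk = (data.length + parts - 1) / parts;
        for (int p = 0; p < parts; p++) {
            final int from = p * chunk;
            final int to = Math.min(data.length, from + chunk);
            futures.add(pool.submit(() -> topK(data, from, to)));
        }

        // Merge step: the overall top 100 is the top 100 of the partial results.
        PriorityQueue<Integer> result = new PriorityQueue<>(K);
        for (Future<PriorityQueue<Integer>> f : futures) {
            for (int x : f.get()) {
                if (result.size() < K) result.add(x);
                else if (result.peek() < x) { result.poll(); result.add(x); }
            }
        }
        pool.shutdown();
        System.out.println("Top 100 (unsorted): " + new ArrayList<>(result));
    }
}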
First take 1000 elements and add them to a max heap. Now take out the top 100 elements and store them somewhere. Then pick the next 900 elements from the file and add them to the heap along with those last 100 highest elements.
Keep repeating this process of picking the top 100 elements from the heap and adding 900 elements from the file.
The final pick of 100 elements will give us the maximum 100 elements from the billion numbers.
The complexity is O(N).
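A sketch of this recycling idea, assuming the numbers arrive through an Iterator<Integer> (standing in for the file); the class name RecyclingTop100 and the test data in main are just for illustration. The key point is that the up-to-900 elements discarded in each round are each smaller than all 100 kept elements, so they can never be in the global top 100:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

public class RecyclingTop100 {
    static int[] top100(Iterator<Integer> numbers) {
        // max-heap of up to 1000 candidates
        PriorityQueue<Integer> heap = new PriorityQueue<>(1000, Collections.reverseOrder());
        while (heap.size() < 1000 && numbers.hasNext()) heap.add(numbers.next());
        while (true) {
            int[] best = new int[Math.min(100, heap.size())];
            for (int i = 0; i < best.length; i++) best[i] = heap.poll(); // current 100 largest
            if (!numbers.hasNext()) return best;  // input exhausted: these are the global top 100
            heap.clear();                         // everything left is dominated by 'best'
            for (int b : best) heap.add(b);       // keep the 100 survivors
            for (int i = 0; i < 900 && numbers.hasNext(); i++) heap.add(numbers.next()); // plus 900 fresh
        }
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        Random rnd = new Random(1);
        for (int i = 0; i < 1_000_000; i++) data.add(rnd.nextInt());
        System.out.println(Arrays.toString(top100(data.iterator())));
    }
}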
First create an array M of 100 ints and initialize the first element of this array to the first of the N values;
keep track of the index of the current element with another variable, call it CurrentBig.
Iterate through the N values:
if N[i] > M[CurrentBig] {
M[CurrentBig]=N[i]; ( overwrite the current value with the newly found larger number)
CurrentBig++; ( go to the next position in the M array)
CurrentBig %= 100; ( modulo arithmetic saves you from using lists/hashes etc.)
M[CurrentBig]=N[i]; ( pick up the current value again to use it for the next Iteration of the N array)
}
When done, print the M array starting from CurrentBig, 100 times, modulo 100 :-)
For the student: make sure that the last line of the code does not overwrite valid data right before the code exits.

Given a sorted array, find the maximum subarray of repeated values

Yet another interview question asked me to find, given a sorted array, the maximum possible subarray of repeated values in the shortest computational time possible.
Let input array be A[1 ... n]
Find an array B of consecutive integers in A such that:
for x in range(len(B)-1):
B[x] == B[x+1]
I believe that the best algorithm is to divide the array in half and go from the middle outwards, comparing the integers with one another and finding the longest streak of the same integer from the middle. Then I would call the method recursively by dividing the array in half and calling the method on the two halves.
My interviewer said my algorithm is good, but my analysis that the algorithm is O(log n) is incorrect; he never got around to telling me what the correct answer is. My first question is: what is the Big-O analysis of this algorithm? (Show as much work as possible, please! Big-O is not my forte.) And my second question, purely out of curiosity: is there an even more time-efficient algorithm?
The best you can do for this problem is an O(n) solution, so your algorithm cannot possibly be both correct and O(lg n).
Consider for example, the case where the array contains no repeated elements. To determine this, one needs to examine every element, and examining every element is O(n).
This is a simple algorithm that will find the longest subsequence of a repeated element:
start = end = 0
maxLength = 0
i = 0
while i + maxLength < a.length:
    if a[i] == a[i + maxLength]:
        # a is sorted, so a[i .. i + maxLength] are all equal and a run
        # longer than maxLength starts at i; extend it to its end
        while i + maxLength < a.length and a[i] == a[i + maxLength]:
            maxLength += 1
        start = i
        end = i + maxLength
        i += maxLength
    else:
        # any run longer than maxLength that starts after i must contain
        # index i + maxLength, so jump back to the start of the run
        # containing that element and continue from there
        j = i + maxLength
        while a[j - 1] == a[j]:
            j -= 1
        i = j
return a[start:end]
If you have reason to believe the subsequence will be long, you can set the initial value of maxLength to some heuristically selected value to speed things along, and then only look for shorter sequences if you don't find one (i.e. you end up with end == 0 after the first pass.)
I think we all agree that in the worst case scenario, where all of A is unique or where all of A is the same, you have to examine every element in the array to either determine there are no duplicates or determine all the array contains one number. Like the other posters have said, that's going to be O(N). I'm not sure divide & conquer helps you much with algorithmic complexity on this one, though you may be able to simplify the code a bit by using recursion. Divide & conquer really helps cut down on Big O when you can throw away large portions of the input (e.g. Binary Search), but in the case where you potentially have to examine all the input, it's not going to be much different.
I'm assuming the result here is you're just returning the size of the largest B you've found, though you could easily modify this to return B instead.
So on the algorithm front, given that A is sorted, I'm not sure there's going to be anything faster or simpler than just walking through the array in order. It seems like the simplest answer is to have 2 pointers, one starting at index 0 and one starting at index 1. Compare them and then increment them both; each time they're the same you tick a counter upward to give you the current size of B, and when they differ you reset that counter to zero. You also keep around a variable for the max size of a B you've found so far and update it every time you find a bigger B.
In this algorithm, n elements are visited with a constant number of calculations per each visited element, so the running time is O(n).
Given sorted array A[1..n]:
max_start = max_end = 1
max_length = 1
start = end = 1
while start < n
while end <= n && A[start] == A[end]
end++
if end - start > max_length
max_start = start
max_end = end - 1
max_length = end - start
start = end
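For readers who want something runnable, here is the same run-by-run scan as 0-indexed Java; the class and method names are just for illustration:
import java.util.Arrays;

public class LongestRun {
    // Walk the sorted array run by run; O(n) overall.
    static int[] longestRun(int[] a) {
        int bestStart = 0, bestLen = (a.length > 0) ? 1 : 0;
        int start = 0;
        while (start < a.length) {
            int end = start;
            while (end < a.length && a[end] == a[start]) end++; // end of the current run
            if (end - start > bestLen) {
                bestStart = start;
                bestLen = end - start;
            }
            start = end;                                        // jump to the next run
        }
        return Arrays.copyOfRange(a, bestStart, bestStart + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(longestRun(new int[]{1, 1, 2, 2, 2, 3}))); // [2, 2, 2]
    }
}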
Assuming that the longest run of equal integers has length 1, you'll be scanning through the entire array A of n items. Thus, the complexity is not in terms of n alone, but in terms of len(B).
Not sure if the complexity is O(n/len(B)).
Checking the edge cases:
- When len(B) == n, you get an instant result (only checking A[0] and A[n-1])
- When len(B) == 1, you get O(n), checking all elements
- For the normal case, I'm too lazy to write out the algorithm to analyze...
Edit
Given that len(B) is not known in advance, we must take the worst case, i.e. O(n)

Generating permutations with sub-linear memory

I am wondering if there is a sufficiently simple algorithm for generating permutations of N elements, say 1..N, which uses less than O(N) memory. It does not have to compute the n-th permutation, but it must be able to compute all permutations.
Of course, this algorithm should be a generator of some kind, or use some internal data structure that uses less than O(N) memory, since returning the result as a vector of size N already violates the restriction on sub-linear memory.
Let's assume that the random permutation is being generated one entry at a time. The state of the generator must encode the set of elements remaining (run the generator to completion and it must emit exactly those remaining elements), and so, since no possibility can be excluded, the generator state is at least n bits.
Maybe you can, with factoradic numbers. You can extract the resulting permutation from it step by step, so you never have to have the entire result in memory.
But the reason I started with "maybe" is that I'm not sure how the size of the factoradic number itself grows. If it fits in a 32-bit integer or something like that, N would be limited to a constant, so O(N) would equal O(1); therefore we have to use an array for it, but I'm unsure how big it will be in terms of N.
I think the answer has to be "no".
Consider the generator for N-element permutations as a state machine: it must contain at least as many states as there are permutations, else it will start repeating before it finishes generating all of them.
There are N! such permutations, which will require at least ceil(log2(N!)) bits to represent. Stirling's approximation tells us log2(N!) is O(N log N), so we will be unable to create such a generator with sub-linear memory.
The C++ algorithm next_permutation performs an in-place rearrangement of a sequence into its next permutation, or returns false when no further permutations exist. The algorithm is as follows:
template <class BidirectionalIterator>
bool next_permutation(BidirectionalIterator first, BidirectionalIterator last) {
if (first == last) return false; // False for empty ranges.
BidirectionalIterator i = first;
++i;
if (i == last) return false; // False for single-element ranges.
i = last;
--i;
for(;;) {
BidirectionalIterator ii = i--;
// Find an element *n < *(n + 1).
if (*i < *ii) {
BidirectionalIterator j = last;
// Find the last *m greater than *n.
while (!(*i < *--j)) {}
// Swap *n and *m, and reverse from m to the end.
iter_swap(i, j);
reverse(ii, last);
return true;
}
// No n was found.
if (i == first) {
// Reverse the sequence to its original order.
reverse(first, last);
return false;
}
}
}
This uses constant space (the iterators) for each permutation generated. Do you consider that linear?
I think that to even store your result (which will be an ordered list of N items) will be O(N) in memory, no?
Anyhow, to answer your later question about picking a permutation at random, here's a technique that will be better than just producing all N! possibilities in a list, say, and then picking an index randomly. If we can just pick the index randomly and generate the associated permutation from it, we're much better off.
We can imagine the dictionary order on your words/permutations, and associate a unique number to these based on the word's/permutation's order of appearance in the dictionary. E.g., words on three characters would be
perm. index
012 <----> 0
021 <----> 1
102 <----> 2
120 <----> 3
201 <----> 4
210 <----> 5
You'll see later why it was easiest to use the numbers we did, but others could be accommodated with a bit more work.
To choose one at random, you could pick its associated index randomly from the range 0 ... N!-1 with a uniform probability (the simplest implementation of this is clearly out of the question for even moderately large N, I know, but I think there are decent workarounds) and then determine its associated permutation. Notice that the list begins with all the permutations of the last N-1 elements, keeping the first digit fixed equal to 0. After those possibilities are exhausted, we generate all those that start with 1. After these next (N-1)! permutations are exhausted, we generate those that start with a 2. Etc. Thus we can determine the leading digit is Floor[R / (N-1)!], where R was the index in the sense shown above. See now why we zero indexed, too?
To generate the remaining N-1 digits in the permutation, let's say that we determined Floor[R/(N-1)!] = a0. Start with the list {0, ..., N-1} - {a0} (set subtraction). We want the Qth permutation of this list, for Q = R mod (N-1)!. Except for accounting for the fact that there's a missing digit, this is just the same as the problem we've just solved. Recurse.
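A sketch of this index-to-permutation decoding in Java; it keeps the list of remaining elements explicitly, so it uses O(N) working memory (which, as conceded above, you pay for storing the result anyway), and the long index limits it to N <= 20. The class and method names are just for illustration:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NthPermutation {
    // Decode the r-th (0-indexed) permutation of {0, 1, ..., n-1} in lexicographic order.
    // A long index only works for n <= 20, since 21! overflows 64 bits.
    static int[] nthPermutation(int n, long r) {
        List<Integer> remaining = new ArrayList<>();
        for (int i = 0; i < n; i++) remaining.add(i);

        long[] fact = new long[n];
        fact[0] = 1;
        for (int i = 1; i < n; i++) fact[i] = fact[i - 1] * i;

        int[] perm = new int[n];
        for (int pos = 0; pos < n; pos++) {
            long f = fact[n - 1 - pos];          // (N-1)!, then (N-2)!, ...
            int digit = (int) (r / f);           // which of the remaining elements leads
            r %= f;
            perm[pos] = remaining.remove(digit); // "set subtraction", as described in the text
        }
        return perm;
    }

    public static void main(String[] args) {
        // Index 3 of the 3-element permutations should be {1, 2, 0}, the "120" row in the table above.
        System.out.println(Arrays.toString(nthPermutation(3, 3)));
    }
}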

Resources