Population segmentation algorithm - algorithm

I have a population of 50 ordered integers (1,2,3,..,50) and I look for a generic way to slice it "n" ways ("n" is the number of cutoff points ranging from 1 to 25) that maintains the order of the elements.
For example, for n=1 (one cutoff point) there are 49 possible grouping alternatives ([1,2-49], [1-2,3-50], [1-3,4-50],...). For n=2 (two cutoff points), the grouping alternatives are like: [1,2,3-50], [1,2-3,4-50],...
Could you recommend any general-purpose algorithm to complete this task in an efficient way?
Thanks,
Chris
Thanks everyone for your feedback. I reviewed all your comments and I am working on a generic solution that will return all combinations (e.g., [1,2,3-50], [1,2-3,4-50],...) for all numbers of cutoff points.
Thanks again,
Chris

Let sequence length be N, and number of slices n.
That problem becomes easier when you notice that, choosing a slicing to n slices is equivalent to choosing n - 1 from N - 1 possible split points (a split point is between every two numbers in the sequence). Hence there is (N - 1 choose n - 1) such slicings.
To generate all slicings (to n slices), you have to generate all n - 1 element subsets of numbers from 1 to N - 1.
The exact algorithm for this problem is placed here: How to iteratively generate k elements subsets from a set of size n in java?

Do you need the cutoffs, or are you just counting them. If you're just going to count them, then it's simple:
1 cutoff = (n-1) options
2 cutoffs = (n-1)*(n-2)/2 options
3 cutoffs = (n-1)(n-2)(n-3)/4 options
you can see the patterns here
If you actually need the cutoffs, then you have to actually do the loops, but since n is so small, Emilio is right, just brute force it.
1 cutoff
for(i=1,i<n;++i)
cout << i;
2 cutoffs
for(i=1;<i<n;++i)
for(j=i+1,j<n;++j)
cout << i << " " << j;
3 cutoffs
for(i=1;<i<n;++i)
for(j=i+1,j<n;++j)
for(k=j+1,k<n;++k)
cout << i << " " << j << " " << k;
again, you can see the pattern

So you want to select 25 split point from 49 choices in all possible ways. There are a lot of well known algorithms to do that.
I want to draw your attention to another side of this problem. There are 49!/(25!*(49-25)!) = 63 205 303 218 876 >= 2^45 ~= 10^13 different combinations. So if you want to store it, the required amount of memory is 32TB * sizeof(Combination). I guess that it will pass 1 PB mark.
Now lets assume that you want to process generated data on the fly. Lets make rather optimistic assumption that you can process 1 million combinations per second (here i assume that there is no parallelization). So this task will take 10^7 seconds = 2777 hours = 115 days.
This problem is more complicated than it seems at first glance. If you want to solve if at home in reasonable time, my suggestion is to change the strategy or wait for the advance of quantum computers.

This will generate an array of all the ranges, but I warn you, it'll take tons of memory, due to the large numbers of results (50 elements with 3 splits is 49*48*47=110544) I haven't even tried to compile it, so there's probably errors, but this is the general algorithm I'd use.
typedef std::vector<int>::iterator iterator_t;
typedef std::pair<iterator_t, iterator_t> range_t;
typedef std::vector<range_t> answer_t;
answer_t F(std::vector<int> integers, int slices) {
answer_t prev; //things to slice more
answer_t results; //thin
//initialize results for 0 slices
results.push_back(answer(range(integers.begin(), integers.end()), 1));
//while there's still more slicing to do
while(slices--) {
//move "results" to the "things to slice" pile
prev.clear();
prev.swap(results);
//for each thing to slice
for(int group=0; group<prev.size(); ++group) {
//for each range
for(int crange=0; crange<prev[group].size(); ++crange) {
//for each place in that range
for(int newsplit=0; newsplit<prev[group][crange].size(); ++newsplit) {
//copy the "result"
answer_t cur = prev[group];
//slice it
range_t L = range(cur[crange].first, cur[crange].first+newsplit);
range_t R = range(cur[crange].first+newsplit), cur[crange].second);
answer_t::iterator loc = cur.erase(cur.begin()+crange);
cur.insert(loc, R);
cur.insert(loc, L);
//add it to the results
results.push_back(cur);
}
}
}
}
return results;
}

Related

Memory-constrained coin changing for numbers up to one billion

I faced this problem on one training. Namely we have given N different values (N<= 100). Let's name this array A[N], for this array A we are sure that we have 1 in the array and A[i] ≤ 109. Secondly we have given number S where S ≤ 109.
Now we have to solve classic coin problem with this values. Actually we need to find minimum number of element which will sum to exactly S. Every element from A can be used infinite number of times.
Time limit: 1 sec
Memory limit: 256 MB
Example:
S = 1000, N = 10
A[] = {1,12,123,4,5,678,7,8,9,10}. The result is 10.
1000 = 678 + 123 + 123 + 12 + 12 + 12 + 12 + 12 + 12 + 4
What I have tried
I tried to solve this with classic dynamic programming coin problem technique but it uses too much memory and it gives memory limit exceeded.
I can't figure out what should we keep about those values. Thanks in advance.
Here are the couple test cases that cannot be solved with the classic dp coin problem.
S = 1000000000 N = 100
1 373241370 973754081 826685384 491500595 765099032 823328348 462385937
251930295 819055757 641895809 106173894 898709067 513260292 548326059
741996520 959257789 328409680 411542100 329874568 352458265 609729300
389721366 313699758 383922849 104342783 224127933 99215674 37629322
230018005 33875545 767937253 763298440 781853694 420819727 794366283
178777428 881069368 595934934 321543015 27436140 280556657 851680043
318369090 364177373 431592761 487380596 428235724 134037293 372264778
267891476 218390453 550035096 220099490 71718497 860530411 175542466
548997466 884701071 774620807 118472853 432325205 795739616 266609698
242622150 433332316 150791955 691702017 803277687 323953978 521256141
174108096 412366100 813501388 642963957 415051728 740653706 68239387
982329783 619220557 861659596 303476058 85512863 72420422 645130771
228736228 367259743 400311288 105258339 628254036 495010223 40223395
110232856 856929227 25543992 957121494 359385967 533951841 449476607
134830774
OUTPUT FOR THIS TEST CASE: 5
S = 999865497 N = 7
1 267062069 637323855 219276511 404376890 528753603 199747292
OUTPUT FOR THIS TEST CASE: 1129042
S = 1000000000 N = 40
1 12 123 4 5 678 7 8 9 10 400 25 23 1000 67 98 33 46 79 896 11 112 1223 412
532 6781 17 18 19 170 1400 925 723 11000 607 983 313 486 739 896
OUTPUT FOR THIS TEST CASE: 90910
(NOTE: Updated and edited for clarity. Complexity Analysis added at the end.)
OK, here is my solution, including my fixes to the performance issues found by #PeterdeRivaz. I have tested this against all of the test cases provided in the question and the comments and it finishes all in under a second (well, 1.5s in one case), using primarily only the memory for the partial results cache (I'd guess about 16MB).
Rather than using the traditional DP solution (which is both too slow and requires too much memory), I use a Depth-First, Greedy-First combinatorial search with pruning using current best results. I was surprised (very) that this works as well as it does, but I still suspect that you could construct test sets that would take a worst-case exponential amount of time.
First there is a master function that is the only thing that calling code needs to call. It handles all of the setup and initialization and calls everything else. (all code is C#)
// Find the min# of coins for a specified sum
int CountChange(int targetSum, int[] coins)
{
// init the cache for (partial) memoization
PrevResultCache = new PartialResult[1048576];
// make sure the coins are sorted lowest to highest
Array.Sort(coins);
int curBest = targetSum;
int result = CountChange_r(targetSum, coins, coins.GetLength(0)-1, 0, ref curBest);
return result;
}
Because of the problem test-cases raised by #PeterdeRivaz I have also added a partial results cache to handle when there are large numbers in N[] that are close together.
Here is the code for the cache:
// implement a very simple cache for previous results of remainder counts
struct PartialResult
{
public int PartialSum;
public int CoinVal;
public int RemainingCount;
}
PartialResult[] PrevResultCache;
// checks the partial count cache for already calculated results
int PrevAddlCount(int currSum, int currCoinVal)
{
int cacheAddr = currSum & 1048575; // AND with (2^20-1) to get only the first 20 bits
PartialResult prev = PrevResultCache[cacheAddr];
// use it, as long as it's actually the same partial sum
// and the coin value is at least as large as the current coin
if ((prev.PartialSum == currSum) && (prev.CoinVal >= currCoinVal))
{
return prev.RemainingCount;
}
// otherwise flag as empty
return 0;
}
// add or overwrite a new value to the cache
void AddPartialCount(int currSum, int currCoinVal, int remainingCount)
{
int cacheAddr = currSum & 1048575; // AND with (2^20-1) to get only the first 20 bits
PartialResult prev = PrevResultCache[cacheAddr];
// only add if the Sum is different or the result is better
if ((prev.PartialSum != currSum)
|| (prev.CoinVal <= currCoinVal)
|| (prev.RemainingCount == 0)
|| (prev.RemainingCount >= remainingCount)
)
{
prev.PartialSum = currSum;
prev.CoinVal = currCoinVal;
prev.RemainingCount = remainingCount;
PrevResultCache[cacheAddr] = prev;
}
}
And here is the code for the recursive function that does the actual counting:
/*
* Find the minimum number of coins required totaling to a specifuc sum
* using a list of coin denominations passed.
*
* Memory Requirements: O(N) where N is the number of coin denominations
* (primarily for the stack)
*
* CPU requirements: O(Sqrt(S)*N) where S is the target Sum
* (Average, estimated. This is very hard to figure out.)
*/
int CountChange_r(int targetSum, int[] coins, int coinIdx, int curCount, ref int curBest)
{
int coinVal = coins[coinIdx];
int newCount = 0;
// check to see if we are at the end of the search tree (curIdx=0, coinVal=1)
// or we have reached the targetSum
if ((coinVal == 1) || (targetSum == 0))
{
// just use math get the final total for this path/combination
newCount = curCount + targetSum;
// update, if we have a new curBest
if (newCount < curBest) curBest = newCount;
return newCount;
}
// prune this whole branch, if it cannot possibly improve the curBest
int bestPossible = curCount + (targetSum / coinVal);
if (bestPossible >= curBest)
return bestPossible; //NOTE: this is a false answer, but it shouldnt matter
// because we should never use it.
// check the cache to see if a remainder-count for this partial sum
// already exists (and used coins at least as large as ours)
int prevRemCount = PrevAddlCount(targetSum, coinVal);
if (prevRemCount > 0)
{
// it exists, so use it
newCount = prevRemCount + targetSum;
// update, if we have a new curBest
if (newCount < curBest) curBest = newCount;
return newCount;
}
// always try the largest remaining coin first, starting with the
// maximum possible number of that coin (greedy-first searching)
newCount = curCount + targetSum;
for (int cnt = targetSum / coinVal; cnt >= 0; cnt--)
{
int tmpCount = CountChange_r(targetSum - (cnt * coinVal), coins, coinIdx - 1, curCount + cnt, ref curBest);
if (tmpCount < newCount) newCount = tmpCount;
}
// Add our new partial result to the cache
AddPartialCount(targetSum, coinVal, newCount - curCount);
return newCount;
}
Analysis:
Memory: Memory usage is pretty easy to determine for this algorithm. Basiclly there's only the partial results cache and the stack. The cache is fixed at appx. 1 million entries times the size of each entry (3*4 bytes), so about 12MB. The stack is limited to O(N), so together, memory is clearly not a problem.
CPU: The run-time complexity of this algorithm starts out hard to determine and then gets harder, so please excuse me because there's a lot of hand-waving here. I tried to search for an analysis of just the brute-force problem (combinatorial search of sums of N*kn base values summing to S) but not much turned up. What little there was tended to say it was O(N^S), which is clearly too high. I think that a fairer estimate is O(N^(S/N)) or possibly O(N^(S/AVG(N)) or even O(N^(S/(Gmean(N))) where Gmean(N) is the geometric mean of the elements of N[]. This solution starts out with the brute-force combinatorial search and then improves it with two significant optimizations.
The first is the pruning of branches based on estimates of the best possible results for that branch versus what the best result it has already found. If the best-case estimators were perfectly accurate and the work for branches was perfectly distributed, this would mean that if we find a result that is better than 90% of the other possible cases, then pruning would effectively eliminate 90% of the work from that point on. To make a long story short here, this should work out that the amount of work still remaining after pruning should shrink harmonically as it progress. Assuming that some kind of summing/integration should be applied to get a work total, this appears to me to work out to a logarithm of the original work. So let's call it O(Log(N^(S/N)), or O(N*Log(S/N)) which is pretty darn good. (Though O(N*Log(S/Gmean(N))) is probably more accurate).
However, there are two obvious holes with this. First, it is true that the best-case estimators are not perfectly accurate and thus they will not prune as effectively as assumed above, but, this is somewhat counter-balanced by the Greedy-First ordering of the branches which gives the best chances for finding better solutions early in the search which increase the effectiveness of pruning.
The second problem is that the best-case estimator works better when the different values of N are far apart. Specifically, if |(S/n2 - S/n1)| > 1 for any 2 values in N, then it becomes almost perfectly effective. For values of N less than SQRT(S), then even two adjacent values (k, k+1) are far enough apart that that this rule applies. However for increasing values above SQRT(S) a window opens up so that any number of N-values within that window will not be able to effectively prune each other. The size of this window is approximately K/SQRT(S). So if S=10^9, when K is around 10^6 this window will be almost 30 numbers wide. This means that N[] could contain 1 plus every number from 1000001 to 1000029 and the pruning optimization would provide almost no benefit.
To address this, I added the partial results cache which allows memoization of the most recent partial sums up to the target S. This takes advantage of the fact that when the N-values are close together, they will tend to have an extremely high number of duplicates in their sums. As best as I can figure, this effectiveness is approximately the N times the J-th root of the problem size where J = S/K and K is some measure of the average size of the N-values (Gmean(N) is probably the best estimate). If we apply this to the brute-force combinatorial search, assuming that pruning is ineffective, we get O((N^(S/Gmean(N)))^(1/Gmean(N))), which I think is also O(N^(S/(Gmean(N)^2))).
So, at this point take your pick. I know this is really sketchy, and even if it is correct, it is still very sensitive to the distribution of the N-values, so lots of variance.
[I've replaced the previous idea about bit operations because it seems to be too time consuming]
A bit crazy idea and incomplete but may work.
Let's start with introducing f(n,s) which returns number of combinations in which s can be composed from n coins.
Now, how f(n+1,s) is related to f(n)?
One of possible ways to calculate it is:
f(n+1,s)=sum[coin:coins]f(n,s-coin)
For example, if we have coins 1 and 3,
f(0,)=[1,0,0,0,0,0,0,0] - with zero coins we can have only zero sum
f(1,)=[0,1,0,1,0,0,0,0] - what we can have with one coin
f(2,)=[0,0,1,0,2,0,1,0] - what we can have with two coins
We can rewrite it a bit differently:
f(n+1,s)=sum[i=0..max]f(n,s-i)*a(i)
a(i)=1 if we have coin i and 0 otherwise
What we have here is convolution: f(n+1,)=conv(f(n,),a)
https://en.wikipedia.org/wiki/Convolution
Computing it as definition suggests gives O(n^2)
But we can use Fourier transform to reduce it to O(n*log n).
https://en.wikipedia.org/wiki/Convolution#Convolution_theorem
So now we have more-or-less cheap way to find out what numbers are possible with n coins without going incrementally - just calculate n-th power of F(a) and apply inverse Fourier transform.
This allows us to make a kind of binary search which can help handling cases when the answer is big.
As I said the idea is incomplete - for now I have no idea how to combine bit representation with Fourier transforms (to satisfy memory constraint) and whether we will fit into 1 second on any "regular" CPU...

How to turn integers into Fibonacci coding efficiently?

Fibonacci sequence is obtained by starting with 0 and 1 and then adding the two last numbers to get the next one.
All positive integers can be represented as a sum of a set of Fibonacci numbers without repetition. For example: 13 can be the sum of the sets {13}, {5,8} or {2,3,8}. But, as we have seen, some numbers have more than one set whose sum is the number. If we add the constraint that the sets cannot have two consecutive Fibonacci numbers, than we have a unique representation for each number.
We will use a binary sequence (just zeros and ones) to do that. For example, 17 = 1 + 3 + 13. Then, 17 = 100101. See figure 2 for a detailed explanation.
I want to turn some integers into this representation, but the integers may be very big. How to I do this efficiently.
The problem itself is simple. You always pick the largest fibonacci number less than the remainder. You can ignore the the constraint with the consecutive numbers (since if you need both, the next one is the sum of both so you should have picked that one instead of the initial two).
So the problem remains how to quickly find the largest fibonacci number less than some number X.
There's a known trick that starting with the matrix (call it M)
1 1
1 0
You can compute fibbonacci number by matrix multiplications(the xth number is M^x). More details here: https://www.nayuki.io/page/fast-fibonacci-algorithms . The end result is that you can compute the number you're look in O(logN) matrix multiplications.
You'll need large number computations (multiplications and additions) if they don't fit into existing types.
Also store the matrices corresponding to powers of two you compute the first time, since you'll need them again for the results.
Overall this should be O((logN)^2 * large_number_multiplications/additions)).
First I want to tell you that I really liked this question, I didn't know that All positive integers can be represented as a sum of a set of Fibonacci numbers without repetition, I saw the prove by induction and it was awesome.
To respond to your question I think that we have to figure how the presentation is created. I think that the easy way to find this is that from the number we found the closest minor fibonacci item.
For example if we want to present 40:
We have Fib(9)=34 and Fib(10)=55 so the first element in the presentation is Fib(9)
since 40 - Fib(9) = 6 and (Fib(5) =5 and Fib(6) =8) the next element is Fib(5). So we have 40 = Fib(9) + Fib(5)+ Fib(2)
Allow me to write this in C#
class Program
{
static void Main(string[] args)
{
List<int> fibPresentation = new List<int>();
int numberToPresent = Convert.ToInt32(Console.ReadLine());
while (numberToPresent > 0)
{
int k =1;
while (CalculateFib(k) <= numberToPresent)
{
k++;
}
numberToPresent = numberToPresent - CalculateFib(k-1);
fibPresentation.Add(k-1);
}
}
static int CalculateFib(int n)
{
if (n == 1)
return 1;
int a = 0;
int b = 1;
// In N steps compute Fibonacci sequence iteratively.
for (int i = 0; i < n; i++)
{
int temp = a;
a = b;
b = temp + b;
}
return a;
}
}
Your result will be in fibPresentation
This encoding is more accurately called the "Zeckendorf representation": see https://en.wikipedia.org/wiki/Fibonacci_coding
A greedy approach works (see https://en.wikipedia.org/wiki/Zeckendorf%27s_theorem) and here's some Python code that converts a number to this representation. It uses the first 100 Fibonacci numbers and works correctly for all inputs up to 927372692193078999175 (and incorrectly for any larger inputs).
fibs = [0, 1]
for _ in xrange(100):
fibs.append(fibs[-2] + fibs[-1])
def zeck(n):
i = len(fibs) - 1
r = 0
while n:
if fibs[i] <= n:
r |= 1 << (i - 2)
n -= fibs[i]
i -= 1
return r
print bin(zeck(17))
The output is:
0b100101
As the greedy approach seems to work, it suffices to be able to invert the relation N=Fn.
By the Binet formula, Fn=[φ^n/√5], where the brackets denote the nearest integer. Then with n=floor(lnφ(√5N)) you are very close to the solution.
17 => n = floor(7.5599...) => F7 = 13
4 => n = floor(4.5531) => F4 = 3
1 => n = floor(1.6722) => F1 = 1
(I do not exclude that some n values can be off by one.)
I'm not sure if this is an efficient enough for you, but you could simply use Backtracking to find a(the) valid representation.
I would try to start the backtracking steps by taking the biggest possible fib number and only switch to smaller ones if the consecutive or the only once constraint is violated.

Find all number pairs in a given range

I have N numbers let say 20 30 15 30 30 40 15 20. Now I want to find how many numbers pairs are in a given range.(L and R given).
number pair= both numbers are same.
My approach:
Create a Map of Array, such that key of map= number, and value=ArrayList of indexes at which that number appears. Then I traverse from L to R and for each value in that range I traverse in the corresponding arraylist to find if there is a pair that fits in range, and then increment count.
But I think this approach is too slow. Is there some faster method to do the same?
Example: for above given sequence and L=0 and R=6
Answer=5. Possible pairs are 1 for 20, 1 for 15 and 3 for 30.
I am developing a solution, assuming numbers can be upto 10^8( and non negative).
If you are looking for speed and don't care about memory there's maybe a better way.
You can use a set as an auxiliary data structure to see if a number was found, and then simply walk the array. Pseudo code:
int numPairs = 0;
set setVisited;
for (int i = L; i < R; i++) {
if (setVisited.contains(a[i])) {
// found the second of a pair. count it up and reset.
numPairs++;
setVisited.remove(a[i]);
} else {
// remember that we saw this number, so we can spot the next pair.
setVisited.add(a[i]);
}
New solution... hopefully better this time. Psuedo C-ish code:
// Sort the sub-array a[L..R]. This can be done O(nlogn) using qsort.
// ... code omitted ...
// Walk through the sorted array counting how many times number occurs.
// When the number changes, count how many possibles ways to make pairs
// from the given count.
int totalPairs = 0;
int count = 1;
int current = a[L];
for (i = L+1; i < R; i++) {
if (a[i] == current) { // found another, keep counting
count++;
} else { // found a different one
if (count > 1) { // need at least 2 to make a pair!
totalPairs += factorial(count) / 2;
}
}
// start counting the new one
current = a[i];
count = 1;
}
// count the final one
if (count > 1) {
totalPairs += factorial(count) / 2;
}
The sort runs O(nlgn), and the loop body runs O(n). Interestingly the performance barrier is now factorial. For really long arrays with really high numbers of occurrences, factorial is expensive unless you optimize further.
One way would be to have loop count repetitions but not compute factorial yet -- leave yet another array of counts of numbers. Then sort this array (again Nlg(N)), then walk through this array and re-use previously computed factorial to compute the next one.
Also if this array gets big, you'll need a large integer to represent the total. I don't know the O() performance of large integers off the top of my head.
Cool problem!

Finding if a random number has occured before or not

Let me be clear at start that this is a contrived example and not a real world problem.
If I have a problem of creating a random number between 0 to 10. I do this 11 times making sure that a previously occurred number is not drawn again, if I get a repeated number,
I create another random number again to make sure it has not be seen earlier. So essentially I get a a sequence of unique numbers from 0 - 10 in a random order
e.g. 3 1 2 0 5 9 4 8 10 6 7 and so on
Now to come up with logic to make sure that the random numbers are unique and not one which we have drawn before, we could use many approaches
Use C++ std::bitset and set the bit corresponding to the index equal to value of each random no. and check it next time when a new random number is drawn.
Or
Use a std::map<int,int> to count the number of times or even simple C array with some sentinel values stored in that array to indicate if that number has occurred or not.
If I have to avoid these methods above and use some mathematical/logical/bitwise operation to find whether a random number has been draw before or not, is there a way?
You don't want to do it the way you suggest. Consider what happens when you have already selected 10 of the 11 items; your random number generator will cycle until it finds the missing number, which might be never, depending on your random number generator.
A better solution is to create a list of numbers 0 to 10 in order, then shuffle the list into a random order. The normal algorithm for doing this is due to Knuth, Fisher and Yates: starting at the first element, swap each element with an element at a location greater than the current element in the array.
function shuffle(a, n)
for i from n-1 to 1 step -1
j = randint(i)
swap(a[i], a[j])
We assume an array with indices 0 to n-1, and a randint function that sets j to the range 0 <= j <= i.
Use an array and add all possible values to it. Then pick one out of the array and remove it. Next time, pick again until the array is empty.
Yes, there is a mathematical way to do it, but it is a bit expansive.
have an array: primes[] where primes[i] = the i'th prime number. So its beginning will be [2,3,5,7,11,...].
Also store a number mult Now, once you draw a number (let it be i) you check if mult % primes[i] == 0, if it is - the number was drawn before, if it wasn't - then the number was not. chose it and do mult = mult * primes[i].
However, it is expansive because it might require a lot of space for large ranges (the possible values of mult increases exponentially
(This is a nice mathematical approach, because we actually look at a set of primes p_i, the array of primes is only the implementation to the abstract set of primes).
A bit manipulation alternative for small values is using an int or long as a bitset.
With this approach, to check a candidate i is not in the set you only need to check:
if (pow(2,i) & set == 0) // not in the set
else //already in the set
To enter an element i to the set:
set = set | pow(2,i)
A better approach will be to populate a list with all the numbers, shuffle it with fisher-yates shuffle, and iterate it for generating new random numbers.
If I have to avoid these methods above and use some
mathematical/logical/bitwise operation to find whether a random number
has been draw before or not, is there a way?
Subject to your contrived constraints yes, you can imitate a small bitset using bitwise operations:
You can choose different integer types on the right according to what size you need.
bitset code bitwise code
std::bitset<32> x; unsigned long x = 0;
if (x[i]) { ... } if (x & (1UL << i)) { ... }
// assuming v is 0 or 1
x[i] = v; x = (x & ~(1UL << i)) | ((unsigned long)v << i);
x[i] = true; x |= (1UL << i);
x[i] = false; x &= ~(1UL << i);
For a larger set (beyond the size in bits of unsigned long long), you will need an array of your chosen integer type. Divide the index by the width of each value to know what index to look up in the array, and use the modulus for the bit shifts. This is basically what bitset does.
I'm assuming that the various answers that tell you how best to shuffle 10 numbers are missing the point entirely: that your contrived constraints are there because you do not in fact want or need to know how best to shuffle 10 numbers :-)
Keep a variable too map the drawn numbers. The i'th bit of that variable will be 1 if the number was drawn before:
int mapNumbers = 0;
int generateRand() {
if (mapNumbers & ((1 << 11) - 1) == ((1 << 11) - 1)) return; // return if all numbers have been generated
int x;
do {
x = newVal();
} while (!x & mapNumbers);
mapNumbers |= (1 << x);
return x;
}

Interview Question: Find Median From Mega Number Of Integers

There is a file that contains 10G(1000000000) number of integers, please find the Median of these integers. you are given 2G memory to do this. Can anyone come up with an reasonable way? thanks!
Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.
Now you count up in that histogram until you reach the bin that covers the midpoint of the values.
Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.
Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.
Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).
Here's some sample Scala code that does this:
def medianFinder(numbers: Iterable[Int]) = {
def midArgMid(a: Array[Long], mid: Long) = {
val cuml = a.scanLeft(0L)(_ + _).drop(1)
cuml.zipWithIndex.dropWhile(_._1 < mid).head
}
val topHistogram = new Array[Long](65536)
var count = 0L
numbers.foreach(number => {
count += 1
topHistogram(number>>>16) += 1
})
val (topCount,topIndex) = midArgMid(topHistogram, (count+1)/2)
val botHistogram = new Array[Long](65536)
numbers.foreach(number => {
if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1
})
val (botCount,botIndex) =
midArgMid(botHistogram, (count+1)/2 - (topCount-topHistogram(topIndex)))
(topIndex<<16) + botIndex
}
and here it is working on a small set of input data:
scala> medianFinder(List(1,123,12345,1234567,123456789))
res18: Int = 12345
If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.
You can use the Medians of Medians algorithm.
If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You edited your original question; I can see now that you can't read them into memory, see below.
If you can't read them into memory, this is what I came up with:
Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.
Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the x-th largest integer. You can discard all of these except for the x-th largest, which I'll call x1.
Do another pass through, finding the next x largest integers less than x1, the least of which is x2.
I think you can see where I'm going with this. After a few passes, you will have read in the (S/2)-th largest integer (you'll have to keep track of how many integers you've found), which is your median. If S is even then you'll average the two in the middle.
Make a pass through the file and find count of integers and minimum and maximum integer value.
Take midpoint of min and max, and get count, min and max for values either side of the midpoint - by again reading through the file.
partition count > count => median lies within that partition.
Repeat for the partition, taking into account size of 'partitions to the left' (easy to maintain), and also watching for min = max.
Am sure this'd work for an arbitrary number of partitions as well.
Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median.
The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.
Given n = number of integers in the original file:
Running time: O(nlogn)
Memory: O(1), adjustable
Disk: O(n)
Check out Torben's method in here:http://ndevilla.free.fr/median/median/index.html. It also has implementation in C at the bottom of the document.
My best guess that probabilistic median of medians would be the fastest one. Recipe:
Take next set of N integers (N should be big enough, say 1000 or 10000 elements)
Then calculate median of these integers and assign it to variable X_new.
If iteration is not first - calculate median of two medians:
X_global = (X_global + X_new) / 2
When you will see that X_global fluctuates not much - this means that you found approximate median of data.
But there some notes :
question arises - Is median error acceptable or not.
integers must be distributed randomly in a uniform way, for solution to work
EDIT:
I've played a bit with this algorithm, changed a bit idea - in each iteration we should sum X_new with decreasing weight, such as:
X_global = k*X_global + (1.-k)*X_new :
k from [0.5 .. 1.], and increases in each iteration.
Point is to make calculation of median to converge fast to some number in very small amount of iterations. So that very approximate median (with big error) is found between 100000000 array elements in only 252 iterations !!! Check this C experiment:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#define ARRAY_SIZE 100000000
#define RANGE_SIZE 1000
// probabilistic median of medians method
// should print 5000 as data average
// from ARRAY_SIZE of elements
int main (int argc, const char * argv[]) {
int iter = 0;
int X_global = 0;
int X_new = 0;
int i = 0;
float dk = 0.002;
float k = 0.5;
srand(time(NULL));
while (i<ARRAY_SIZE && k!=1.) {
X_new=0;
for (int j=i; j<i+RANGE_SIZE; j++) {
X_new+=rand()%10000 + 1;
}
X_new/=RANGE_SIZE;
if (iter>0) {
k += dk;
k = (k>1.)? 1.:k;
X_global = k*X_global+(1.-k)*X_new;
}
else {
X_global = X_new;
}
i+=RANGE_SIZE+1;
iter++;
printf("iter %d, median = %d \n",iter,X_global);
}
return 0;
}
Opps seems i'm talking about mean, not median. If it is so, and you need exactly median, not mean - ignore my post. In any case mean and median are very related concepts.
Good luck.
Here is the algorithm described by #Rex Kerr implemented in Java.
/**
* Computes the median.
* #param arr Array of strings, each element represents a distinct binary number and has the same number of bits (padded with leading zeroes if necessary)
* #return the median (number of rank ceil((m+1)/2) ) of the array as a string
*/
static String computeMedian(String[] arr) {
// rank of the median element
int m = (int) Math.ceil((arr.length+1)/2.0);
String bitMask = "";
int zeroBin = 0;
while (bitMask.length() < arr[0].length()) {
// puts elements which conform to the bitMask into one of two buckets
for (String curr : arr) {
if (curr.startsWith(bitMask))
if (curr.charAt(bitMask.length()) == '0')
zeroBin++;
}
// decides in which bucket the median is located
if (zeroBin >= m)
bitMask = bitMask.concat("0");
else {
m -= zeroBin;
bitMask = bitMask.concat("1");
}
zeroBin = 0;
}
return bitMask;
}
Some test cases and updates to the algorithm can be found here.
I was also asked the same question and i couldn't tell an exact answer so after the interview i went through some books on interviews and here is what i found from Cracking The Coding interview book.
Example: Numbers are randomly generated and stored into an (expanding) array. How
wouldyoukeep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and
sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements
sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But, be careful—if there's an even number of elements, the median is actually the average
of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of max and mins.
This is actually interesting—if you had two heaps, you could keep track of the bigger
half and the smaller half of the elements. The bigger half is kept in a min heap, such
that the smallest element in the bigger half is at the root.The smaller half is kept in a
max heap, such that the biggest element of the smaller half is at the root. Now, with
these data structures, you have the potential median elements at the roots. If the
heaps are no longer the same size, you can quickly "rebalance" the heaps by popping
an element off the one heap and pushing it onto the other.
Note that the more problems you do, the more developed your instinct on which data
structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.

Resources