I have a problem with nesting lengths; please suggest the best way to solve it. The problem is as follows.
I have a standard length, call it the total length (this is the length we need to fill with blocks of specific lengths).
The input is a list of block lengths, e.g. 5000, 4000, 3000,
and the gap between adjacent blocks is a range, e.g. 200 to 500 (the gap can be adjusted within this range).
We have to fill the total length with the available blocks, leaving a gap between adjacent blocks, and each gap must lie within the range given above.
Please suggest a way of solving this problem.
This problem is essentially the Subset sum problem with a small twist. You can use the same pseudo-polynomial time solution as subset sum: create an array of length "total length" (which we call n from now on) and, for each length k, add it to every existing length in the array (so if element m is populated, create a new entry at m + k (if m + k ≤ n), but leave the existing one at m as well), as well as creating a new array entry at location k representing the creation of a new set. You can build up a set of entries at array element i to represent the set of lists of length blocks totaling i. Each set entry should link back to the array entry it came from, which it can do by simply storing the last length that got you there. This is similar to a question I answered recently here, which you can adjust to allow duplicates as you see fit.
However, you need to modify the above approach to account for the gaps. Let's call the minimum gap between each element x and the maximum gap y. When adding an entry of length k, we include the minimum gap whenever adding it to another entry (so if m is populated, we actually create the entry at m + k + x). We continue to create the initial entry at k, because gaps only occur between elements and a lone block has none. When we create an entry, we can also determine if it fills the space. Suppose the new entry contains t elements and has total m (with every gap at its minimum). It has t - 1 gaps, each of which can stretch by at most y - x, so it fills the space iff n - (t - 1)(y - x) ≤ m ≤ n. If it fills the space, we should add it to a solution list. Depending on how many solutions you want, you can terminate the algorithm as soon as enough solutions are found, or let it find all solutions. At the end, simply iterate through the solutions list.
Anything within the range can represent its gaps in any of a number of different ways, but one way that works is to greedily allocate the "slack" - for example, if you are 1000 away from the new total length using your above example, you can pick the first three gaps to be 500 (which is 300 extra slack each, for 900 total extra) and then the fourth to be 300 (for the extra 100 slack, totaling 1000) and every additional gap should be minimum (200).
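The approach above can be sketched in Python roughly as follows. This is only an illustration under some assumptions of mine: the function name fill_length is made up, blocks may be reused any number of times, gaps exist only between blocks (not at the ends), and the search stops at the first layout that fits. States are keyed by (number of blocks, total with minimum gaps) rather than by total alone, so the fill test can use the right gap count.

```python
def fill_length(total, blocks, gap_min, gap_max):
    """Return (chosen_blocks, gaps) that fill `total` exactly, or None."""
    # parent[(t, m)] = length of the block added last to reach a layout of
    # t blocks whose lengths plus (t - 1) minimum gaps add up to m.
    parent = {}
    frontier = []
    for k in blocks:
        if k <= total and (1, k) not in parent:
            parent[(1, k)] = k
            frontier.append((1, k))

    while frontier:
        t, m = frontier.pop()
        slack = total - m  # extra length the gaps would have to absorb
        if slack <= (t - 1) * (gap_max - gap_min):
            # Walk the parent links back to recover the blocks in order.
            chosen, state = [], (t, m)
            while state[0] > 0:
                k = parent[state]
                chosen.append(k)
                used_gap = gap_min if state[0] > 1 else 0
                state = (state[0] - 1, state[1] - k - used_gap)
            chosen.reverse()
            # Greedily hand the slack to the earliest gaps.
            gaps = []
            for _ in range(t - 1):
                extra = min(gap_max - gap_min, slack)
                gaps.append(gap_min + extra)
                slack -= extra
            return chosen, gaps
        for k in blocks:  # assumption: a block length may be used repeatedly
            nxt = (t + 1, m + gap_min + k)
            if nxt[1] <= total and nxt not in parent:
                parent[nxt] = k
                frontier.append(nxt)
    return None
```

For instance, fill_length(12700, [5000, 4000, 3000], 200, 500) returns some list of blocks plus gaps that sum to 12700, while fill_length(5100, [5000, 4000, 3000], 200, 500) returns None because no layout fits.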


Bin Packing with Overflow

I have N bins (homogenous or heterogeneous in size, depending on variant of task) in which I am trying to fit M items (always heterogeneous in size). Items can be larger than a single bin and are allowed to overflow to the next bin(s) (but no wrap around from bin N-1 to 0).
The more bins an item spans, the higher its allocation cost.
I want to minimize the overall allocation cost of fitting all M into N bins.
Everything is static. It is guaranteed that all M fit in N.
I think I am looking for a variant of the Bin Packing algorithm. But any hints towards an existing solution/approximation/good heuristic are appreciated.
My current approach looks like this:
sort items by size
for i in items:
    for b in bins:
        try allocation of i starting at b
        if allocation valid:
            record cost
    do allocation of i in b with lowest recorded cost
    update all b fill level
So basically a greedy by size approach with O(MxNxC) runtime, where C~"longest allocation across banks" (try allocation takes C time).
I would suggest dynamic programming for the exact solution. To make it easier to visualize, assume each bin's size is the length of an array of cells. Since they are contiguous, you can visualize them as a contiguous set of arrays, e.g.
the delimiters of the arrays are || and the positions in the arrays are given by x. so this has 4 arrays, arr_0 of size 3, arr_1 of size 2 and so on.
If you place an item at position i, it will occupy positions i to i+(h-1), where h is the size of the item. E.g. if items are of size h=5 and you place an item at position 1, it occupies positions 1 through 5.
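To make the cell picture concrete, here is a small Python sketch that counts how many bins an item touches when it is placed at a given cell. The function name bins_spanned and the bin sizes in the usage note are my own illustration, and treating "number of bins spanned" as the cost is an assumption about the asker's cost model.

```python
from bisect import bisect_right
from itertools import accumulate


def bins_spanned(bin_sizes, start, h):
    """Number of bins touched by an item of size h placed at cell `start`."""
    # ends[j] is the index of the first cell after bin j.
    ends = list(accumulate(bin_sizes))
    if start < 0 or h <= 0 or start + h > ends[-1]:
        raise ValueError("item does not fit inside the bins")
    first_bin = bisect_right(ends, start)          # bin holding the first cell
    last_bin = bisect_right(ends, start + h - 1)   # bin holding the last cell
    return last_bin - first_bin + 1
```

With hypothetical bins of sizes 3, 2, 4 and 1, an item of size 5 placed at position 1 covers cells 1 through 5 and therefore touches three bins, so bins_spanned([3, 2, 4, 1], 1, 5) gives 3.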
One trick to use is to introduce the additional constraint that the items need to be inserted "in order", i.e. the first inserted item is the leftmost one, the second is the second-leftmost one, etc. This problem then has the same optimal solution as the original one (since we can just take the optimal solution and insert its items in order).
Consider pos(k-1,i) to be the optimal position to insert the k-th object,
given that the (k-1)-th object ended at (i-1), and opt(k-1,i) the optimal extra-cost of inserting the pieces k-1…N, given that the (k-1)-th object ended at (i-1).
Then pos(N-1,i) can be easily calculated by running through the cells and calculating the extra-cost (and even more easily by noting that at least one of the borders should match up with pos(N-1,i), but to make the analysis easier we will evaluate the extra-cost at each of the positions i…NumCells-h). opt(N-1,i) equals this extra-cost.
pos(N-2,i) = argmin_x extra_cost(x,i, pos(N-1,x+h)) + opt(N-1,x+h)
Where extra_cost(x,i,j) is the extra cost of inserting at x, given that the last inserted object ended at i-1 and the next inserted object will start at j.
And by substituting x = pos(N-2,i)
opt(N-2,i) = extra_cost(pos(N-2,i),i, pos(N-1,pos(N-2,i)+h)) + opt(N-1,pos(N-2,i)+h)
By induction, for all 0<=W<N-1
pos(W,i) = argmin_x extra_cost(x,i, pos(W+1,x+h)) + opt(W+1,x+h)
opt(W,i) = extra_cost(pos(W,i),i, pos(W+1, pos(W,i)+h)) + opt(W+1, pos(W,i)+h)
And your final result is given by the minimal over i of all opt(0,i).

Randomly sample a data set

I came across a Q that was asked in one of the interviews..
Q - Imagine you are given a really large stream of data elements (queries on google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?
I am looking for -
What does random sampling of a data set mean?
(I mean I can simply do a coin toss and select a string from input if outcome is 1 and do this until i have 1000 samples..)
What are things I need to consider while doing so? For example .. taking contiguous strings may be better than taking non-contiguous strings.. to rephrase - Is it better if i pick contiguous 1000 strings randomly.. or is it better to pick one string at a time like coin toss..
This may be a vague question.. I tried to google "randomly sample data set" but did not find any relevant results.
Binary sample/don't sample may not be the right answer.. suppose you want to sample 1000 strings and you do it via coin toss.. This would mean that approximately after visiting 2000 strings.. you will be done.. What about the rest of the strings?
I read this post -
which answers this Q quite clearly..
Let me put the summary here -
Assign a random number to every element as you see them in the stream, and then always keep the top 1,000 numbered elements at all times.
Make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream.
Start with i = 1,001. With what probability after the 1001'th step should element 1,001 (or any element for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001 you should take element 1,001.
If you choose to add it, then replace any element (say element #2) in the reservoir chosen randomly. The element #2 is definitely in the reservoir at step 1,000 and the probability of it getting removed is the probability of element 1,001 getting selected multiplied by the probability of #2 getting randomly chosen as the replacement candidate. That probability is 1,000/1,001 * 1/1,000 = 1/1,001. So, the probability that #2 survives this round is 1 - that or 1,000/1,001.
This can be extended to the i'th round: keep the i'th element with probability 1,000/i and, if you choose to keep it, replace a random element in the reservoir. The probability of any earlier element being in the reservoir before this step is 1,000/(i-1). The probability that it is removed is 1,000/i * 1/1,000 = 1/i. The probability that an element sticks around, given that it is already in the reservoir, is therefore (i-1)/i, and thus its overall probability of being in the reservoir after i rounds is 1,000/(i-1) * (i-1)/i = 1,000/i.
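The reservoir steps above can be sketched in Python as below. The function name reservoir_sample and the rng parameter are mine, added for illustration. Drawing one index j uniformly from [0, i) and replacing slot j when j < k is a compact way to "keep with probability k/i, then replace a uniformly chosen slot".

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Fill the reservoir with the first k elements.
            reservoir.append(item)
        else:
            # j < k happens with probability k/i; j is then a uniform slot.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

It makes a single pass and keeps only k items in memory; if the stream has fewer than k elements, it simply returns all of them.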
I think you have used the word infinite a bit loosely. The very premise of sampling is that every element has an equal chance to be in the sample, and that is only possible if you at least go through every element. So I would translate infinite to mean a large number, indicating that you need a single-pass solution rather than multiple passes.
Reservoir sampling is the way to go, though the analysis from #abipc seems to be in the right direction but is not completely correct.
It is easier if we are first clear on what we want. Imagine you have N elements (N unknown) and you need to pick 1000 elements. This means we need to devise a sampling scheme where the probability of any element being in the sample is exactly 1000/N, so each element has the same probability of being in the sample (no preference to any element based on its position in the original list). The scheme mentioned by #abipc works fine; the probability calculation goes like this:
After first step you have 1001 elements so we need to pick each element with probability 1000/1001. We pick the 1001st element with exactly that probability so that is fine. Now we also need to show that every other element also has the same probability of being in the sample.
p(any other element remaining in the sample)
  = 1 - p(that element is removed from the sample)
  = 1 - p(1001st element is selected) * p(the element is picked to be removed)
  = 1 - (1000/1001) * (1/1000)
  = 1000/1001
Great so now we have proven every element has a probability of 1000/1001 to be in the sample. This precise argument can be extended for the ith step using induction.
As far as I know, this class of algorithms is called reservoir sampling.
I know one of them from data mining, but I don't know its name:
1. Collect the first S elements in your storage, which has a maximum size of S.
2. Suppose the next element of the stream has number N.
3. With probability S/N keep the new element; otherwise discard it.
4. If you kept element N, replace one of the elements in the sample S, picked uniformly at random.
5. Set N = N+1, get the next element, and go to step 2.
It can be proved that at any step of such stream processing, your storage of size S contains each element seen so far with equal probability S/N_you_have_seen.
So, for example, S = 10:
S is a finite number;
N_you_have_seen can grow without bound.

Counting ways of placing coins on a grid

The problem requires us to find the number of ways of placing R coins on an N*M grid such that each row and column has at least one coin. The constraints are N, M < 200 and R < N*M. I initially thought of backtracking, but I came to realise that it would never finish in time. Can someone guide me to another solution (DP, closed-form formula)? Any pointers would be nice. Thanks.
According to OEIS sequence A055602 one possible solution to this is:
Let a(m, n, r) = Sum_{i=0..m} (-1)^i*binomial(m, i)*binomial((m-i)*n, r)
Answer = Sum_{i=0..N} (-1)^i*binomial(N, i)*a(M, N-i, R)
You will need to evaluate N+1 different values for a.
Assuming you have precomputed binomial coefficients, each evaluation of a is O(M) so the total complexity is O(NM).
This formula can be derived using the inclusion-exclusion principle twice.
a(m,n,r) is the number of ways of putting r coins on a grid of size m*n such that every one of the m columns is occupied, but not all the rows are necessarily occupied.
Inclusion-Exclusion turns this into the correct answer. (The idea is that we get our first estimate from a(M,N,R). This overestimates the correct answer because not all rows are occupied so we subtract cases a(M,N-1,R) where we only occupy N-1 rows. This then underestimates so we need to correct again...)
Similarly we can compute a(m,n,r) by considering b(m,n,r) which is the number of ways of placing r coins on a grid where we don't care about rows or columns being occupied. This can be derived simply from the number of ways of choosing r places in a grid size m*n , i.e. binomial(m*n,r). We use IE to turn this into the function a(m,n,r) where we know that all columns are occupied.
If you want to allow different conditions on the number of coins on each square, then you can just change b(m,n,r) to the appropriate counting function.
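As a sanity check, the two sums can be written out directly in Python. This is a minimal sketch that assumes at most one coin per square (as b(m,n,r) = binomial(m*n, r) implies) and uses math.comb instead of a precomputed binomial table; the function names are mine.

```python
from math import comb


def columns_covered(m, n, r):
    """a(m, n, r): r coins on an m-column, n-row grid with every column occupied."""
    return sum((-1) ** i * comb(m, i) * comb((m - i) * n, r)
               for i in range(m + 1))


def count_placements(N, M, R):
    """Ways to place R coins on N rows and M columns, every row and column occupied."""
    return sum((-1) ** i * comb(N, i) * columns_covered(M, N - i, R)
               for i in range(N + 1))
```

On a 2*2 grid this gives 2 placements for R = 2 (the two diagonals), 4 for R = 3 and 1 for R = 4, which matches counting by hand. Python integers are arbitrary precision, so the large intermediate values do not overflow.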
This is tough, but you can begin by working out how many ways there are to have at least one coin on each row and column (call them reserve coins). The answer will be the product of #1 (n! / r! (n - r)!), where #2 n = N*M - NUMBER_OF_RESERVE_COINS and #3 r = (R - NUMBER_OF_RESERVE_COINS), for #4 each arrangement of reserving one coin on each row/column.
#4 is where the trickier stuff takes place. For N*M where N!=M, abs(N-M) tells you how many reserve coins will be on a single row/column. I'm having trouble identifying the correct way of proceeding to the next step, mainly due to lack of time (though I can return to this on the weekend), but I hope I have provided you with useful information, and that, if what I have said is correct, you will be able to complete the process.

Finding the Nth largest value in a group of numbers as they are generated

I'm writing a program that needs to find the Nth largest value in a group of numbers. These numbers are generated by the program, but I don't have enough memory to store N numbers. Is there a better upper bound than N that can be achieved for storage? The upper bound for the size of the group of numbers (and for N) is approximately 100,000,000.
Note: The numbers are decimals and the list can include duplicates.
[Edit]: My memory limit is 16 MB.
This is a multipass algorithm (therefore, you must be able to generate the same list multiple times, or store the list off to secondary storage).
First pass:
Find the highest value and the lowest value. That's your initial range.
Passes after the first:
Divide the range up into 10 equally spaced bins. We don't need to store any numbers in the bins. We're just going to count membership in the bins. So we just have an array of integers (or bigints, whatever can accurately hold our counts). Note that 10 is an arbitrary choice for the number of bins. Your sample size and distribution will determine the best choice.
Spin through each number in the data, incrementing the count of whichever bin holds the number you see.
Figure out which bin has your answer, and add how many numbers are above that bin to a count of numbers above the winning bin.
The winning bin's top and bottom range are your new range.
Loop through these steps again until you have enough memory to hold the numbers in the current bin.
Last pass:
You should know how many numbers are above the current bin by now.
You have enough storage to grab all the numbers within your range of the current bin, so you can spin through and grab the actual numbers. Just sort them and grab the correct number.
Example: if the range you see is 0.0 through 1000.0, your bins' ranges will be:
[0.0 - 100.0]
(100.0 - 200.0]
(200.0 - 300.0]
...
(900.0 - 1000.0]
If you find through the counts that your number is in the (100.0 - 200.0] bin, your next set of bins will be:
(100.0 - 110.0]
(110.0 - 120.0]
...
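A rough Python sketch of this multipass histogram idea follows. Several things here are my assumptions rather than part of the answer: make_stream is a hypothetical callable that regenerates the same numbers on every call, capacity stands for how many numbers fit in memory, and the sketch also records the smallest and largest value seen in each bin so that the next range is exactly the winning bin's contents (this sidesteps floating-point trouble at bin boundaries and handles long runs of duplicates).

```python
def nth_largest(make_stream, n, bins=10, capacity=1000):
    """Nth largest value of a regenerable stream, using counting passes."""
    # First pass: find the overall range and the element count.
    lo = hi = None
    total = 0
    for x in make_stream():
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
        total += 1
    if not 1 <= n <= total:
        raise ValueError("n is out of range")

    above = 0          # how many numbers are known to lie above [lo, hi]
    in_range = total   # how many numbers lie inside [lo, hi]
    while in_range > capacity and lo < hi:
        counts = [0] * bins
        mins = [None] * bins
        maxs = [None] * bins
        for x in make_stream():
            if lo <= x <= hi:
                idx = min(int((x - lo) / (hi - lo) * bins), bins - 1)
                counts[idx] += 1
                mins[idx] = x if mins[idx] is None else min(mins[idx], x)
                maxs[idx] = x if maxs[idx] is None else max(maxs[idx], x)
        # Walk down from the top bin until the Nth largest falls inside one.
        idx = bins - 1
        while above + counts[idx] < n:
            above += counts[idx]
            idx -= 1
        lo, hi, in_range = mins[idx], maxs[idx], counts[idx]

    if lo == hi:
        return lo  # only copies of a single value remain
    # Last pass: the survivors fit in memory, so sort them and index.
    survivors = sorted((x for x in make_stream() if lo <= x <= hi), reverse=True)
    return survivors[n - above - 1]
```

Memory use is a few arrays of length bins plus at most capacity numbers in the final pass; the price is one full regeneration of the data per narrowing step.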
Another multipass idea:
Simply do a binary search. Choose the midpoint of the range as the first guess. Your passes just need to do an above/below count to determine the next estimate (which can be weighted by the count, or a simple average for code simplicity).
Are you able to regenerate the same group of numbers from start? If you are, you could make multiple passes over the output: start by finding the largest value, restart the generator, find the largest number smaller than that, restart the generator, and repeat this until you have your result.
It's going to be a real performance killer, because you have a lot of numbers and a lot of passes will be required - but memory-wise, you will only need to store 2 elements (the current maximum and a "limit", the number you found during the last pass) and a pass counter.
You could speed it up by using your priority queue to find the M largest elements (choosing some M that you are able to fit in memory), allowing you to reduce the number of passes required to N/M.
If you need to find, say, the 10th largest element in a list of 15 numbers, you could save time by working the other way around. Since it is the 10th largest element, that means there are 15-10=5 elements smaller than this element - so you could look for the 6th smallest element instead.
This is similar to another question -- C Program to search n-th smallest element in array without sorting? -- where you may get some answers.
The logic will work for Nth largest/smallest search similarly.
Note: I am not saying this is a duplicate of that.
Since you have a lot (nearly 1 billion?) numbers, here is another way for space optimization.
Let's assume your numbers fit in 32-bit values, so about 1 billion of them would require something close to 4 GB of space (32 Gbit). Now, if you can afford about 128MB of working memory, we can do this in one pass.
Imagine a 1 billion bit-vector stored as an array of 32-bit words
Let it be initialized to all zeros
Start running through your numbers and keep setting the correct bit position for the value of the number
When you are done with one pass, count set bits starting from the high end of this bit vector until you reach the Nth set bit (counting from the low end would give the Nth smallest instead)
That bit's position gives you the value for your Nth largest number
You have actually sorted all the numbers in the process (however, count of duplicates is not tracked)
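A small Python sketch of the bit-vector idea, under assumptions of mine: the values are non-negative integers no larger than a known max_value, and, as noted above, duplicates collapse, so the result is the Nth largest distinct value. The function name is hypothetical.

```python
def nth_largest_distinct(values, n, max_value):
    """Nth largest distinct value among integers in [0, max_value], or None."""
    bits = bytearray(max_value // 8 + 1)  # one bit per possible value
    for v in values:
        bits[v >> 3] |= 1 << (v & 7)      # mark value v as seen
    seen = 0
    # Scan from the largest possible value downwards.
    for v in range(max_value, -1, -1):
        if bits[v >> 3] & (1 << (v & 7)):
            seen += 1
            if seen == n:
                return v
    return None  # fewer than n distinct values were present
```

The memory cost is max_value / 8 bytes regardless of how many numbers are streamed through, which is why this only pays off when the value range is small compared with the amount of data.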
If I understood well, the upper bound memory usage for your program is O(N) (possibly N+1). You can maintain a list of the generated values that are greater than the current X (the Nth largest value so far) ordered by lowest first. As soon as a new greater value is generated, you can replace the current X by the first element of the list and insert the just generated value to its corresponding position in the list.
sort -n | uniq -c and the Nth should be the Nth row

Best item from list based on 3 variables

Say I have the following for a bunch of items.
item position
item size
item length
A smaller position is better, but a larger length and size are better.
I want to find the item that has the smallest position, largest length and size.
Can I simply calculate a value such as (total - position) * size * length for each item, and then find the item with the largest value? Would it be better to work off percentages?
Either add a fourth item, which is your calculated value of 'goodness', and sort by that, OR, if your language of choice allows, override the comparison operators for sorting to use your formula and then sort. Note that the latter approach means that the function to determine betterness will be applied multiple times per item in the list, but it has the advantage of making a procedural comparison easy (e.g. first look at the position, then if that is equal, look at size, then length), although this could also be expressed as a formula resulting in a single number to sort by.
As for your proposed formula, note that each item has the same numerical weight even though they are measured on completely unrelated scales. Furthermore, all items with either position=total, size=0 or length=0 evaluate to zero.
If what you want is that position is the most important thing, but given equal positions, size is the next most important thing, but given equal positions and sizes, then go by length, this can be formulated into a single number as follows:
(P-position)*(S*L) + size*L + length
where L is a magic number that is greater than the maximum possible length value, S is a number greater than the maximum possible size value, and P is a number greater than the maximum possible position value.
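A short Python sketch of this priority ordering, assuming non-negative integer fields; the Item type and the bounds passed as P, S and L are illustrative choices of mine. It also shows that comparing tuples gives the same ordering without any magic numbers.

```python
from typing import NamedTuple


class Item(NamedTuple):
    position: int
    size: int
    length: int


def lexicographic_score(item, P, S, L):
    """Single number that ranks by position first, then size, then length."""
    return (P - item.position) * (S * L) + item.size * L + item.length


def best_by_score(items, P, S, L):
    return max(items, key=lambda it: lexicographic_score(it, P, S, L))


def best_by_tuple(items):
    # Same priority order: smaller position, then larger size, then larger length.
    return max(items, key=lambda it: (-it.position, it.size, it.length))
```

For items (2,5,7), (1,3,9), (1,4,2) and (1,4,8) with P = S = L = 10, both functions pick (1,4,8): position 1 beats position 2, size 4 beats size 3, and length 8 breaks the remaining tie.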
If, on the other hand, what you want is some scale where the items are of whatever relative importances, one possible formula looks like this:
((P-position)/P)*pScale * (size/S)*sScale * (length/L)*lScale
In this version, P, S and L have much the same definitions as before, but it is very important that the values of P, S and L are meaningful in a compatible way, e.g. all very close to the expected maximum values. pScale, sScale and lScale are there so you can essentially specify the relative importance of each item. They could all be 1 if all items are equally important, in which case you could leave them out entirely.
As previously answered, though, there are also a potentially infinite number of other ways you could choose to code this. As a random example, for large sizes, length could become less important; those possibilities would require much additional thought as to what is actually meant by such a vague statement.
