I have an array which contains a list of different sizes of materials: {4,3,4,1,7,8}. However, a bin can accommodate materials up to size 10. I need to find out the minimum number of bins needed to pack all the elements in the array.
For the above array, you can pack the items into 3 bins, divided as follows: {4,4,1}, {3,7}, {8}. There are other possible arrangements that also fit into three bins, but it cannot be done with fewer.
I am trying to solve this problem through recursion in order to understand it better.
I am using this DP formulation (page 20 of the pdf file):
Consider an input (n_1, ..., n_k) with n = ∑ n_j items.
Determine the set of k-tuples (subsets of the input) that can be packed into a single bin, that is, all tuples (q_1, ..., q_k) for which OPT(q_1, ..., q_k) = 1. Denote this set by Q. For each k-tuple q in Q, we have OPT(q) = 1.
Calculate the remaining values by using the recurrence: OPT(i_1, ..., i_k) = 1 + min_{q in Q} OPT(i_1 - q_1, ..., i_k - q_k).
I have written the code, and it works fine for a small data set. But if I increase the size of my array to more than 6 elements, it becomes extremely slow -- it takes about 25 seconds to solve an array containing 8 elements. Can you tell me if there's anything wrong with the algorithm? I don't need an alternative solution --- I just need to know why this is so slow, and how it can be improved.
Here is the code I have written in C++:
void recCutStock(Vector<int> & requests, int numStocks)
{
    if (requests.size() == 0)
    {
        if (numStocks <= minSize)
        {
            minSize = numStocks;
        }
        // cout<<"GOT A RESULT : "<<numStocks<<endl;
        return;
    }
    else
    {
        if (numStocks + 1 < minSize) // minSize is a global variable initialized with a big val
        {
            Vector<int> temp;
            Vector<Vector<int> > posBins;
            getBins(requests, temp, 0, posBins); // 2-d array (stored in posBins) containing all possible single-bin formations
            for (int i = 0; i < posBins.size(); i++)
            {
                Vector<int> subResult;
                reqMinusPos(requests, subResult, posBins[i]); // removes the items of posBins[i] from the request array
                // displayArr(subResult);
                recCutStock(subResult, numStocks + 1);
            }
        }
        else return;
    }
}
The getBins function is as follows:
void getBins(Vector<int> & requests, Vector<int> current, int index, Vector<Vector<int> > & bins)
{
    if (index == requests.size())
    {
        if (sum(current, requests) <= stockLength && sum(current, requests) > 0)
        {
            bins.add(current);
            // printBins(current, requests);
        }
        return;
    }
    else
    {
        getBins(requests, current, index + 1, bins);   // exclude requests[index]
        current.add(index);                            // include requests[index] (stored by index)
        getBins(requests, current, index + 1, bins);
    }
}
The dynamic programming algorithm is O(n^{2k}) where k is the number of distinct items and n is the total number of items. This can be very slow irrespective of the implementation. Typically, when solving an NP-hard problem, heuristics are required for speed.
I suggest you consider Next Fit Decreasing Height (NFDH) and First Fit Decreasing Height (FFDH) from Coffman et al. They are 2-optimal and 17/10-optimal, respectively, and they run in O(n log n) time.
I recommend you first try NFDH: sort the items in decreasing order and store the result in a linked list, then repeatedly try to insert the items starting from the beginning (largest values first) until you have filled the bin or there are no more items that can be inserted. Then go to the next bin, and so on.
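For illustration only (my own sketch, not code from the references, and not guaranteed optimal), here is a minimal Python version of that greedy fill, applied to the array from the question:

    def greedy_decreasing(items, capacity):
        # sort once in decreasing order, then fill one bin at a time with every
        # remaining item that still fits, largest first
        remaining = sorted(items, reverse=True)
        bins = []
        while remaining:
            space, packed, leftover = capacity, [], []
            for size in remaining:
                if size <= space:
                    packed.append(size)
                    space -= size
                else:
                    leftover.append(size)
            bins.append(packed)
            remaining = leftover
        return bins

    print(greedy_decreasing([4, 3, 4, 1, 7, 8], 10))   # [[8, 1], [7, 3], [4, 4]] -> 3 bins

Here the heuristic happens to reach the optimum of 3 bins; in general such greedy heuristics are only approximations.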
References:
Owen Kaser, Daniel Lemire, Tag-Cloud Drawing: Algorithms for Cloud Visualization, Tagging and Metadata for Social Information Organization (WWW 2007), 2007. (See Section 5.1 for a related discussion.)
E. G. Coffman, Jr., M. R. Garey, D. S. Johnson, and R. E. Tarjan. Performance bounds for level-oriented two-dimensional packing algorithms. SIAM J. Comput., 9(4):808–826, 1980.
But if I increase the size of my array to more than 6 elements, it becomes extremely slow -- it takes about 25 seconds to solve an array containing 8 elements. Can you tell me if there's anything wrong with the algorithm?
That's normal with brute force. Brute force does not scale at all.
In your case: the bin size is 10 and the total item size is 27, so at least ⌈27/10⌉ = 3 bins are needed (three bins give 30 units of capacity).
You could try first fit decreasing, and it works!
More ways to improve: With 3 bins and 27 size units, you will have 3 units of space left over. That means you can ignore the item of size 1: if you fit the others into 3 bins, it will fit somewhere. That leaves you with 26 size units, which means you will have at least two units empty in one bin. If you had items of size 2, you could ignore them as well because they would fit. If you had two items of size 2, you could ignore items of size 3 as well.
You have items of sizes 7 and 3, which together are exactly the bin size. There is always an optimal solution in which these two share a bin: if the 7 were packed with other items, those items would total 3 or less, so you could swap them with the 3 if it is in another bin.
Another method: if you have k items >= bin size / 2 (you can't have two items equal to bin size / 2 at this point), then you need at least k bins. This might increase the minimum number of bins you estimated initially, which in turn increases the guaranteed empty space across all bins, which increases the minimum amount of leftover space in some bin. If, for j = 1, 2, ..., k, you can fit into the bin of the j-th large item all of the items that could possibly share a bin with it, then this is optimal. For example, if you had sizes 8, 1, 1 but no size 2, then 8+1+1 in a bin would be optimal. Since you have 8 + 4 + 4 left, and nothing fits with the 8, "8" alone in its bin is optimal. (If you had items of sizes 8, 8, 8, 2, 1, 1, 1 and nothing else of size 2, packing them into three bins would be optimal.)
More things to try if you have large items: If you have a large item, and the largest item that fits with it is as large or larger than any combination of items that would fit, then combining them is optimal. If there is more space, then this can be repeated.
So all in all, a bit of thinking reduced the problem to fitting two items of sizes 4, 4 into one or more bins. With larger problems, every little bit helps.
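The reasoning above uses two easy lower bounds: the total size divided by the bin size (rounded up), and the number of items larger than half a bin. A tiny Python sketch (my own helper, not from this answer):

    import math

    def lower_bound(items, capacity):
        by_total = math.ceil(sum(items) / capacity)          # 27 units / 10 -> at least 3 bins
        by_big = sum(1 for x in items if x > capacity / 2)   # 7 and 8 cannot share a bin -> at least 2
        return max(by_total, by_big)

    print(lower_bound([4, 3, 4, 1, 7, 8], 10))   # 3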
I've written a bin-packing solution and I can recommend best-fit with random order.
After doing what you can to reduce the problem, you are left with the problem to fit n items into k bins if possible, into k + 1 bins otherwise, or into k + 2 bins etc. If k bins fail, then you know that you will have more empty space in an optimal solution of k + 1 bins, which may make it possible to remove more small items, so that's the first thing to do.
Then you try some simple methods: First fit descending, next fit descending. I tried a reasonably fast variation of first fit descending: As long as the two largest items fit, add the largest item. Otherwise, find the single item or the largest combination of two items that fit, and add the single item or the larger of that combination. If any of these algorithms fits your items into k bins, you solved the problem.
And eventually you need brute force. You can decide: Do you attempt to fit everything into k bins, or do you attempt to prove it isn't possible? You will have some empty space to play with. Let's say 10 bins of size 100 and items of total size 936, that would leave you 64 units of empty space. If you put only items of size 80 into your first bin, then 20 of your 64 units are already gone, making it much harder to find a solution from there. So you don't try things in random order. You first try combinations for the first bin that fill it completely or close to completely. Since small items make it easier to fill containers completely you try not to use them in the first bins but leave them for later, when you have less choice of items. And when you've found items to put into a bin, try one of the simple algorithms to see if they can finish it. For example, if first fit descending put 90 units into the first bin, and you just managed to put 99 units in there, it is quite possible that this is enough improvement to fit everything.
On the other hand, if there is very little space (10 bins and a total item size of 995, for example), you may want to prove that fitting the items is not possible. In that case you don't need to care about optimising the algorithm to find a solution quickly, because you need to try all combinations to see that none of them work. Obviously, with these numbers you need to fit at least 95 units into the first bin, and so on; that might make it easy to rule out solutions quickly. If you have proved that k bins are not achievable, then k+1 bins should be a much easier target.
I have a large collection of several million sets, C. The elements of my sets come from a universe of about 2000 possible elements. I need to know, for a given set, s, which set in C has the largest intersection with s? (Or the k sets in C with the k-largest intersections). I will be making many of these queries, sequentially, for different s.
I know that the obvious way to do this is to loop over every set in C, compute the intersection, and take the max. Are there any smart data structures / programming tricks that can speed up my search? It would be great if I could do this faster than O(|C|) per query.
EDIT: approximate answers would be alright too
I don't think there's a clever data structure that will help with asymptotic performance. But this is a perfect map-reduce problem. A GPGPU would do nicely. For a universe of 2048 elements, a set stored as a bitmap is only 256 bytes, so 4 million sets is only a gigabyte. Even a modestly specced Nvidia card has that. E.g. programming in CUDA, you'd copy C to graphics card RAM, map a chunk of the gigabyte to each GPU core for searching, and then reduce across cores to find the final answer. This ought to take on the order of a very few milliseconds. Not fast enough? Just buy hotter hardware.
If you re-phrase your question along these lines, you'll probably get answers from experts in this kind of programming, which I'm not.
One simple trick is to sort the list of sets C in decreasing order by size, then proceed with brute force intersection tests as usual. As you go along, keep track of the set b with the biggest intersection so far. If you find a set whose intersection with the query set s has size |s| (or equivalently, has intersection equal to s -- use whichever of these tests is faster), you can immediately stop and return it as this is the best possible answer. Otherwise, if the next set from C has fewer than |b| elements, you can immediately stop and return b. This can easily be generalised to finding the top k matches.
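A minimal Python sketch of this, assuming C has already been sorted once by decreasing size (e.g. C_sorted = sorted(C, key=len, reverse=True)) and the sets are plain Python sets; the two early exits are the ones described above:

    def best_match(C_sorted, s):
        best, best_size = None, -1
        for c in C_sorted:
            if len(c) < best_size:        # no later (smaller) set can beat the best so far
                break
            inter = len(c & s)
            if inter == len(s):           # c contains all of s: cannot do better
                return c
            if inter > best_size:
                best, best_size = c, inter
        return best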
I don't see any way to do this in less than O(C) per query, but I have some ideas on how to maximize efficiency. The idea is basically to build a lookup table for each element. If some elements are rare and some are common, you can have positive and negative lookup tables:
# s: the query, an array of ~2000 booleans (one per universe element)
# sign[i]: +1 if the ith element uses a positive lookup table, -1 if it uses a negative one
# sets[i]: the list of set indices (into C) that the ith element belongs / doesn't belong to
# num_sets: the number of sets in C
def query(s):
    overlaps = [0] * num_sets      # one running score per set in C
    for i in range(len(s)):
        if s[i]:
            for j in sets[i]:
                overlaps[j] += sign[i]
    return overlaps.index(max(overlaps))
Especially if many of your elements are of widely differing probabilities (as you said), this approach should save you some time: very rare or very common elements can be dealt with almost instantly.
To further optimize: you can sort the structure so that the elements that are most common/most rare are dealt with first. After you have done the first e.g. 3/4, you can do a quick pass to see if the closest matching set is so far ahead of the next set that it is not necessary to continue, though again whether that is worthwhile depends on the details of your data's distribution.
Yet another refinement: make sets[i] one of two possible structures. If the element is very rare or very common, sets[i] is just a list of the sets that the ith element is in (or not in). However, suppose the ith element is in half the sets. Then sets[i] is a list of indices half as long as the number of sets, and looping through it to increment overlaps is wasteful. So have a third value for sign[i]: if sign[i] == 0, the ith element is relatively close to 50% commonality (this may just mean between 5% and 95%, or anything else), and instead of a list of the sets in which it appears, it is simply an array of 1's and 0's with length equal to |C|. Then you just add that array in its entirety to overlaps, which is faster.
Put all of your elements, from the million sets into a Hashtable. The key will be the element, the value will be a set of indexes that point to a containing set.
HashSet<Element>[] AllSets = ...

// preprocess
Hashtable AllElements = new Hashtable(2000);
for (var index = 0; index < AllSets.Length; index++) {
    foreach (var elm in AllSets[index]) {
        if (!AllElements.ContainsKey(elm)) {
            AllElements.Add(elm, new HashSet<int>() { index });
        } else {
            ((HashSet<int>)AllElements[elm]).Add(index);
        }
    }
}

public List<HashSet<Element>> TopIntersect(HashSet<Element> set, int top = 1) {
    // <index, count>
    Dictionary<int, int> counts = new Dictionary<int, int>();
    foreach (var elm in set) {
        var setIndices = AllElements[elm] as HashSet<int>;
        if (setIndices != null) {
            foreach (var index in setIndices) {
                if (!counts.ContainsKey(index)) {
                    counts.Add(index, 1);
                } else {
                    counts[index]++;
                }
            }
        }
    }
    return counts.OrderByDescending(kv => kv.Value)
        .Take(top)
        .Select(kv => AllSets[kv.Key]).ToList();
}
Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.
One of the fastest ways to make many with replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to only store two items from the original list per bin, and thus can represent the split with a single percentage.
Let us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1).
To create the alias lookup:
Normalize the weights such that they sum to 1.0. (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) This is the probability of choosing each weight.
Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)
If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
Repeat steps 3 and 4 until none of the weight from the original distribution remains to be assigned.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned
At runtime:
Get a U(0,1) random number, say binary 0.001100000
bit-shift it by lg2(|p|), finding the partition index. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.
If the partition is split, use the decimal portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
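As a rough illustration of the runtime step only (not the linked code), here is what it could look like in Python, assuming the table has already been built. Continuing the hand construction above for the five equal weights gives one possible table; treat the values as illustrative. Because |p| is a power of two, scaling the uniform draw by |p| plays the role of the bit shift:

    import random

    # (primary, alternate, cutoff): return `primary` when the fractional part of the
    # scaled draw is below `cutoff`, otherwise `alternate`
    partitions = [
        ('a', None, 1.0), ('a', 'b', 0.6), ('b', None, 1.0), ('b', 'c', 0.2),
        ('c', 'd', 0.8), ('d', None, 1.0), ('d', 'e', 0.4), ('e', None, 1.0),
    ]

    def alias_sample(partitions):
        r = random.random() * len(partitions)   # scaling by a power of two == the bit shift
        idx = int(r)                            # integer part: which partition
        primary, alternate, cutoff = partitions[idx]
        return primary if (r - idx) < cutoff else alternate

    print(alias_sample(partitions))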
A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
import heapq
import math
import random
def WeightedSelectionWithoutReplacement(weights, m):
    elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
    return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.
Here's what I came up with for weighted selection without replacement:
import random

def WeightedSelectionWithoutReplacement(l, n):
    """Selects without replacement n random elements from a list of (weight, item) tuples."""
    l = sorted((random.random() * x[0], x[1]) for x in l)
    return l[-n:]
This is O(m log m) on the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
import bisect
import random

def WeightedSelectionWithReplacement(l, n):
    """Selects with replacement n random elements from a list of (weight, item) tuples."""
    cuml = []
    total_weight = 0.0
    for weight, item in l:
        total_weight += weight
        cuml.append((total_weight, item))
    # bisect against a 1-tuple so the float key compares cleanly with the
    # (cumulative_weight, item) tuples stored in cuml
    return [cuml[bisect.bisect(cuml, (random.random() * total_weight,))][1]
            for _ in range(n)]
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.
I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.
It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we walk through it, and for any underpopulated bin which would receive excess hits, assign the excess to an overpopulated bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that a bucket has already been processed.
Here is a minimal python implementation, based on the C implementation here
import random

def prep(weights):
    data_sz = len(weights)
    factor = data_sz / float(sum(weights))
    data = [[w * factor, i] for i, w in enumerate(weights)]
    big = 0
    while big < data_sz and data[big][0] <= 1.0:
        big += 1
    for small, bucket in enumerate(data):
        if bucket[1] != small:
            continue
        excess = 1.0 - bucket[0]
        while excess > 0:
            if big == data_sz:
                break
            bucket[1] = big
            bucket = data[big]
            bucket[0] -= excess
            excess = 1.0 - bucket[0]
            if excess >= 0:
                big += 1
                while big < data_sz and data[big][0] <= 1:
                    big += 1
    return data

def sample(data):
    r = random.random() * len(data)
    idx = int(r)
    return data[idx][1] if r - idx > data[idx][0] else idx
Example usage:
TRIALS = 1000
weights = [20, 1.5, 9.8, 10, 15, 10, 15.5, 10, 8, .2]
samples = [0] * len(weights)
data = prep(weights)
for _ in range(int(sum(weights) * TRIALS)):
    samples[sample(data)] += 1
result = [float(s) / TRIALS for s in samples]
err = [a - b for a, b in zip(result, weights)]
print(result)
print([round(e, 5) for e in err])
print(sum([e * e for e in err]))
The following is a description of random weighted selection of an element of a
set (or multiset, if repeats are allowed), both with and without replacement in O(n) space
and O(log n) time.
It consists of implementing a binary search tree, sorted by the elements to be
selected, where each node of the tree contains:
the element itself (element)
the un-normalized weight of the element (elementweight), and
the sum of all the un-normalized weights of the left-child node and all of
its children (leftbranchweight).
the sum of all the un-normalized weights of the right-child node and all of
its children (rightbranchweight).
Then we randomly select an element from the BST by descending down the tree. A
rough description of the algorithm follows. The algorithm is given a node of
the tree. Then the values of leftbranchweight, rightbranchweight,
and elementweight of the node are summed, and the weights are divided by this
sum, resulting in the values leftbranchprobability,
rightbranchprobability, and elementprobability, respectively. Then a
random number between 0 and 1 (randomnumber) is obtained.
if the number is less than elementprobability,
remove the element from the BST as normal, updating leftbranchweight
and rightbranchweight of all the necessary nodes, and return the
element.
else if the number is less than (elementprobability + leftbranchprobability)
recurse on leftchild (run the algorithm using leftchild as node)
else
recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation
of a BST is not attempted here; rather, it is hoped that this answer will help
those who really need fast weighted selection without replacement (like I do).
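To make the descent concrete, here is a minimal Python sketch of the selection step only (names are mine; it does not implement insertion, removal or rebalancing, so it samples with replacement):

    import random

    class Node:
        def __init__(self, element, weight, left=None, right=None):
            self.element, self.weight = element, weight
            self.left, self.right = left, right
            self.leftbranchweight = subtree_weight(left)
            self.rightbranchweight = subtree_weight(right)

    def subtree_weight(node):
        return 0.0 if node is None else node.weight + node.leftbranchweight + node.rightbranchweight

    def weighted_select(node):
        # one root-to-leaf walk; an element is returned with probability
        # proportional to its un-normalized weight
        r = random.random() * subtree_weight(node)
        if r < node.weight:
            return node.element
        elif r < node.weight + node.leftbranchweight:
            return weighted_select(node.left)
        else:
            return weighted_select(node.right)

    # Example: elements 'a', 'b', 'c' with weights 1, 2 and 3
    tree = Node('b', 2.0, left=Node('a', 1.0), right=Node('c', 3.0))
    print(weighted_select(tree))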
This is an old question for which numpy now offers an easy solution, so I thought I would mention it: numpy.random.choice allows the sampling to be done with or without replacement and with given weights.
Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with the probability distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using the numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True, you have a sampling with replacement.
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice
We faced the problem of randomly selecting K validators out of N candidates once per epoch, proportionally to their stakes. But this gives us the following problem:
Imagine probabilities of each candidate:
0.1
0.1
0.8
The probabilities of each candidate after 1,000,000 selections of 2 out of 3 without replacement became:
0.254315
0.256755
0.488930
You should know that those original probabilities are not achievable for a 2-of-3 selection without replacement.
But we want the initial probabilities to be the profit distribution probabilities; otherwise it makes small candidate pools more profitable. So we realized that random selection with replacement would help us: randomly select >K of N, and also store the weight of each validator for reward distribution:
// n = number of candidates, m = number of distinct validators to select
// likehoods[i] = stake (weight) of candidate i, likehoodsSum = sum of all likehoods
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;
for (int j = 0; validators.size() < m; j++) {
    int value = rand() % likehoodsSum;
    for (int i = 0; i < n; i++) {
        if (value < likehoods[i]) {
            if (weights[i] == 0) {
                validators.push_back(i);
            }
            weights[i]++;
            totalWeights++;
            break;
        }
        value -= likehoods[i];
    }
}
It gives an almost original distribution of rewards on millions of samples:
0.101230
0.099113
0.799657
The Problem statement:
Assurance Company of Moving (ACM) is a company that moves things for people. Recently, some schools wanted to move their computers to another place, so they asked ACM to help them. One school reserves K trucks for the move, and it has N computers to move. In order not to waste the trucks, the school asks ACM to use all of them; that is to say, there must be some computers in each truck, and there are no empty trucks. ACM wants to know how many partition schemes exist for moving N computers with K trucks, and asks you to compute the number of different schemes for given N and K. You needn't care about the order. For example, for N=7, K=3, the following 3 partition instances are regarded as the same one and should be counted as one scheme: "1 1 5", "1 5 1", "5 1 1". Each truck can carry an almost unlimited number of computers!!
Save Time :
You have to count how many sequences a[1..k] exist such that:
1) a[1] + a[2] + ... + a[k] = N, where permutations don't matter.
My O(N*K^2) solution (I cannot figure out how to improve on it):
#include<assert.h>
#include<stdio.h>
#include<algorithm>
using namespace std;

int DP[5001][5001];

void ini()
{
    int i, j, k;
    DP[0][0] = 1;
    for (k = 1; k <= 500; k++)
        for (j = 1; j <= 500; j++)
            for (i = 1; i <= 500; i++)
            {
                DP[i][j] += j >= k ? DP[i-1][j-k] : 0;
                DP[i][j] %= 1988;
            }
    return;
}

int main()
{
    ini();
    int N, K, i, j;
    while (1)
    {
        scanf("%d%d", &N, &K);
        if (N == 0 && K == 0)
            return 0;
        int i;
        if (DP[K][N] == 0)
        { assert(0); }
        printf("%d\n", DP[K][N]);
    }
    return 0;
}
Explanation of my solution: DP[i][j] represents the number of ways I can have a total of j computers using i trucks only.
The k represents the number of computers I am currently dealing with; that is how I avoid counting permutations!
How can I improve it to O(N*K)?
Problem constraints
N (1<=N<=5000) and K(1<=K<=N)
Problem Link: Problem Spoj
Just say that you have K gift boxes and N chocolates.
I will start with a recursive solution that is really easy to convert to an iterative one.
The key to avoiding repetitions is distributing chocolates in ascending order (descending also works). So if you have 7 chocolates and I put 2 chocolates in the first box, I will put at least 2 in the second box. WHY? This helps in avoiding repetitions.
From now onwards, TCL = totalChocolatesLeft & TBL = totalBinsLeft.
So S(TCL,TBL) = S(TCL-TBL,TBL) + S(TCL,TBL-1);
You have to call the above expression starting with S(n-k, k).
Why? Because all boxes need at least one item, so first put `1` chocolate in each box.
Now you are left with only `n-k` chocolates.
That's all! that's the DP recursion.
How does it work?
So in order to remove repetitions we are maintaining the ascending order.
What is the easiest way to maintain the ascending order?
If you put 1 chocolate in the ith box, put 1 in all boxes after it: i+1, i+2, ..., k.
So after keeping chocolate in a gift box, you have two choices :
Either you want to continue with current box :
S(TCL-TBL,TBL) covers this
or you move on to the next box and never consider this box again:
S(TCL,TBL-1) covers this.
The equivalent DP would have time complexity O(N*K).
This problem is equivalent to placing n-k identical balls (after already placing one ball in each cell to make sure it's not empty) in k identical cells.
This can be solved using the recurrence formula:
D(n,0) = 0 n > 0
D(n,k) = 0 n < 0
D(n,1) = 1 n >= 0
D(n,k) = D(n,k-1) + D(n-k,k)
Explanation:
Stop clauses:
D(n,0) - no way to put n>0 balls in 0 cells
D(n<0,k) - no way to put negative number of balls in k cells
D(n,1) - one way to put n balls in 1 cell: all in this cell
Recurrence:
We have two choices.
We either have one (or more) empty cell, so we recurse with the same problem, and one less cell: D(n,k-1)
Otherwise, we have no empty cells, so we put one ball in each cell, recurse with the same number of cells and k less balls, D(n-k,k)
The two possibilities are of disjoint sets, so the union of both sets is the summation of the two sizes, thus D(n,k) = D(n,k-1) + D(n-k,k)
The above recursive formula is easy to compute in O(1) (assuming O(1) arithmetic) if the "lower" problems are known, and the DP solution needs to fill a table of size (n+1)*(k+1), so this solution is O(nk).
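For concreteness, here is a small bottom-up Python sketch of this recurrence (helper names are mine), applied to the original N-computers-in-K-trucks question by first putting one computer in each truck:

    def count_partitions(n, k):
        # D[i][j] = number of ways to place i identical balls into at most j identical cells
        D = [[0] * (k + 1) for _ in range(n + 1)]
        for j in range(1, k + 1):
            D[0][j] = 1                                # zero balls: one (empty) way
        for i in range(1, n + 1):
            for j in range(1, k + 1):
                D[i][j] = D[i][j - 1] + (D[i - j][j] if i >= j else 0)
        return D[n][k]

    def schemes(N, K):
        return count_partitions(N - K, K)              # one computer per truck first

    print(schemes(7, 3))   # 4: (5,1,1), (4,2,1), (3,3,1), (3,2,2)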
As we know from programming, sometimes a slight change in a problem can
significantly alter the form of its solution.
Firstly, I want to create a simple algorithm for solving
the following problem and classify it using big-theta notation:
Divide a group of people into two disjoint subgroups
(of arbitrary size) such that the
difference in the total ages of the members of
the two subgroups is as large as possible.
Now I need to change the problem so that the desired
difference is as small as possible and classify
my approach to the problem.
Well, first of all I need to create the initial algorithm.
For that, should I do some kind of sorting to separate the two groups, and how am I supposed to continue?
EDIT: for the first problem, we have ruled out the possibility of a subgroup being empty. So all we have to do is a linear search to find the minimum age and put it on its own into set B. Set A then has all the other ages. That gives the largest possible difference between the total ages of the two sets.
The way you described the first problem, it is trivial in the sense that it only requires you to find the minimum element (in case the subgroups should each contain at least 1 member); otherwise it is already solved.
The second problem can be solved recursively; the pseudo code would be:
// compute the sum of all elements of the array and store it in sum
min = sum;
globalVec = baseVec;

fun generate(baseVec, generatedVec, position, total) {
    if (abs(sum - 2*total) < min) {   // check if this distribution is better
        min = abs(sum - 2*total);
        globalVec = generatedVec;
    }
    if (position >= baseVec.length()) return;
    else {
        // either consider the element at position to be in the first group:
        generate(baseVec, generatedVec.pushback(baseVec[position]), position + 1, total + baseVec[position]);
        // or consider the element at position to be in the second group:
        generate(baseVec, generatedVec, position + 1, total);
    }
}
And now just start the function with generate(baseVec, "", 0, 0), where "" stands for an empty vector.
The algo can be drastically improved by applying it to a sorted array, hence adding a test condition to stop branching, but the idea stays the same.
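For reference, a runnable Python version of the same branching idea (names are mine, not the answer's); it tracks the best split seen so far instead of the globals used above:

    def min_difference_split(ages):
        total = sum(ages)
        best = {"diff": total, "group": []}

        def generate(position, group, group_total):
            diff = abs(total - 2 * group_total)
            if diff < best["diff"]:                     # is this distribution better?
                best["diff"], best["group"] = diff, list(group)
            if position >= len(ages):
                return
            group.append(ages[position])                # element goes into the first group ...
            generate(position + 1, group, group_total + ages[position])
            group.pop()
            generate(position + 1, group, group_total)  # ... or into the second group

        generate(0, [], 0)
        return best["diff"], best["group"]

    print(min_difference_split([30, 25, 21, 17, 9]))   # (0, [30, 21]): 51 vs 51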
There is a file that contains 10G (1000000000) integers; please find the median of these integers. You are given 2G of memory to do this. Can anyone come up with a reasonable way? Thanks!
Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.
Now you count up in that histogram until you reach the bin that covers the midpoint of the values.
Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.
Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.
Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).
Here's some sample Scala code that does this:
def medianFinder(numbers: Iterable[Int]) = {
  def midArgMid(a: Array[Long], mid: Long) = {
    val cuml = a.scanLeft(0L)(_ + _).drop(1)
    cuml.zipWithIndex.dropWhile(_._1 < mid).head
  }
  val topHistogram = new Array[Long](65536)
  var count = 0L
  numbers.foreach(number => {
    count += 1
    topHistogram(number>>>16) += 1
  })
  val (topCount, topIndex) = midArgMid(topHistogram, (count+1)/2)
  val botHistogram = new Array[Long](65536)
  numbers.foreach(number => {
    if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1
  })
  val (botCount, botIndex) =
    midArgMid(botHistogram, (count+1)/2 - (topCount - topHistogram(topIndex)))
  (topIndex << 16) + botIndex
}
and here it is working on a small set of input data:
scala> medianFinder(List(1,123,12345,1234567,123456789))
res18: Int = 12345
If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.
You can use the Medians of Medians algorithm.
If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You edited your original question; I can see now that you can't read them into memory, see below.
If you can't read them into memory, this is what I came up with:
Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.
Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the x-th largest integer. You can discard all of these except for the x-th largest, which I'll call x1.
Do another pass through, finding the next x largest integers less than x1, the least of which is x2.
I think you can see where I'm going with this. After a few passes, you will have read in the (S/2)-th largest integer (you'll have to keep track of how many integers you've found), which is your median. If S is even then you'll average the two in the middle.
Make a pass through the file and find the count of integers and the minimum and maximum integer values.
Take the midpoint of min and max, and get the count, min and max for the values on either side of the midpoint, by reading through the file again.
The partition whose count carries the running total past the median's rank is the one that contains the median.
Repeat for that partition, taking into account the size of the 'partitions to the left' (easy to maintain), and also watching for min = max.
I am sure this would work for an arbitrary number of partitions as well.
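A rough Python sketch of this bisection idea (my own helper; read_all is assumed to be a callable that re-reads the integers from the file on every call, e.g. lambda: (int(line) for line in open("numbers.txt"))). For an even count it returns the lower of the two middle values rather than their average:

    def median_by_bisection(read_all):
        # one pass for the count, the minimum and the maximum
        n, lo, hi = 0, None, None
        for x in read_all():
            n += 1
            lo = x if lo is None or x < lo else lo
            hi = x if hi is None or x > hi else hi
        k = (n + 1) // 2                  # rank of the (lower) median
        # O(log(max - min)) further passes, O(1) memory
        while lo < hi:
            mid = (lo + hi) // 2
            if sum(1 for x in read_all() if x <= mid) >= k:
                hi = mid                  # at least k values are <= mid, so the median is <= mid
            else:
                lo = mid + 1              # fewer than k values are <= mid, so the median is > mid
        return lo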
Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median.
The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.
Given n = number of integers in the original file:
Running time: O(n log n)
Memory: O(1), adjustable
Disk: O(n)
Check out Torben's method here: http://ndevilla.free.fr/median/median/index.html. It also has an implementation in C at the bottom of the document.
My best guess is that a probabilistic median of medians would be the fastest one. Recipe:
Take the next set of N integers (N should be big enough, say 1000 or 10000 elements).
Then calculate the median of these integers and assign it to the variable X_new.
If this is not the first iteration, calculate the median of the two medians:
X_global = (X_global + X_new) / 2
When you see that X_global no longer fluctuates much, you have found an approximate median of the data.
But there are some notes:
the question arises whether the median error is acceptable or not.
the integers must be distributed randomly in a uniform way for the solution to work.
EDIT:
I've played a bit with this algorithm and changed the idea a bit - in each iteration we should sum X_new with a decreasing weight, such as:
X_global = k*X_global + (1.-k)*X_new :
k from [0.5 .. 1.], and increases in each iteration.
The point is to make the calculation of the median converge quickly to some number in a very small number of iterations. So a very approximate median (with a big error) is found among 100000000 array elements in only 252 iterations!!! Check this C experiment:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define RANGE_SIZE 1000

// probabilistic median of medians method
// should print 5000 as data average
// from ARRAY_SIZE of elements
int main (int argc, const char * argv[]) {
    int iter = 0;
    int X_global = 0;
    int X_new = 0;
    int i = 0;
    float dk = 0.002;
    float k = 0.5;
    srand(time(NULL));

    while (i < ARRAY_SIZE && k != 1.) {
        X_new = 0;
        for (int j = i; j < i + RANGE_SIZE; j++) {
            X_new += rand() % 10000 + 1;
        }
        X_new /= RANGE_SIZE;
        if (iter > 0) {
            k += dk;
            k = (k > 1.) ? 1. : k;
            X_global = k*X_global + (1.-k)*X_new;
        }
        else {
            X_global = X_new;
        }
        i += RANGE_SIZE + 1;
        iter++;
        printf("iter %d, median = %d \n", iter, X_global);
    }
    return 0;
}
Oops, it seems I'm talking about the mean, not the median. If that is so, and you need exactly the median and not the mean, ignore my post. In any case the mean and the median are very related concepts.
Good luck.
Here is the algorithm described by @Rex Kerr implemented in Java.
/**
 * Computes the median.
 * @param arr Array of strings, each element represents a distinct binary number and has the same number of bits (padded with leading zeroes if necessary)
 * @return the median (number of rank ceil((m+1)/2) ) of the array as a string
 */
static String computeMedian(String[] arr) {
    // rank of the median element
    int m = (int) Math.ceil((arr.length + 1) / 2.0);
    String bitMask = "";
    int zeroBin = 0;
    while (bitMask.length() < arr[0].length()) {
        // puts elements which conform to the bitMask into one of two buckets
        for (String curr : arr) {
            if (curr.startsWith(bitMask))
                if (curr.charAt(bitMask.length()) == '0')
                    zeroBin++;
        }
        // decides in which bucket the median is located
        if (zeroBin >= m)
            bitMask = bitMask.concat("0");
        else {
            m -= zeroBin;
            bitMask = bitMask.concat("1");
        }
        zeroBin = 0;
    }
    return bitMask;
}
Some test cases and updates to the algorithm can be found here.
I was also asked the same question and I couldn't give an exact answer, so after the interview I went through some books on interviews, and here is what I found in the Cracking the Coding Interview book.
Example: Numbers are randomly generated and stored in an (expanding) array. How would you keep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and
sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements
sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But, be careful—if there's an even number of elements, the median is actually the average
of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of maxes and mins.
This is actually interesting—if you had two heaps, you could keep track of the bigger
half and the smaller half of the elements. The bigger half is kept in a min heap, such
that the smallest element in the bigger half is at the root. The smaller half is kept in a
max heap, such that the biggest element of the smaller half is at the root. Now, with
these data structures, you have the potential median elements at the roots. If the
heaps are no longer the same size, you can quickly "rebalance" the heaps by popping
an element off the one heap and pushing it onto the other.
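A minimal Python sketch of that two-heap idea (the smaller half in a max-heap stored as negated values, the bigger half in a min-heap):

    import heapq

    class RunningMedian:
        def __init__(self):
            self.low, self.high = [], []           # max-heap (negated values), min-heap

        def add(self, x):
            if not self.low or x <= -self.low[0]:
                heapq.heappush(self.low, -x)
            else:
                heapq.heappush(self.high, x)
            # rebalance so the two halves differ in size by at most one
            if len(self.low) > len(self.high) + 1:
                heapq.heappush(self.high, -heapq.heappop(self.low))
            elif len(self.high) > len(self.low) + 1:
                heapq.heappush(self.low, -heapq.heappop(self.high))

        def median(self):
            if len(self.low) == len(self.high):
                return (-self.low[0] + self.high[0]) / 2.0
            return -self.low[0] if len(self.low) > len(self.high) else self.high[0]

    rm = RunningMedian()
    for x in [5, 15, 1, 3]:
        rm.add(x)
    print(rm.median())   # 4.0, the average of 3 and 5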
Note that the more problems you do, the more developed your instinct on which data
structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.