Best item from list based on 3 variables - sorting

Say I have the following for a bunch of items.
item position
item size
item length
A smaller position is better, but a larger length and size are better.
I want to find the item that has the smallest position, largest length and size.
Can I simply calculate a value such as (total - position) * size * length for each item, and then find the item with the largest value? Would it be better to work off percentages?

Either add a fourth field, which is your calculated value of 'goodness', and sort by that, OR, if your language of choice allows, override the comparison operators used for sorting to apply your formula and then sort. Note that the latter approach means that the function determining 'betterness' will be applied multiple times per item in the list, but it has the advantage of making a procedural comparison easy (e.g. first look at the position, then, if that is equal, look at size, then length) - although this could also be expressed as a formula resulting in a single number to sort by.
As for your proposed formula, note that each variable carries the same numerical weight even though the three are measured on completely unrelated scales. Furthermore, any item with position = total, size = 0 or length = 0 evaluates to zero.
If what you want is for position to be the most important thing, then, given equal positions, size to be the next most important thing, and, given equal positions and sizes, to go by length, this can be formulated into a single number as follows:
(P-position)*(S*L) + size*L + length
where L is a magic number that is greater than the maximum possible length value, S is a number greater than the maximum possible size value, and P is a number greater than the maximum possible position value.
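For illustration, here is a minimal Python sketch of this (names and the 1000 bounds are placeholders; it assumes every position is below P, every size below S and every length below L):

```python
# Sketch: fold (position, size, length) into a single sortable number so that
# position dominates, then size, then length. P, S and L are chosen above the
# maximum possible position, size and length respectively.
def score(item, P=1000, S=1000, L=1000):
    position, size, length = item
    return (P - position) * (S * L) + size * L + length

items = [(3, 40, 7), (3, 40, 9), (1, 5, 2)]
best = max(items, key=score)                # -> (1, 5, 2): smallest position wins first
# The same procedural comparison can also be expressed as a composite sort key:
best_by_key = max(items, key=lambda i: (-i[0], i[1], i[2]))
```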
If, on the other hand, what you want is some scale where the three variables have arbitrary relative importances, one possible formula looks like this:
((P-position)/P)*pScale + (size/S)*sScale + (length/L)*lScale
In this version, P, S and L have much the same definitions as before - but it is very important that the values of P, S and L are chosen in a compatible way, e.g. all very close to the expected maximum values, because each term is normalized to roughly the 0..1 range before weighting. pScale, sScale and lScale let you specify the relative importance of each variable. They could all be 1 if all variables are equally important, in which case you could leave them out entirely.
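A corresponding sketch of the weighted version (again with placeholder bounds; P, S and L should sit near the real maxima and the scale factors express relative importance):

```python
# Sketch: normalized, weighted score. Each term is scaled to roughly [0, 1]
# before weighting, so the weights are directly comparable.
def weighted_score(item, P=1000, S=1000, L=1000,
                   pScale=1.0, sScale=1.0, lScale=1.0):
    position, size, length = item
    return (((P - position) / P) * pScale
            + (size / S) * sScale
            + (length / L) * lScale)

items = [(3, 40, 7), (900, 950, 990), (1, 5, 2)]
best = max(items, key=weighted_score)
```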
As previously answered, though, there are also a potentially infinite number of other ways you could choose to code this. As a random example, for large sizes, length could become less important; those possibilities would require much additional thought as to what is actually meant by such a vague statement.

Bin Packing with Overflow

I have N bins (homogeneous or heterogeneous in size, depending on the variant of the task) in which I am trying to fit M items (always heterogeneous in size). Items can be larger than a single bin and are allowed to overflow into the next bin(s) (but no wrapping around from bin N-1 to 0).
The more bins an item spans, the higher its allocation cost.
I want to minimize the overall allocation cost of fitting all M into N bins.
Everything is static. It is guaranteed that all M fit in N.
I think I am looking for a variant of the Bin Packing algorithm. But any hints towards an existing solution/approximation/good heuristic are appreciated.
My current approach looks like this:
sort items by size
for i in items:
    for b in bins:
        try allocation of i starting at b
        if allocation valid:
            record cost
    do allocation of i at the b with the lowest recorded cost
    update fill levels of all affected bins
So basically a greedy-by-size approach with O(M×N×C) runtime, where C ~ "longest allocation across bins" (trying an allocation takes C time).
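For reference, here is one possible reading of that greedy heuristic as a Python sketch (not the asker's actual code). It models the bins as one contiguous run of cells, takes "allocation of i starting at b" to mean placing the item at the first free cell of bin b, and assumes the allocation cost is simply the number of bins the item spans:

```python
from bisect import bisect_right
from itertools import accumulate

def greedy_pack(bin_sizes, item_sizes):
    starts = [0] + list(accumulate(bin_sizes))   # bin boundaries as cell indices
    total = starts[-1]
    free = [True] * total                        # per-cell occupancy

    def bins_spanned(lo, hi):                    # bins touched by cells lo..hi-1
        return bisect_right(starts, hi - 1) - bisect_right(starts, lo) + 1

    placements, total_cost = {}, 0
    # "sort items by size" (assumed largest first, since it is a greedy-by-size approach)
    for i in sorted(range(len(item_sizes)), key=item_sizes.__getitem__, reverse=True):
        size, best = item_sizes[i], None
        for b in range(len(bin_sizes)):
            # try allocation of i starting at b: first free cell inside bin b
            pos = next((c for c in range(starts[b], starts[b + 1]) if free[c]), None)
            if pos is None or pos + size > total or not all(free[pos:pos + size]):
                continue                         # allocation starting at b is not valid
            cost = bins_spanned(pos, pos + size)
            if best is None or cost < best[0]:
                best = (cost, pos)               # record cost
        if best is None:
            raise ValueError(f"item {i} does not fit")
        cost, pos = best                         # do allocation with lowest recorded cost
        for c in range(pos, pos + size):
            free[c] = False                      # update fill levels
        placements[i] = pos
        total_cost += cost
    return placements, total_cost
```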
I would suggest dynamic programming for the exact solution. To make it easier to visualize, assume each bin's size is the length of an array of cells. Since they are contiguous, you can visualize them as a contiguous set of arrays, e.g.
|xxx|xx|xxx|xxxx|
the delimiters between the arrays are "|" and the positions in the arrays are marked with "x", so this has 4 arrays: arr_0 of size 3, arr_1 of size 2, and so on.
If you place an item at position i, it will occupy positions i to i+(h-1), where h is the size of the item. E.g. if the item is of size h=5 and you place it at position 1, you would get
|xbb|bb|bxx|xxxx|
One trick to use is that if we introduce the additional constraint that the items need to be inserted "in order", i.e. the first inserted item is the leftmost inserted item, the second is the second-leftmost inserted item, etc., then this problem will have the same optimal solution as the original one (since we can just take the optimal solution and insert it in order).
Define pos(k-1,i) to be the optimal position at which to insert the k-th object, given that the (k-1)-th object ended at position i-1.
Also define opt(k-1,i) to be the optimal extra cost of inserting the k…N-th pieces, given that the (k-1)-th object ended at position i-1.
Then pos(N-1,i) can be calculated easily by running through the cells and evaluating the extra cost at each candidate position (it is even easier if you note that at least one border of the optimal placement should line up with a bin border, but to keep the analysis simple we evaluate the extra cost for each position in i…NumCells-h). opt(N-1,i) equals this minimal extra cost.
Similarly,
pos(N-2,i) = argmin_x extra_cost(x,i, pos(N-1,x+h)) + opt(N-1,x+h)
Where extra_cost(x,i,j) is the extra cost of inserting an object at position x, given that the previously inserted object ended at i-1 and the next inserted object will start at j.
And by substituting x = pos(N-2,i):
opt(N-2,i) = extra_cost(pos(N-2,i), i, pos(N-1, pos(N-2,i)+h)) + opt(N-1, pos(N-2,i)+h)
By induction, for all 0 <= W < N-1:
pos(W,i) = argmin_x extra_cost(x,i, pos(W+1,x+h)) + opt(W+1,x+h)
And
opt(W,i) = extra_cost(pos(W,i), i, pos(W+1, pos(W,i)+h)) + opt(W+1, pos(W,i)+h)
And your final result is given by the minimum over i of opt(0,i).
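Here is a small memoized Python sketch of that recurrence, under two assumptions the answer leaves open: the left-to-right order of the items is fixed in advance, and extra_cost is taken to be the number of bins a placement spans:

```python
from bisect import bisect_right
from functools import lru_cache
from itertools import accumulate

def min_total_cost(bin_sizes, item_sizes):
    starts = [0] + list(accumulate(bin_sizes))   # bin borders as cell indices
    num_cells = starts[-1]
    suffix = [0] * (len(item_sizes) + 1)         # cells still needed by items k..N-1
    for k in range(len(item_sizes) - 1, -1, -1):
        suffix[k] = suffix[k + 1] + item_sizes[k]

    def extra_cost(x, h):                        # assumed cost: bins spanned by cells x..x+h-1
        return bisect_right(starts, x + h - 1) - bisect_right(starts, x) + 1

    @lru_cache(maxsize=None)
    def opt(k, i):
        # optimal extra cost of inserting items k..N-1, given the previous item ended at i-1
        if k == len(item_sizes):
            return 0
        h = item_sizes[k]
        return min(extra_cost(x, h) + opt(k + 1, x + h)
                   for x in range(i, num_cells - suffix[k] + 1))

    return opt(0, 0)
```

With memoization this is roughly O(M × NumCells²) time and O(M × NumCells) space.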

How to make a uniform random distribution but where result is revealed in steps?

For example, let's say there is an array of items, each equally likely to be chosen, and the output of this random function tells which item is chosen; but I want the function to be split into multiple steps, so that at each step the list of potential items is narrowed down, giving better insight into the result probabilities.
Here's a step by step example of how it might work:
Step 1: Every item has a 1/1000 chance.
Step 2: Random subset of half the original set is removed, so each remaining item is 1/500 now.
Step 3: Repeat step 2 until narrowed down to a single item.
The requirements I'd like for the algorithm are < O(n) time complexity, and that at each step the distribution is still uniformly random.
Initially I thought of an algorithm that would:
1. Start with variables min and max describing the current range of values left.
2. Shrink the range by generating a random float in [-1, 1] and applying it to the range proportionally: if the number is negative, lower the max, otherwise raise the min. So 50% of the time it shifts the min up, 50% of the time it shifts the max down, and the range shrinks by a factor in [0, 1].
3. Repeat 2. until the range converges on a single number.
But I noticed this doesn't have a uniform distribution; instead, the chosen result is more likely to be close to the starting min and max values. To fix this I think one could add a preliminary step where the starting range is offset by another random value. But that would only make the starting distribution uniformly random, and it still doesn't meet my requirement of being uniformly random at every step.
The naive solution is to generate random numbers and remove the corresponding items from the list at each step, but that is an O(n) solution, so I hope there is something better.
You just have to apply Bayes' Theorem.
If you randomly remove a portion p of the remaining possibilities, the remaining items have their probabilities multiplied by 1/(1-p). So in your step 2, the probabilities change by an amount corresponding to how much the range shrank, not by a fixed factor.
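To make that concrete, here is a tiny sketch (it does O(n) total work, so it only illustrates the probability bookkeeping, not a sub-linear algorithm): each step removes a uniformly random half of the survivors, so by symmetry every remaining item stays equally likely to be the final pick, and each survivor's probability is simply 1 over the number of survivors.

```python
import random

def narrow_down(items, fraction=0.5):
    """Reveal a uniform random pick in steps. After removing a portion p of the
    survivors, each remaining item's chance is multiplied by 1/(1-p), per Bayes."""
    survivors = list(items)
    while len(survivors) > 1:
        remove = max(1, int(len(survivors) * fraction))
        doomed = set(random.sample(range(len(survivors)), remove))
        survivors = [x for j, x in enumerate(survivors) if j not in doomed]
        print(f"{len(survivors)} left, each now has probability 1/{len(survivors)}")
    return survivors[0]
```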
This problem has some very simple answers so maybe that is why people seemed confused.
One solution is to generate a random index in [0, n-1], where n is the number of items in the current set, and, instead of removing only that item, remove a range of items around that point.
Solution two is a bit more complicated, but it has the property of preserving set order and location, so that the resulting set is just a spliced section of the original set, whereas the first solution's resulting set could be made up of multiple sections of the original set. The method here is as described initially in my post, but you also apply the random offset at each step, not just once at the beginning.

Nesting algorithm

I have a problem with nesting of lengths. Please suggest the best way of solving it. My problem is as follows.
I have some standard length, let's say Total length (this is the total length we need to fill with blocks of specific lengths).
The input is a list of block lengths, e.g. 5000, 4000, 3000,
and the gap between blocks must lie within a range, e.g. 200 to 500 (the gap can be adjusted within this range).
Now we have to fill the Total length with the available blocks, with a gap between each pair of blocks, and every gap must be within the range given above.
Please suggest some way of solving this problem.
Thanks in advance...
Regards,
Anil
This problem is essentially the Subset sum problem with a small twist. You can use the same pseudo-polynomial time solution as subset sum: create an array of length "total length" (which we call n from now on) and, for each block length k, add it to every existing entry in the array (so if element m is populated, create a new entry at m + k (if m + k ≤ n) but leave the existing one at m as well), as well as creating a new entry at location k representing the start of a new set. At each array element i you build up the set of lists of blocks totaling i. Each set entry should link back to the array entry it came from, which it can do by simply storing the last length that got you there. This is similar to a question I answered recently here, which you can adjust to allow duplicates as you see fit.
However, you need to modify the above approach to account for the gaps. Let's call the minimum gap between elements x and the maximum gap y. When adding an entry of length k, we include the minimum gap whenever adding it to another entry (so if m is populated, we actually create the new entry at m + k + x). We still create the initial entry at just k, because gaps only appear between elements (there is no gap before the first block). When we create an entry, we can also determine whether it fills the space. Suppose the new entry contains t elements and has total m (blocks plus minimum gaps). Then it fills the space iff m ≥ n - (t - 1)(y - x), since each of the t - 1 gaps can be stretched by at most y - x. If it fills the space, we add it to a solution list. Depending on how many solutions you want, you can terminate the algorithm as soon as enough solutions are found, or let it find all solutions. At the end, simply iterate through the solution list.
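A rough Python sketch of that idea (illustrative names; each block is used at most once, and gaps are only counted between blocks):

```python
def find_fillings(total, blocks, min_gap, max_gap):
    # states: achieved length (blocks plus minimum gaps) -> list of block-index tuples
    states = {0: [()]}
    for idx, k in enumerate(blocks):
        new_states = {}
        for m, combos in states.items():
            gap = min_gap if m > 0 else 0          # no gap before the first block
            m2 = m + gap + k
            if m2 <= total:
                new_states.setdefault(m2, []).extend(c + (idx,) for c in combos)
        for m2, combos in new_states.items():
            states.setdefault(m2, []).extend(combos)
    solutions = []
    for m, combos in states.items():
        for combo in combos:
            slack = total - m                       # length left to absorb by stretching gaps
            if combo and slack <= (len(combo) - 1) * (max_gap - min_gap):
                solutions.append(combo)
    return solutions

# e.g. find_fillings(9500, [5000, 4000, 3000], 200, 500) -> [(0, 1)]
```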
Any solution within the range can distribute its gaps in a number of different ways, but one way that always works is to greedily allocate the slack: for example, if you are 1000 short of the total length using minimum gaps, you can set the first three gaps to 500 (300 extra slack each, 900 in total), the fourth to 300 (absorbing the remaining 100 of slack, for 1000 in total), and every remaining gap to the minimum (200).
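And a matching sketch of that greedy slack allocation for one chosen combination:

```python
def assign_gaps(total, blocks, combo, min_gap, max_gap):
    # Greedily stretch the earliest gaps until the leftover slack is absorbed.
    used = sum(blocks[i] for i in combo)
    slack = total - (used + (len(combo) - 1) * min_gap)
    gaps = []
    for _ in range(len(combo) - 1):
        extra = min(slack, max_gap - min_gap)
        gaps.append(min_gap + extra)
        slack -= extra
    return gaps                                     # one gap per adjacent pair of blocks

# e.g. assign_gaps(9500, [5000, 4000, 3000], (0, 1), 200, 500) -> [500]
```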

Get most unique text from a group of text

I have a number of texts, for example 100.
I would like to keep the 10 most unique among them. I built a 100x100 matrix in which I compared each text to every other using the Levenshtein distance.
Is there an algorithm to select the 10 most unique?
EDIT :
What I want is the N most unique texts, i.e. the N texts that maximize the distances among themselves, regardless of which is the 1st element of my set.
I want the most unique because I will publish these texts to the web and I want to avoid near-duplicates.
A long comment rather than an answer ...
I don't think you've specified your requirement(s) clearly enough. How do you select the 1st element of your set of 10 strings? Is it the string with the largest distance from any other string (in which case you are looking for the largest element in your array), or the one with the largest distance from all the other strings (in which case you are looking for the largest row- or column-sum in the array)?
Moving on to the N (or 10 as you suggest) most distant strings, you have a number of choices.
You could select the N largest distances in the array. I suspect, not having seen your data, that it is likely that the string which is furthest from any other string may also be furthest away from several other strings too -- I mean you may find that several of the N largest entries in your array occur in the same row or column.
You could simply select the N strings with the largest row sums.
Or perhaps you are looking for a cluster of N strings which maximises the distance between all the strings in that cluster and all the strings in the remaining 100-N strings. This might lead you towards looking at, rather obviously, clustering algorithms.
I suggest you clarify your requirements and edit your question.
Since this looks like an eigenvalue problem, I would try running power iteration on the matrix and rejecting the 90 texts with the highest values in the resulting vector. Power iteration normally converges very fast, within ~ten iterations. BTW: this solution assumes a similarity matrix. If the entries of your matrix are a measure of *dis*similarity ("distance"), you might need to use their inverses instead.
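A small sketch of that suggestion (assuming `sim` is a symmetric 100x100 similarity matrix with non-negative entries, e.g. sim = 1 / (1 + levenshtein_distance)):

```python
import numpy as np

def least_central(sim, keep=10, iters=50):
    v = np.ones(sim.shape[0])
    for _ in range(iters):
        v = sim @ v                 # power iteration towards the dominant eigenvector
        v /= np.linalg.norm(v)
    return np.argsort(v)[:keep]     # texts with the smallest scores are least similar to the rest
```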

Finding the Nth largest value in a group of numbers as they are generated

I'm writing a program that needs to find the Nth largest value in a group of numbers. These numbers are generated by the program, but I don't have enough memory to store N numbers. Is there a better upper bound than N that can be achieved for storage? The upper bound for the size of the group of numbers (and for N) is approximately 100,000,000.
Note: The numbers are decimals and the list can include duplicates.
[Edit]: My memory limit is 16 MB.
This is a multipass algorithm (therefore, you must be able to generate the same list multiple times, or store the list off to secondary storage).
First pass:
Find the highest value and the lowest value. That's your initial range.
Passes after the first:
Divide the range up into 10 equally spaced bins. We don't need to store any numbers in the bins; we're just going to count membership in each bin, so we only need an array of integers (or bigints - whatever can accurately hold our counts). Note that 10 is an arbitrary choice for the number of bins. Your sample size and distribution will determine the best choice.
Spin through each number in the data, incrementing the count of whichever bin holds the number you see.
Figure out which bin has your answer, and add the counts of all bins above it to your running count of numbers above the winning bin.
The winning bin's top and bottom range are your new range.
Loop through these steps again until you have enough memory to hold the numbers in the current bin.
Last pass:
You should know how many numbers are above the current bin by now.
You have enough storage to grab all the numbers within your range of the current bin, so you can spin through and grab the actual numbers. Just sort them and grab the correct number.
Example: if the range you see is 0.0 through 1000.0, your bins' ranges will be:
[0.0 - 100.0]
(100.0 - 200.0]
(200.0 - 300.0]
...
(900.0 - 1000.0]
If you find through the counts that your number is in the (100.0 - 200.0] bin, your next set of bins will be:
(100.0 - 110.0]
(110.0 - 120.0]
etc.
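A sketch of the whole procedure (a re-runnable `generate()` callable stands in for regenerating or re-reading the numbers; floating-point ties exactly on bin boundaries are glossed over for brevity):

```python
def nth_largest_binned(generate, n, num_bins=10, fit_limit=1_000_000):
    lo = hi = None
    in_range = 0
    for x in generate():                     # first pass: initial range and total count
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
        in_range += 1
    above = 0                                # numbers known to be larger than the current range
    hi_inclusive = True                      # does the current range include its upper edge?
    while in_range > fit_limit and hi > lo:
        width = (hi - lo) / num_bins
        counts = [0] * num_bins
        for x in generate():                 # count membership per bin; no numbers stored
            if lo <= x and (x <= hi if hi_inclusive else x < hi):
                counts[min(int((x - lo) / width), num_bins - 1)] += 1
        b = num_bins - 1                     # winning bin: scan from the top bin down
        while above + counts[b] < n:
            above += counts[b]
            b -= 1
        in_range = counts[b]
        if b < num_bins - 1:                 # bins are half-open except the topmost one
            hi, hi_inclusive = lo + (b + 1) * width, False
        lo = lo + b * width
    vals = sorted(x for x in generate()      # last pass: the survivors now fit in memory
                  if lo <= x and (x <= hi if hi_inclusive else x < hi))
    return vals[len(vals) - (n - above)]
```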
Another multipass idea:
Simply do a binary search on the value. Choose the midpoint of the range as the first guess. Each pass just needs to do an above/below count to determine the next estimate (which can be weighted by the counts, or simply be the midpoint of the remaining range for code simplicity).
Are you able to regenerate the same group of numbers from start? If you are, you could make multiple passes over the output: start by finding the largest value, restart the generator, find the largest number smaller than that, restart the generator, and repeat this until you have your result.
It's going to be a real performance killer, because you have a lot of numbers and a lot of passes will be required - but memory-wise, you will only need to store 2 elements (the current maximum and a "limit", the number you found during the last pass) and a pass counter.
You could speed it up by using a priority queue to find the M largest elements in each pass (choosing some M that you are able to fit in memory), reducing the number of passes required to about N/M.
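A sketch of that speed-up (assuming, for simplicity, that the generated values are distinct; ties with the cutoff value would need extra bookkeeping):

```python
import heapq

def nth_largest_multipass(generate, n, m):
    limit = float('inf')                     # only values below this are considered each pass
    remaining = n
    while True:
        heap = []                            # min-heap of the m largest values seen below `limit`
        for x in generate():
            if x < limit:
                if len(heap) < m:
                    heapq.heappush(heap, x)
                elif x > heap[0]:
                    heapq.heapreplace(heap, x)
        batch = sorted(heap, reverse=True)   # this pass's next-largest values, biggest first
        if not batch:
            raise ValueError("n is larger than the number of distinct values")
        if remaining <= len(batch):
            return batch[remaining - 1]
        remaining -= len(batch)
        limit = batch[-1]                    # next pass looks strictly below the smallest value kept here
```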
If you need to find, say, the 10th largest element in a list of 15 numbers, you could save time by working the other way around. Since it is the 10th largest element, that means there are 15-10=5 elements smaller than this element - so you could look for the 6th smallest element instead.
This is similar to another question -- C Program to search n-th smallest element in array without sorting? -- where you may get some answers.
The logic will work for Nth largest/smallest search similarly.
Note: I am not saying this is a duplicate of that.
Since you have a lot of (nearly 1 billion?) numbers, here is another way to optimize for space.
Let's assume your numbers fit in 32-bit values; storing about 1 billion of them outright would take close to 4 GB of space. Now, if you can afford about 128 MB of working memory, we can do this in one pass.
Imagine a 1 billion bit-vector stored as an array of 32-bit words
Let it be initialized to all zeros
Start running through your numbers and, for each one, set the bit whose position corresponds to the number's value
When you are done with the pass, count set bits from the top (highest position) of the bit vector downwards until you reach the Nth set bit
That bit's position gives you the value for your Nth largest number
You have actually sorted all the numbers in the process (however, count of duplicates is not tracked)
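A sketch of this, assuming the values are non-negative integers below some known bound max_value (as noted above, duplicates collapse to a single bit); with a bound of about a billion possible values the vector takes roughly 128 MB:

```python
def nth_largest_bitvector(numbers, n, max_value):
    bits = bytearray((max_value + 7) // 8)       # one bit per possible value
    for v in numbers:                            # single pass: mark every value seen
        bits[v >> 3] |= 1 << (v & 7)
    seen = 0
    for v in range(max_value - 1, -1, -1):       # walk from the highest value down
        if bits[v >> 3] & (1 << (v & 7)):
            seen += 1
            if seen == n:                        # the n-th set bit from the top
                return v
    return None                                  # fewer than n distinct values
```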
If I understood well, the upper bound on memory usage for your program is O(N) (possibly N+1). You can maintain a list of the generated values that are greater than the current X (the Nth largest value so far), ordered lowest first. As soon as a new value greater than X is generated, replace the current X with the first element of the list and insert the newly generated value at its corresponding position in the list.
sort -rn | uniq -c and the Nth largest should then be on the Nth row
