Algorithm to find the optimal items to buy to reach certain criteria

We have n item types that we can buy (we have unlimited stock of each):
{ p:100.0f, a:10.0f, b:20.0f }
{ p:77.0f, a:20.0f, b:10.0f }
{ p:55.0f, a:0.0f, b:12.0f }
Let a and b be some arbitrary properties of the item (e.g. quality and performance; their exact meaning is irrelevant to the problem). We then have two target values:
a: 12.0f
b: 4.0f
These two values specify the property totals we are looking for, and they have to be matched exactly: we need to find the best combination of items to buy so that we reach these targets at the lowest total p. Note that individual items can be used in fractional amounts (0.5 of a certain item contributes 0.5 of its p, a, and b values).
Task: minimize p while matching the totals of a and b to the required a and b; find the best configuration and print it (including the amount of each item we need).
Note that not all item types have to be used.
I've tried solving this as a knapsack problem, but I was unable to get it working.
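Since items can be bought in fractional amounts, this is a linear program rather than a knapsack problem. Here is a minimal sketch using SciPy's linprog (assuming SciPy is available), with the item data and targets from the question; note that for these particular sample numbers the targets may be unreachable, in which case the solver reports infeasibility.
from scipy.optimize import linprog

items = [
    {"p": 100.0, "a": 10.0, "b": 20.0},
    {"p": 77.0, "a": 20.0, "b": 10.0},
    {"p": 55.0, "a": 0.0, "b": 12.0},
]
target_a, target_b = 12.0, 4.0

c = [it["p"] for it in items]           # objective: total price
A_eq = [
    [it["a"] for it in items],          # sum of a values must equal target_a
    [it["b"] for it in items],          # sum of b values must equal target_b
]
b_eq = [target_a, target_b]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(items))
if res.success:
    for amount, it in zip(res.x, items):
        print(f"buy {amount:.3f} of {it}")
    print("total p:", res.fun)
else:
    print("no combination of items matches the targets exactly")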

Related

How do I find the right optimisation algorithm for my problem?

Disclaimer: I'm not a professional programmer or mathematician and this is my first time encountering the field of optimisation problems. Now that that's out of the way, let's get to the problem at hand:
I have several lists, each containing various items and a number called 'mandatoryAmount':
listA (mandatoryAmountA, itemA1, itemA2, itemA3, ...)
Each item has certain values (each value is a number >= 0):
itemA1 (M, E, P, C, Al, Ac, D, Ab, S)
I have to choose a certain number of items from each list determined by 'mandatoryAmount'.
Within each list I can choose every item multiple times.
Once I have all of the items from each list, I'll add up the values of each.
For example:
totalM = listA (itemA1 (M) + itemA1 (M) + itemA3 (M)) + listB (itemB1 (M) + itemB2 (M))
The goals are:
- To have certain values (totalAl, totalAc, totalAb, totalS) reach a certain cap while going over that cap as little as possible. Anything over the cap is wasted.
- To maximize the remaining values, each with its own weighting.
The output should be the best possible selection of items to meet the goals stated above. I imagine the evaluation function would just add up all non-wasted values times their respective weightings and subtract all wasted stats times their respective weightings.
edit:
The total amount of items across all lists should be somewhere between 500 and 1000, the number of lists is around 10 and the mandatoryAmount for each list is between 0 and 14.
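To make the objective concrete, here is a hypothetical sketch of the evaluation function described above (all names are illustrative): capped totals only incur a penalty for whatever exceeds their cap, while the remaining totals are simply weighted and added.
def score(totals, caps, weights, waste_weights):
    """totals/caps/weights/waste_weights: dicts keyed by value name (e.g. "Al", "M")."""
    s = 0.0
    for key, value in totals.items():
        if key in caps:
            # anything over the cap is waste and is penalized
            s -= waste_weights.get(key, 0.0) * max(value - caps[key], 0.0)
        else:
            # uncapped values are maximized with their own weighting
            s += weights.get(key, 0.0) * value
    return s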
Here's some sample code that uses Python 3 and OR-Tools. Let's start by
defining the input representation and a random instance.
import collections
import random
Item = collections.namedtuple("Item", ["M", "E", "P", "C", "Al", "Ac", "D", "Ab", "S"])
List = collections.namedtuple("List", ["mandatoryAmount", "items"])
def RandomItem():
    # nine independent uniform(0, 1) values, one per field
    return Item(*(random.random() for _ in range(9)))
lists = [
    List(
        random.randrange(5, 10),
        [RandomItem() for j in range(random.randrange(5, 10))],
    )
    for i in range(random.randrange(5, 10))
]
Time to formulate the optimization as a mixed-integer program. Let's import
the solver library and initialize the solver object.
from ortools.linear_solver import pywraplp
solver = pywraplp.Solver.CreateSolver("SCIP")
Make constraints for the totals that must reach a certain cap.
AlCap = random.random()
totalAl = solver.Constraint(AlCap, solver.infinity())
AcCap = random.random()
totalAc = solver.Constraint(AcCap, solver.infinity())
AbCap = random.random()
totalAb = solver.Constraint(AbCap, solver.infinity())
SCap = random.random()
totalS = solver.Constraint(SCap, solver.infinity())
We want to maximize the other values subject to some weighting.
MWeight = random.random()
EWeight = random.random()
PWeight = random.random()
CWeight = random.random()
DWeight = random.random()
solver.Objective().SetMaximization()
Create variables and fill in the constraints. For each list there is an
equality constraint on the number of items.
associations = []
for list_ in lists:
    amount = solver.Constraint(list_.mandatoryAmount, list_.mandatoryAmount)
    for item in list_.items:
        x = solver.IntVar(0, solver.infinity(), "")
        amount.SetCoefficient(x, 1)
        totalAl.SetCoefficient(x, item.Al)
        totalAc.SetCoefficient(x, item.Ac)
        totalAb.SetCoefficient(x, item.Ab)
        totalS.SetCoefficient(x, item.S)
        solver.Objective().SetCoefficient(
            x,
            MWeight * item.M
            + EWeight * item.E
            + PWeight * item.P
            + CWeight * item.C
            + DWeight * item.D,
        )
        associations.append((item, x))
if solver.Solve() != solver.OPTIMAL:
    raise RuntimeError
solution = []
for item, x in associations:
    solution += [item] * round(x.solution_value())
print(solution)
I think David Eisenstat has the right idea with integer programming, but let's see if we can get some good solutions otherwise and perhaps provide some initial optimization. However, I think that the fact that we can choose all of one item within each list may make this easier to solve than it normally would be. Basically that turns it into more of a subset-sum problem, especially with the cap.
There are two possibilities here:
There is no solution, no condition satisfies the requirement.
There is a solution that we need to be optimized.
We really want to try to find a solution first; if we can find one (regardless of the amount of waste), that's a good start.
So let's reframe the problem: we aim to minimize waste, but we also need to meet the minimum requirements. So let's try to direct as much of the would-be waste as possible into the values we actually need.
I'm going to propose an algorithm you could use that should work "fairly well" and runs in polynomial time, though it could probably be optimized further. I'll be using K to mean mandatoryAmount, as it's a fairly customary variable in this situation, N to mean the number of lists, and Z to represent the total number of items (across all lists).
Step 1: Get the list of all items and sort them by the amount of each value they have (first the goal values, then the bonus values). If an item has 100A, 300C, 200B, 400D, 150E and the required values are [B, D], then the sort key would look like [400, 200, 300, 150, 100]. Repeat, but for one goal value at a time; using the same example we would have [400, 300, 150, 100] for goal D and [200, 300, 150, 100] for goal B. Create a boolean variable for optimization mode (we start by seeking a solution; once we find one, we can try to optimize it). Create a counter/hash to track unassigned items. An item cannot be unassigned more than K times (to avoid infinite loops). This isn't strictly needed, but it can work as an optimization for step 5, as it prioritizes goals you actually need.
Step 2: For each list, keep a counter of the number of assignable slots and set it to K, as well as a counter of the total assignable slots, set to K * N. These will be adjusted as needed along the way. You want to be able to do a quick O(1) lookup for: a) which list a (sorted) item belongs to, b) how many available slots that item's list has, c) how many times the item has been unassigned, and d) where the item sits in the sorted list.
Step 3 (general assignment): While there are slots available (total slots > 0), go through the sorted list from highest to lowest. If the list for that item has slots available, assign as many slots as possible to that item. Update the assignable and total slot counts. If the result is a valid solution, record it and trip the "optimization mode" flag. If slots remain unassigned, revert the previous unassignment (but do not change the assignment count).
Step 4 (waste optimization): Find the most wasteful item that can still be unassigned (unassigned count < K) and unassign one slot of it. If in optimization mode, do not allow any of the goal values to go below their cap (skip the item if it would). Update the unassigned count for the item. Go to step 3, but start just after the wasteful item. If no assignment is made, reassign this item until its list has no remaining assignments, but do not update the unassigned count (otherwise we might end up in an invalid state).
Step 5 (goal-value optimization): Skip this step if the current state is a valid solution. Find the value furthest from its goal (i.e. A/B/C/D/E above) that can still be unassigned and unassign one slot of that item. Update the assignment count. Go to step 3, but begin the search at the start of the list (unlike step 4) and stop searching once you go below the value of this item (not this item itself, as others may have the same value). If no assignment is made, reassign this item until its list has no remaining assignments, but do not update the unassigned count (otherwise we might end up in an invalid state).
Step 6: When no assignments remain, return the current state as the "best solution found".
The algorithm should end with the "best" solution this approach can come up with. Increasing the maximum unassignment count may improve the solution; decreasing it will speed up the algorithm. The algorithm runs until it has maxed out its unassignment counts.
This is a bit of a greedy algorithm, so I'm not sure it's optimal (in the sense that it will always yield the best result), but it may give you some ideas on how to approach the problem. It also feels like it should yield fairly good results, as it is basically trying to bound the results. Algorithm performance is something like O(Z^2 * K), where K is the mandatoryAmount and Z is the total number of items: each item is unassigned at most K times, and each unassignment potentially requires O(Z) checks before it is reassigned.
As an optimization, use a sorted data structure with O(log Z) or better delete/next operations to store the sorted lists. That makes it practical to delete items from the assignment lists once their unassignment count reaches K (rendering them no longer assignable), giving O(Z * log(Z) * K) performance instead.
Edit:
Hmmm, the above only works within a single list (i.e. an item that is removed can only be added back to its own list, as only that list has room). To avoid this, do step 4 (remove the too-heavy item), then step 5 (remove the too-light item), and then go to step 3 (using step 5's rules for searching, but also disallowing adding back the too-heavy items).
So basically we remove the heaviest one, then the lightest one, then we try to assign something as heavy as possible to make up for the lightest one we removed.
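As a rough illustration, here is a hypothetical Python sketch of the general-assignment pass (step 3 above); the item representation (dicts of values) and goal keys are assumptions, and the waste/goal optimization passes (steps 4-5) are only indicated in a comment.
def general_assignment(lists, goal_keys):
    """lists: sequence of (mandatory_amount, items); each item is a dict of values."""
    assignment = []
    for mandatory, items in lists:
        # each list takes its full mandatoryAmount of its highest-ranked item,
        # where the rank is the item's total contribution to the capped goal values
        best = max(items, key=lambda it: sum(it[g] for g in goal_keys))
        assignment.append({"item": best, "count": mandatory})
    return assignment

def totals(assignment, keys):
    """Total value per key for the current assignment."""
    return {k: sum(a["item"][k] * a["count"] for a in assignment) for k in keys}

# Steps 4-5 would then repeatedly unassign one slot of the most wasteful item
# (or of the value furthest from its goal) and re-run a bounded version of step 3.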

Generating in-order constrained sets

First I will paste the scenario and then pose my question:
Suppose you have a list of Categories, for example:
Food,Meat,Dairy,Fruit,Vegetable,Grain,Wheat,Barley
Now you have a list of items that fits into one or more of the categories listed above.
Here is a sample list of items:
Pudding,Cheese,Milk,Chicken,Barley,Bread,Couscous,Fish,Apple,Tomato,
Banana,Grape,Lamb,Roast,Honey,Potato,Rice,Beans,Legume,Barley Soup
As you can see, every item fits into at least one category; it could fit into more, or possibly all, but the minimum is always one.
For example Cheese is a Food and Dairy.
Each item has two attributes:
1) A Price Tag
2) A Random Value
A set is defined as having every category mapped to an item.
In other words all categories must be present in a set.
A set from the items above could be:
[Pudding,Lamb,Milk,Apple,Tomato,Legume,Bread,Barley Soup]
As you see each item is mapped to a category slot:
Pudding is mapped to Food Category
Lamb is mapped to Meat Category
Milk is mapped to Dairy Category
Apple is mapped to Fruit Category
Tomato is mapped to Vegetable Category
Legume is mapped to Grain Category
Bread is mapped to Wheat Category
Barley Soup is mapped to Barley Category
My question is: what is the most efficient algorithm for generating in-order sets of the above categories from the given list of items?
The best set is defined as having the highest Random Value in total.
The only constraint is that any generated set cannot, in total, exceed a certain fixed amount, in other words, all generated sets should be within this Price Cap.
Hope I am clear, thank you!
What you are trying to achieve is a form of maximal matching, and I don't know if there is an efficient way to compute in-order sets, but still this reduction might help you.
Define a bipartite graph with one node per category on one side and one node per item on the other side. Add an edge between an item and a category if that item belongs to that category, with a weight given by the random value of the item.
A "set" as you defined it is a maximum-cardinality matching in that graph.
They can be enumerated in reasonable time, as proved by Takeaki Uno in
"A Fast Algorithm for Enumerating Non-Bipartite Maximal Matchings", and it is likely to be even faster in your situation because your graph is bipartite.
Among those sets, you are looking for the ones with maximal weight and under a price constraint. Depending on your data, it may be enough to just enumerate them all, filter them based on the price, and sort the remaining results if there are not too many.
If that is not the case, then you may find the best set by solving the combinatorial optimization problem whose objective function is the total weight and whose constraints are the price limit and the cardinality (known as maximum-weight matching in the literature). There may even be solvers available online once you write the problem in this form. However, this will only provide one such set rather than a sorted list; as this problem is very hard in the general case, that is the best you can hope for. You would need more assumptions on your data to get better results (like bounds on the maximum number of such sets, the maximum number of items that can belong to more than k categories, etc.).
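As a rough illustration of the reduction, here is a hypothetical sketch using networkx (the data is made up): categories and items become the two sides of a bipartite graph, edge weights are the items' random values, and max_weight_matching with maxcardinality=True returns a maximum-cardinality matching of maximum total weight. The price cap is not handled here; it would have to be enforced by filtering the enumerated matchings or via the combinatorial-optimization formulation mentioned above.
import networkx as nx

categories = ["Food", "Meat", "Dairy"]
items = {                      # item -> (categories it fits, random value)
    "Cheese": (["Food", "Dairy"], 7),
    "Lamb":   (["Food", "Meat"], 5),
    "Milk":   (["Food", "Dairy"], 3),
}

G = nx.Graph()
for name, (cats, value) in items.items():
    for cat in cats:
        G.add_edge(("cat", cat), ("item", name), weight=value)

matching = nx.max_weight_matching(G, maxcardinality=True)
for u, v in matching:
    cat, item = (u, v) if u[0] == "cat" else (v, u)
    print(cat[1], "->", item[1])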
Alright, here is my second try at answering this question.
Let's say we have the following input:
class Item {
public:
    // categories are stored in a single unsigned int and checked bitwise
    unsigned int category;
    int name;
    int value;
    int price;
    ...
};

class ItemSet
{
public:
    set<Item> items;
    int sum;
};
First, sort the input data by highest random value, then lowest price:
bool operator<(const Item& item1, const Item& item2) {
    if (item1.value == item2.value) {
        if (item1.price == item2.price) {
            return item1.name < item2.name;
        }
        return item1.price > item2.price;
    }
    return item1.value < item2.value;
}
...
vector<Item> v = generateTestItem();
// sort in descending order; std::greater<Item> would need an operator>,
// so reverse the arguments of operator< instead
sort(v.begin(), v.end(), [](const Item& a, const Item& b) { return b < a; });
Next, use backtracking to collect the top sets into a heap until the conditions are met. Because the input is sorted, the backtracking tends to reach the highest-value, lowest-price sets first. One more thing to note: I compare item categories (currentCats) with bit manipulation, which gives an O(1) check.
priority_queue<ItemSet> output;

void helper(vector<Item>& input, set<Item>& currentItems, unsigned int currentCats, int sum, int index)
{
    if (index == input.size()) {
        // index reached the end of the input, record the current set and exit
        addOutput(currentItems);
        return;
    }
    if (output.size() >= TOP_X) {
        // output already holds the required number of sets, exit
        return;
    }
    if (sum + input[index].price < PRICE_TAG) {
        if ((input[index].category & currentCats) == 0) {
            // this item's category does not exist in currentCats yet, so take it
            currentItems.insert(input[index]);
            helper(input, currentItems, currentCats | input[index].category,
                   sum + input[index].price, index + 1);
        }
    } else {
        addOutput(currentItems);
        return;
    }
    if (currentItems.find(input[index]) != currentItems.end()) {
        currentItems.erase(input[index]);
    }
    // also explore the branch that skips this item
    helper(input, currentItems, currentCats, sum, index + 1);
    return;
}

void getTopItems(vector<Item>& items)
{
    set<Item> myset;
    helper(items, myset, 0, 0, 0);
}
In the worst case this backtracking runs in O(2^N) time, but since TOP_X is a limited value it should not take too long in practice.
I tested this code by generating random values and it seems to work fine. Full code can be found here
I'm not exactly sure what you mean by "generating in-order sets".
I think any algorithm is going to generate sets, score them, and then try to generate better sets. Given all the constraints, I do not think you can generate the best set efficiently in one pass.
The 0-1 knapsack problem has been shown to be NP-hard, which means no polynomial-time (i.e. O(n^k)) solution is known. Your problem contains it as a special case: if, in your input, the random value were always equal to the price and there were only one category, you would have exactly the knapsack problem. In other words, your problem is at least as hard as the knapsack problem, so you cannot expect a guaranteed polynomial-time solution.
You can generate all valid sets combinatorially pretty easily using nested loops: loop per category, looping over the items in that category. Early on you can improve the efficiency by skipping over an item if it has already been chosen and by skipping over the whole set once you find it is over the price cap. Put those results in a heap and then you can spit them out in order.
If your issue is that you want something with better performance than that, it seems to me more like a constraint programming, or, more specifically, a constraint satisfaction problem. I suggest you look at the techniques used to handle those kinds of problems.
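A hypothetical sketch of the nested-loop enumeration with pruning and a heap described above (data layout and names are assumptions):
import heapq

def enumerate_sets(categories, items_by_category, price_cap):
    """items_by_category: {category: [(name, price, value), ...]}"""
    results = []   # heap of (-total_value, total_price, chosen items)

    def recurse(cat_idx, chosen, total_price, total_value):
        if total_price > price_cap:
            return                          # prune: already over the price cap
        if cat_idx == len(categories):
            heapq.heappush(results, (-total_value, total_price, tuple(chosen)))
            return
        for name, price, value in items_by_category[categories[cat_idx]]:
            if name in chosen:
                continue                    # item already used for another category
            chosen.append(name)
            recurse(cat_idx + 1, chosen, total_price + price, total_value + value)
            chosen.pop()

    recurse(0, [], 0, 0.0)
    return results                          # heapq.heappop yields the best set first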

Algorithm / Data structure for largest set intersection in a collection of sets with a given set

I have a large collection of several million sets, C. The elements of my sets come from a universe of about 2000 possible elements. I need to know, for a given set, s, which set in C has the largest intersection with s? (Or the k sets in C with the k-largest intersections). I will be making many of these queries, sequentially, for different s.
I know that the obvious way to do this is to just to loop over every set in C and compute the intersection and take the max. Are there any smart data structures / programming tricks that can speed up my search? It would be great if I could do this faster than O(C).
EDIT: approximate answers would be alright too
I don't think there's a clever data structure that will help with asymptotic performance. But this is a perfect map reduce problem. A GPGPU would do nicely. For a universe of 2048 elements, a set as a bitmap is only 256 bytes. 4 million is only a gigabyte. Even a modestly spec'ed Nvidia has that. E.g. programming in CUDA, you'd copy C to graphics card RAM, map a chunk of the gigabyte to each GPU core for searching and then reduce across cores to find the final answer. This ought to take on the order of a very few milliseconds. Not fast enough? Just buy hotter hardware.
If you re-phrase your question along these lines, you'll probably get answers from experts in this kind of programming, which I'm not.
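Independent of the GPU, the bitmap representation itself is easy to sketch on the CPU; here is a hypothetical Python version in which each set is packed into one integer and the intersection size is a bitwise AND plus a popcount (the element-to-bit mapping is assumed).
def to_mask(s, index):                       # index: dict mapping element -> bit position
    m = 0
    for e in s:
        m |= 1 << index[e]
    return m

def intersection_size(mask_a, mask_b):
    return bin(mask_a & mask_b).count("1")   # or (mask_a & mask_b).bit_count() on Python 3.10+

universe = ["a", "b", "c", "d"]
index = {e: i for i, e in enumerate(universe)}
print(intersection_size(to_mask({"a", "b", "c"}, index),
                        to_mask({"b", "c", "d"}, index)))   # prints 2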
One simple trick is to sort the list of sets C in decreasing order by size, then proceed with brute force intersection tests as usual. As you go along, keep track of the set b with the biggest intersection so far. If you find a set whose intersection with the query set s has size |s| (or equivalently, has intersection equal to s -- use whichever of these tests is faster), you can immediately stop and return it as this is the best possible answer. Otherwise, if the next set from C has fewer than |b| elements, you can immediately stop and return b. This can easily be generalised to finding the top k matches.
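A minimal sketch of that sorted brute force with the two early exits, assuming Python sets:
def best_intersection(C_sorted, s):
    """C_sorted: the sets of C sorted by decreasing size; s: the query set."""
    best, best_size = None, -1
    for c in C_sorted:
        if len(c) < best_size:
            return best                # no remaining set can beat the best so far
        inter = len(c & s)
        if inter > best_size:
            best, best_size = c, inter
            if best_size == len(s):
                return best            # s is fully contained; cannot do better
    return best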
I don't see any way to do this in less than O(C) per query, but I have some ideas on how to maximize efficiency. The idea is basically to build a lookup table for each element. If some elements are rare and some are common, you can have positive and negative lookup tables:
s[i]     // your query, an array of size 2 thousand, true/false
sign[i]  // whether the ith element is positive/negative lookup. +/- 1
sets[i]  // a list of all the sets that the ith element belongs/(doesn't) to

query(s):
    overlaps[i]  // an array of size C, initialized to 0's
    for i in len(s):
        if s[i]:
            for j in sets[i]:
                overlaps[j] += sign[i]
    return max_index(overlaps)
Especially if many of your elements are of widely differing probabilities (as you said), this approach should save you some time: very rare or very common elements can be dealt with almost instantly.
To further optimize: you can sort the structure so that the elements that are most common/most rare are dealt with first. After you have done the first e.g. 3/4, you can do a quick pass to see if the closest matching set is so far ahead of the next set that it is not necessary to continue, though again whether that is worthwhile depends on the details of your data's distribution.
Yet another refinement: make sets[i] one of two possible structures: if the element is very rare or common, sets[i] is just a list of the sets that the ith element is in/not in. However, suppose the ith element is in half the sets. Then sets[i] is just a list of indices half as long as the number of sets, looping through it and incrementing overlaps is wasteful. Have a third value for sign[i]: if sign[i] == 0, then the ith element is relatively close to 50% commonality (this may just mean between 5% and 95%, or anything else), and instead of a list of sets in which it appears, it will simply be an array of 1's and 0's with length equal to C. Then you would just add the array in its entirety to overlaps which would be faster.
Put all of your elements from the million sets into a Hashtable. The key will be the element; the value will be a set of indexes that point to a containing set.
HashSet<Element>[] AllSets = ...

// preprocess
Hashtable AllElements = new Hashtable(2000);
for (var index = 0; index < AllSets.Length; index++) {
    foreach (var elm in AllSets[index]) {
        if (!AllElements.ContainsKey(elm)) {
            AllElements.Add(elm, new HashSet<int>() { index });
        } else {
            ((HashSet<int>)AllElements[elm]).Add(index);
        }
    }
}
public List<HashSet<Element>> TopIntersect(HashSet<Element> set, int top = 1) {
    // <index, count>
    Dictionary<int, int> counts = new Dictionary<int, int>();
    foreach (var elm in set) {
        var setIndices = AllElements[elm] as HashSet<int>;
        if (setIndices != null) {
            foreach (var index in setIndices) {
                if (!counts.ContainsKey(index)) {
                    counts.Add(index, 1);
                } else {
                    counts[index]++;
                }
            }
        }
    }
    return counts.OrderByDescending(kv => kv.Value)
        .Take(top)
        .Select(kv => AllSets[kv.Key]).ToList();
}

Divide a group of people into two disjoint subgroups (of arbitrary size) and find some values

As we know from programming, sometimes a slight change in a problem can
significantly alter the form of its solution.
Firstly, I want to create a simple algorithm for solving
the following problem and classify it using big-theta notation:
Divide a group of people into two disjoint subgroups
(of arbitrary size) such that the
difference in the total ages of the members of
the two subgroups is as large as possible.
Now I need to change the problem so that the desired
difference is as small as possible and classify
my approach to the problem.
Well, first of all I need to create the initial algorithm.
For that, should I do some kind of sorting in order to separate the teams, and how am I supposed to continue?
EDIT: for the first problem, we have ruled out the possibility of a set being empty. So all we have to do is a linear search to find the minimum age and put it in set B. Set A then has all the other ages except the minimum age, which is in set B. This gives the largest possible difference between the total ages of the two sets.
The way you described the first problem, it is trivial: it only requires you to find the minimum element (in case each subgroup must contain at least one member); otherwise it is already solved.
The second problem can be solved recursively; the pseudocode would be:
// compute the sum of all elements of the array and store it in sum
min = sum;
globalVec = baseVec;

fun generate(baseVec, generatedVec, position, total)
    if (abs(sum - 2*total) < min) {   // check if this distribution is better
        min = abs(sum - 2*total);
        globalVec = generatedVec;
    }
    if (position >= baseVec.length()) return;
    else {
        // either put the element at position into the first group:
        generate(baseVec, generatedVec.pushback(baseVec[position]), position + 1, total + baseVec[position]);
        // or put the element at position into the second group:
        generate(baseVec, generatedVec, position + 1, total);
    }
Now just start the function with generate(baseVec, "", 0, 0), where "" stands for an empty vector.
The algorithm can be drastically improved by applying it to a sorted array and adding a test condition to stop branching early, but the idea stays the same.
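Here is a hypothetical Python version of the same recursion, for reference (the example ages are made up):
def min_difference_split(ages):
    total = sum(ages)
    best = {"diff": total, "group": []}

    def generate(position, group, group_total):
        diff = abs(total - 2 * group_total)
        if diff < best["diff"]:                  # check if this distribution is better
            best["diff"], best["group"] = diff, list(group)
        if position == len(ages):
            return
        # either the person at `position` joins the first subgroup...
        group.append(ages[position])
        generate(position + 1, group, group_total + ages[position])
        group.pop()
        # ...or the second subgroup
        generate(position + 1, group, group_total)

    generate(0, [], 0)
    return best["diff"], best["group"]

print(min_difference_split([40, 12, 33, 25, 9]))   # smallest difference and one subgroup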

Algorithm for discrete similarity metric

Given that I have two lists that each contain a separate subset of a common superset, is
there an algorithm to give me a similarity measurement?
Example:
A = { John, Mary, Kate, Peter } and B = { Peter, James, Mary, Kate }
How similar are these two lists? Note that I do not know all elements of the common superset.
Update:
I was unclear and I have probably used the word 'set' in a sloppy fashion. My apologies.
Clarification: Order is of importance.
If identical elements occupy the same position in the list, we have the highest similarity for that element.
The similarity decreases the farther apart the identical elements are.
The similarity is even lower if the element only exists in one of the lists.
I could even add the extra dimension that lower indices are of greater value, so a[1] == b[1] is worth more than a[9] == b[9], but that is mainly because I am curious.
The Jaccard Index (aka Tanimoto coefficient) is used precisely for the use case recited in the OP's question.
The Tanimoto coeff, tau, is equal to Nc divided by Na + Nb - Nc, or
tau = Nc / (Na + Nb - Nc)
Na, number of items in the first set
Nb, number of items in the second set
Nc, the size of the intersection of the two sets, i.e. the number of unique items common to both a and b
Here's Tanimoto coded as a Python function:
def tanimoto(x, y):
    c = [ns for ns in x if ns in y]   # the shared items, Nc
    return len(c) / (len(x) + len(y) - len(c))
I would explore two strategies:
Treat the lists as sets and apply set ops (intersection, difference)
Treat the lists as strings of symbols and apply the Levenshtein algorithm
If you truly have sets (i.e., an element is simply either present or absent, with no count attached) and only two of them, just adding the number of shared elements and dividing by the total number of elements is probably about as good as it gets.
If you have (or can get) counts and/or more than two of them, you can do a bit better than that with something like cosine similarity or TF-IDF (term frequency * inverse document frequency).
The latter attempts to give lower weight to words that appear in all (or nearly all) of the "documents", i.e. sets of words.
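A minimal sketch of the cosine-similarity option over element counts (plain counts, no TF-IDF weighting; the input format is assumed):
from collections import Counter
from math import sqrt

def cosine_similarity(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity(["John", "Mary", "Kate", "Peter"],
                        ["Peter", "James", "Mary", "Kate"]))   # counts ignore order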
What is your definition of "similarity measurement"? If all you want is how many items the two sets have in common, you could add the cardinalities of A and B together and subtract the cardinality of the union of A and B, which gives the size of the intersection.
If order matters, you can use Levenshtein distance or another kind of edit distance.
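For the order-sensitive case, one hypothetical option is Python's difflib, which computes an edit-distance-style similarity ratio over arbitrary hashable sequence elements:
from difflib import SequenceMatcher

A = ["John", "Mary", "Kate", "Peter"]
B = ["Peter", "James", "Mary", "Kate"]

similarity = SequenceMatcher(None, A, B).ratio()   # 1.0 = identical order and content
print(similarity)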

Resources