Generating in-order constrained sets - algorithm

First I will paste the scenario and then pose my question:
Suppose you have a list of Categories, for example:
Food,Meat,Dairy,Fruit,Vegetable,Grain,Wheat,Barley
Now you have a list of items, each of which fits into one or more of the categories listed above.
Here is a sample list of items:
Pudding,Cheese,Milk,Chicken,Barley,Bread,Couscous,Fish,Apple,Tomato,
Banana,Grape,Lamb,Roast,Honey,Potato,Rice,Beans,Legume,Barley Soup
As you can see, every item fits into at least one category; it could fit into more, possibly all, but the minimum is always one.
For example Cheese is a Food and Dairy.
Each item has two attributes:
1) A Price Tag
2) A Random Value
A set is defined as having every category mapped to an item.
In other words all categories must be present in a set.
A set from the items above could be:
[Pudding,Lamb,Milk,Apple,Tomato,Legume,Bread,Barley Soup]
As you see each item is mapped to a category slot:
Pudding is mapped to Food Category
Lamb is mapped to Meat Category
Milk is mapped to Dairy Category
Apple is mapped to Fruit Category
Tomato is mapped to Vegetable Category
Legume is mapped to Grain Category
Bread is mapped to Wheat Category
Barley Soup is mapped to Barley Category
My question is: what is the most efficient algorithm for generating in-order sets of the above categories from a given list of items?
The best set is defined as having the highest total Random Value.
The only constraint is that any generated set cannot, in total, exceed a certain fixed amount; in other words, all generated sets must stay within this Price Cap.
Hope I am clear, thank you!

What you are trying to achieve is a form of maximal matching, and I don't know if there is an efficient way to compute in-order sets, but still this reduction might help you.
Define a bipartite graph with, on one side, one node per category and, on the other side, one node per item. Add an edge between an item and a category if that item belongs in that category, with a weight given by the random value of the item.
A "set" as you defined it is a maximum-cardinality matching in that graph.
They can be enumerated in reasonable time, as proved by Takeaki Uno in
"A Fast Algorithm for Enumerating Non-Bipartite Maximal Matchings", and it is likely to be even faster in your situation because your graph is bipartite.
Among those sets, you are looking for the ones with maximal weight and under a price constraint. Depending on your data, it may be enough to just enumerate them all, filter them based on the price, and sort the remaining results if there are not too many.
If that is not the case, then you may find the best set by solving the combinatorial optimization problem whose objective function is the total weight, and whose constraints are the price limit and the cardinality (known as maximum-weight matching in the literature). There may even be existing solvers you can use once you write the problem in this form. However, this will only provide one such set rather than a sorted list, but as this problem is very hard in the general case, this is the best you can hope for. You would need more assumptions on your data to get better results (like bounds on the maximum number of such sets, the maximum number of items that can belong to more than k categories, etc.)
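For concreteness, a minimal C++ sketch of that reduction might look like the following (the bitmask encoding of category membership and all of the names here are my own assumptions, not something given in the question):

#include <vector>

struct Edge {
    int item;      // item node
    int category;  // category node
    int weight;    // the item's random value
};

// Build the weighted bipartite graph: one edge per (item, category) membership.
// Assumes at most 32 categories, encoded as bits of itemCategoryMask[i].
std::vector<Edge> buildBipartiteGraph(int numItems, int numCategories,
                                      const std::vector<unsigned>& itemCategoryMask,
                                      const std::vector<int>& randomValue) {
    std::vector<Edge> edges;
    for (int i = 0; i < numItems; ++i)
        for (int c = 0; c < numCategories; ++c)
            if (itemCategoryMask[i] & (1u << c))          // item i belongs to category c
                edges.push_back({i, c, randomValue[i]});  // edge weight = random value
    return edges;
}

A "set" in the question's sense is then a matching in this graph that covers every category node, and any matching-enumeration or maximum-weight matching routine can be run over this edge list.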

Alright, here is my second try to answer this question.
Let's say we have the following input:
class Item {
public:
    // category bitmask: a single unsigned int, so membership checks are bitwise
    unsigned int category;
    int name;
    int value;
    int price;
    ...
};

class ItemSet
{
public:
    set<Item> items;
    int sum;
};
First, sort the input data by highest random value, then by lowest price:
bool operator<(const Item& item1, const Item& item2) {
    if (item1.value == item2.value) {
        if (item1.price == item2.price) {
            return item1.name < item2.name;
        }
        return item1.price > item2.price;
    }
    return item1.value < item2.value;
}
...
vector<Item> v = generateTestItem();
// sort in descending order (highest value first, then lowest price);
// greater<Item> would need an operator>, so reverse iterators are used with operator< instead
sort(v.rbegin(), v.rend());
Next, use backtracking to collect the top sets into a heap until the stopping conditions are met. Having the input data sorted guarantees that the collected sets are the top ones by highest value and lowest price. One more thing to note: I compare item categories (currentCats) with bit manipulation, which gives an O(1) check.
priority_queue<ItemSet> output;

void helper(vector<Item>& input, set<Item>& currentItems, unsigned int currentCats, int sum, int index)
{
    if (index == input.size()) {
        // index reached the end of the input: record the current set and exit
        addOutput(currentItems);
        return;
    }
    if (output.size() >= TOP_X) {
        // the output already holds the required number of sets: exit
        return;
    }
    if (sum + input[index].price < PRICE_TAG) {
        if ((input[index].category & currentCats) == 0) {
            // this category does not exist in currentCats yet: take the item
            currentItems.insert(input[index]);
            helper(input, currentItems, currentCats | input[index].category, sum + input[index].price, index + 1);
        }
    } else {
        // adding this item would exceed the price cap
        addOutput(currentItems);
        return;
    }
    // backtrack and also try skipping the current item
    if (currentItems.find(input[index]) != currentItems.end()) {
        currentItems.erase(input[index]);
    }
    helper(input, currentItems, currentCats, sum, index + 1);
    return;
}

void getTopItems(vector<Item>& items)
{
    set<Item> myset;
    helper(items, myset, 0, 0, 0);
}
In the worst case this backtracking runs in O(2^N) time, but since TOP_X is a limited value it should not take too long in practice.
I tested this code by generating random values and it seems to work fine. Full code can be found here.

I'm not exactly sure what you mean by "generating in-order sets".
I think any algorithm is going to generate sets, score them, and then try to generate better sets. Given all the constraints, I do not think you can generate the best set efficiently in one pass.
The 0-1 knapsack problem has been shown to be NP-hard, which means there is no known polynomial time (i.e. O(n^k)) solution. That problem is the same as you would have if, in your input, the random number was always equal to the price and there was only 1 category. In other words, your problem is at least as hard as the knapsack problem, so you cannot expect to find a guaranteed polynomial time solution.
You can generate all valid sets combinatorially pretty easily using nested loops: loop per category, looping over the items in that category. Early on you can improve the efficiency by skipping over an item if it has already been chosen and by skipping over the whole set once you find it is over the price cap. Put those results in a heap and then you can spit them out in order.
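A rough C++ sketch of that enumeration (all names here are mine; itemsByCategory[c] is assumed to list the indices of the items that fit category c, and the per-category loops are written as a recursion because the number of categories is not fixed):

#include <queue>
#include <vector>

struct Candidate {
    int totalValue;
    int totalPrice;
    std::vector<int> chosen;                  // one item index per category
    bool operator<(const Candidate& o) const  // heap orders by total random value
    { return totalValue < o.totalValue; }
};

void enumerate(const std::vector<std::vector<int>>& itemsByCategory,
               const std::vector<int>& price, const std::vector<int>& value,
               int priceCap, size_t cat, std::vector<int>& chosen,
               std::vector<char>& used, int sumPrice, int sumValue,
               std::priority_queue<Candidate>& heap) {
    if (cat == itemsByCategory.size()) {               // every category slot filled
        heap.push({sumValue, sumPrice, chosen});
        return;
    }
    for (int i : itemsByCategory[cat]) {
        if (used[i]) continue;                         // item already chosen for another slot
        if (sumPrice + price[i] > priceCap) continue;  // would exceed the price cap
        used[i] = 1;
        chosen.push_back(i);
        enumerate(itemsByCategory, price, value, priceCap, cat + 1,
                  chosen, used, sumPrice + price[i], sumValue + value[i], heap);
        chosen.pop_back();
        used[i] = 0;
    }
}

Popping the heap then yields the surviving sets in decreasing order of total random value.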
If your issue is that you want something with better performance than that, it seems to me more like a constraint programming, or, more specifically, a constraint satisfaction problem. I suggest you look at the techniques used to handle those kinds of problems.

Related

Data Structure with fast access to nth of elements satisfying condition

I'm filling a stack/vector (a dynamically sized container with fast random access by index with insertion only at the end) with composite data (a struct, class, tuple…). For a specific attribute with a small set of possible values, I will want to access the nth of all elements in the stack where this attribute satisfies a condition. To achieve this, additional information can be stored along each composite or in a separate data structure.
Note that the vector is large and that the compared attribute has a small value range but is compared to a set of allowed values. Also the attributes aren't distributed evenly throughout composites in the vector.
Pseudocode of an O(n) naïve approach. How can I improve this:
enum Fruit { apple, orange, banana, potato };
struct c {
    Fruit a;
    Data d;
};
// Let's assume v has a length of many thousand and that the distribution of fruits is *not* completely random e.g. that maybe potato only rarely occurs or that bananas tend to come in packs
c getFruit(const vector<c>& v, const set<Fruit>& s, int n) {
    int counter = 0;
    // iterate over all of v's indices
    for (size_t i = 0; i < v.size(); i += 1) {
        if (s.count(v[i].a)) {
            if (n == counter) {
                return v[i];
            }
            counter += 1;
        }
    }
    return {}; // n is out of range
}
// note: The attribute is compared to a set (arbitrary combination of fruits)!
getFruit(largeVector, set{apple, orange, potato}, 15234)
Another approach would be to create a vector for each possible set of fruits which would be super fast O(1) but not so memory efficient.
(Although I do have to implement this now, I'm really just asking out of curiosity because my data is small enough to just go with the naïve approach.)
Any argument for why there doesn't seem to be a more efficient way is very welcome as well.
Edit: It should be noted that new elements may be appended between queries for indices using the algorithm in question so any caches have to grow with the vector and both growing the vector and this filtered access should be fast.
For each index of the vector, store the counts of each fruit among the preceding elements.
Then you can do a binary search to find the first index where the sum of the desired fruit counts is sufficient.
If you don't want to use that much memory, then store the counts in separate arrays, and only store them for every 16th index or so of the main array. Your binary search will then get you an index within 16 positions of the desired answer, and you can do a linear scan from there.
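A minimal sketch of that idea, reusing hypothetical versions of the types from the question (prefix[i] holds the per-fruit counts for v[0..i), so appending an element to the vector only appends one more row):

#include <array>
#include <set>
#include <vector>

enum Fruit { apple, orange, banana, potato, FRUIT_COUNT };  // FRUIT_COUNT added for sizing
struct c { Fruit a; /* Data d; payload omitted */ };

std::vector<std::array<int, FRUIT_COUNT>> prefix;  // prefix[i] = fruit counts in v[0..i)

void rebuildPrefix(const std::vector<c>& v) {
    prefix.assign(v.size() + 1, std::array<int, FRUIT_COUNT>{});
    for (size_t i = 0; i < v.size(); ++i) {
        prefix[i + 1] = prefix[i];
        prefix[i + 1][v[i].a] += 1;
    }
}

// Index of the n-th (0-based) element whose fruit is in s, in O(log |v|) probes.
// Assumes at least n+1 matching elements exist.
int getFruitIndex(const std::set<Fruit>& s, int n) {
    int lo = 0, hi = (int)prefix.size() - 1;      // binary search over prefix boundaries
    while (lo < hi) {
        int mid = (lo + hi) / 2, cnt = 0;
        for (Fruit f : s) cnt += prefix[mid][f];  // matching elements in v[0..mid)
        if (cnt <= n) lo = mid + 1; else hi = mid;
    }
    return lo - 1;                                // v[lo - 1] is the n-th match
}

The every-16th-index variant simply stores the prefix rows sparsely and finishes with a short linear scan.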

Incorrect Recursive approach to finding combinations of coins to produce given change

I was recently doing a Project Euler problem (namely #31), which was basically finding out how many ways we can sum to 200 using elements of the set {1,2,5,10,20,50,100,200}.
The idea that I used was this: the number of ways to sum to N is equal to
(the number of ways to sum to N-k) * (the number of ways to sum to k), summed over all possible values of k.
I realized that this approach is WRONG, namely because it creates several duplicate counts. I have tried to adjust the formula to avoid duplicates, but to no avail. I am seeking the wisdom of Stack Overflowers regarding:
whether my recursive approach is concerned with the correct subproblem to solve
If there exists one, what would be an effective way to eliminate duplicates
how should we approach recursive problems such that we are concerned with the correct subproblem? what are some indicators that we've chosen a correct (or incorrect) subproblem?
When trying to avoid duplicate permutations, a straightforward strategy that works in most cases is to only create rising or falling sequences.
In your example, if you pick a value and then recurse with the whole set, you will get duplicate sequences like 50,50,100 and 50,100,50 and 100,50,50. However, if you recurse with the rule that the next value should be equal to or smaller than the currently selected value, out of those three you will only get the sequence 100,50,50.
So an algorithm that counts only unique combinations would be e.g.:
function uniqueCombinations(set, target, previous) {
    for all values in set not greater than previous {
        if value equals target {
            increment count
        }
        if value is smaller than target {
            uniqueCombinations(set, target - value, value)
        }
    }
}

uniqueCombinations([1,2,5,10,20,50,100,200], 200, 200)
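In case a runnable version is useful, here is a close C++ translation of that pseudocode; it returns the count instead of incrementing a global counter:

#include <vector>

long long uniqueCombinations(const std::vector<int>& set, int target, int previous) {
    long long count = 0;
    for (int value : set) {
        if (value > previous) continue;  // only non-increasing sequences
        if (value == target) ++count;    // exact sum reached
        else if (value < target)         // still room: recurse on the remainder
            count += uniqueCombinations(set, target - value, value);
    }
    return count;
}

// uniqueCombinations({1,2,5,10,20,50,100,200}, 200, 200) should return 73682,
// matching the result shown further down.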
Alternatively, you can create a copy of the set before every recursion, and remove the elements from it that you don't want repeated.
The rising/falling sequence method also works with iterations. Let's say you want to find all unique combinations of three letters. This algorithm will print results like a,c,e, but not a,e,c or e,a,c:
for letter1 is 'a' to 'x' {
    for letter2 is first letter after letter1 to 'y' {
        for letter3 is first letter after letter2 to 'z' {
            print [letter1,letter2,letter3]
        }
    }
}
m69 gives a nice strategy that often works, but I think it's worthwhile to better understand why it works. When trying to count items (of any kind), the general principle is:
Think of a rule that classifies any given item into exactly one of several non-overlapping categories. That is, come up with a list of concrete categories A, B, ..., Z that will make the following sentence true: An item is either in category A, or in category B, or ..., or in category Z.
Once you have done this, you can safely count the number of items in each category and add these counts together, comfortable in the knowledge that (a) any item that is counted in one category is not counted again in any other category, and (b) any item that you want to count is in some category (i.e., none are missed).
How could we form categories for your specific problem here? One way to do it is to notice that every item (i.e., every multiset of coin values that sums to the desired total N) either contains the 50-coin exactly zero times, or it contains it exactly once, or it contains it exactly twice, or ..., or it contains it exactly RoundDown(N / 50) times. These categories don't overlap: if a solution uses exactly 5 50-coins, it pretty clearly can't also use exactly 7 50-coins, for example. Also, every solution is clearly in some category (notice that we include a category for the case in which no 50-coins are used). So if we had a way to count, for any given k, the number of solutions that use coins from the set {1,2,5,10,20,50,100,200} to produce a sum of N and use exactly k 50-coins, then we could sum over all k from 0 to N/50 and get an accurate count.
How to do this efficiently? This is where the recursion comes in. The number of solutions that use coins from the set {1,2,5,10,20,50,100,200} to produce a sum of N and use exactly k 50-coins is equal to the number of solutions that sum to N-50k and do not use any 50-coins, i.e. use coins only from the set {1,2,5,10,20,100,200}. This of course works for any particular coin denomination that we could have chosen, so these subproblems have the same shape as the original problem: we can solve each one by simply choosing another coin arbitrarily (e.g. the 10-coin), forming a new set of categories based on this new coin, counting the number of items in each category and summing them up. The subproblems become smaller until we reach some simple base case that we process directly (e.g. no allowed coins left: then there is 1 item if N=0, and 0 items otherwise).
I started with the 50-coin (instead of, say, the largest or the smallest coin) to emphasise that the particular choice used to form the set of non-overlapping categories doesn't matter for the correctness of the algorithm. But in practice, passing explicit representations of sets of coins around is unnecessarily expensive. Since we don't actually care about the particular sequence of coins to use for forming categories, we're free to choose a more efficient representation. Here (and in many problems), it's convenient to represent the set of allowed coins implicitly as simply a single integer, maxCoin, which we interpret to mean that the first maxCoin coins in the original ordered list of coins are the allowed ones. This limits the possible sets we can represent, but here that's OK: If we always choose the last allowed coin to form categories on, we can communicate the new, more-restricted "set" of allowed coins to subproblems very succinctly by simply passing the argument maxCoin-1 to it. This is the essence of m69's answer.
There's some good guidance here. Another way to think about this is as a dynamic program. For this, we must pose the problem as a simple decision among options that leaves us with a smaller version of the same problem. It boils down to a certain kind of recursive expression.
Put the coin values c0, c1, ... c_(n-1) in any order you like. Then define W(i,v) as the number of ways you can make change for value v using coins ci, c_(i+1), ... c_(n-1). The answer we want is W(0,200). All that's left is to define W:
W(i,v) = sum_[k = 0..floor(v/ci)] W(i+1, v - ci*k)
In words: the number of ways we can make change with coins ci onward is to sum up all the ways we can make change after a decision to use some feasible number k of coins ci, removing that much value from the problem.
Of course we need base cases for the recursion. This happens when i=n-1: the last coin value. At this point there's a way to make change if and only if the value we need is an exact multiple of c_(n-1).
W(n-1,v) = 1 if v % c_(n-1) == 0 and 0 otherwise.
We generally don't want to implement this as a simple recursive function. The same argument values occur repeatedly, which leads to an exponential (in n and v) amount of wasted computation. There are simple ways to avoid this. Tabular evaluation and memoization are two.
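For instance, a memoized sketch of W (coin list hard-coded here, in the descending order discussed next) could look like:

#include <map>
#include <utility>
#include <vector>

const std::vector<int> coins = {200, 100, 50, 20, 10, 5, 2, 1};
std::map<std::pair<int, int>, long long> cache;   // (i, v) -> W(i, v)

long long W(int i, int v) {
    if (i == (int)coins.size() - 1)               // last coin: one way iff it divides v
        return v % coins[i] == 0;
    auto it = cache.find({i, v});
    if (it != cache.end()) return it->second;     // already computed
    long long sum = 0;
    for (int k = 0; k * coins[i] <= v; ++k)       // use k coins of value coins[i]
        sum += W(i + 1, v - k * coins[i]);
    return cache[{i, v}] = sum;
}

// W(0, 200) == 73682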
Another point is that it is more efficient to have the values in descending order. By taking big chunks of value early, the total number of recursive evaluations is minimized. Additionally, since c_(n-1) is now 1, the base case is just W(n-1,v) = 1. Now it becomes fairly obvious that we can add a second base case as an optimization: W(n-2,v) = floor(v/c_(n-2)) + 1. That's just the number of times the for loop adds a term W(n-1,·) = 1.
But this is gilding a lily. The problem is so small that exponential behavior doesn't signify. Here is a little implementation to show that order really doesn't matter:
#include <stdio.h>

#define n 8

int cv[][n] = {
    {200,100,50,20,10,5,2,1},
    {1,2,5,10,20,50,100,200},
    {1,10,100,2,20,200,5,50},
};
int *c;

int w(int i, int v) {
    if (i == n - 1) return v % c[n - 1] == 0;
    int sum = 0;
    for (int k = 0; k <= v / c[i]; ++k)
        sum += w(i + 1, v - c[i] * k);
    return sum;
}

int main(int argc, char *argv[]) {
    unsigned p;
    if (argc != 2 || sscanf(argv[1], "%u", &p) != 1 || p > 2) p = 0;
    c = cv[p];
    printf("Ways(%u) = %d\n", p, w(0, 200));
    return 0;
}
Drumroll, please...
$ ./foo 0
Ways(0) = 73682
$ ./foo 1
Ways(1) = 73682
$ ./foo 2
Ways(2) = 73682

Algorithm / Data structure for largest set intersection in a collection of sets with a given set

I have a large collection of several million sets, C. The elements of my sets come from a universe of about 2000 possible elements. I need to know, for a given set, s, which set in C has the largest intersection with s? (Or the k sets in C with the k-largest intersections). I will be making many of these queries, sequentially, for different s.
I know that the obvious way to do this is just to loop over every set in C, compute the intersection, and take the max. Are there any smart data structures / programming tricks that can speed up my search? It would be great if I could do this faster than O(C).
EDIT: approximate answers would be alright too
I don't think there's a clever data structure that will help with asymptotic performance. But this is a perfect map reduce problem. A GPGPU would do nicely. For a universe of 2048 elements, a set as a bitmap is only 256 bytes. 4 million is only a gigabyte. Even a modestly spec'ed Nvidia has that. E.g. programming in CUDA, you'd copy C to graphics card RAM, map a chunk of the gigabyte to each GPU core for searching and then reduce across cores to find the final answer. This ought to take on the order of a very few milliseconds. Not fast enough? Just buy hotter hardware.
If you re-phrase your question along these lines, you'll probably get answers from experts in this kind of programming, which I'm not.
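Even without a GPU, the bitmap representation keeps the brute-force scan cheap; a minimal CPU sketch (sizes and names are illustrative) could be:

#include <bitset>
#include <vector>

constexpr std::size_t UNIVERSE = 2048;          // ~2000 possible elements, rounded up
using SetBits = std::bitset<UNIVERSE>;          // one set = 256 bytes

// Index in C of the set with the largest intersection with s: a linear scan,
// one popcount per set.
std::size_t largestIntersection(const std::vector<SetBits>& C, const SetBits& s) {
    std::size_t best = 0, bestCount = 0;
    for (std::size_t i = 0; i < C.size(); ++i) {
        std::size_t cnt = (C[i] & s).count();   // popcount of the intersection
        if (cnt > bestCount) { bestCount = cnt; best = i; }
    }
    return best;
}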
One simple trick is to sort the list of sets C in decreasing order by size, then proceed with brute force intersection tests as usual. As you go along, keep track of the set b with the biggest intersection so far. If you find a set whose intersection with the query set s has size |s| (or equivalently, has intersection equal to s -- use whichever of these tests is faster), you can immediately stop and return it as this is the best possible answer. Otherwise, if the next set from C has fewer than |b| elements, you can immediately stop and return b. This can easily be generalised to finding the top k matches.
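A sketch of that strategy, again assuming each set is stored as a bitmap and that C has been pre-sorted by decreasing size:

#include <bitset>
#include <vector>

using SetBits = std::bitset<2048>;

std::size_t bestMatch(const std::vector<SetBits>& C, const SetBits& s) {
    std::size_t bestIdx = 0, bestCount = 0;
    const std::size_t want = s.count();
    for (std::size_t i = 0; i < C.size(); ++i) {
        if (C[i].count() < bestCount) break;    // remaining sets are too small to beat b
        std::size_t cnt = (C[i] & s).count();
        if (cnt > bestCount) { bestCount = cnt; bestIdx = i; }
        if (cnt == want) break;                 // intersection equals s: best possible answer
    }
    return bestIdx;
}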
I don't see any way to do this in less than O(C) per query, but I have some ideas on how to maximize efficiency. The idea is basically to build a lookup table for each element. If some elements are rare and some are common, you can have positive and negative lookup tables:
s[i]      // your query, an array of size 2 thousand, true/false
sign[i]   // whether the ith element is a positive/negative lookup, +/- 1
sets[i]   // a list of all the sets that the ith element belongs/(doesn't) to

query(s):
    overlaps[i]   // an array of size C, initialized to 0's
    for i in len(s):
        if s[i]:
            for j in sets[i]:
                overlaps[j] += sign[i]
    return max_index(overlaps)
Especially if many of your elements are of widely differing probabilities (as you said), this approach should save you some time: very rare or very common elements can be dealt with almost instantly.
To further optimize: you can sort the structure so that the elements that are most common/most rare are dealt with first. After you have done the first e.g. 3/4, you can do a quick pass to see if the closest matching set is so far ahead of the next set that it is not necessary to continue, though again whether that is worthwhile depends on the details of your data's distribution.
Yet another refinement: make sets[i] one of two possible structures: if the element is very rare or common, sets[i] is just a list of the sets that the ith element is in/not in. However, suppose the ith element is in half the sets. Then sets[i] is just a list of indices half as long as the number of sets, looping through it and incrementing overlaps is wasteful. Have a third value for sign[i]: if sign[i] == 0, then the ith element is relatively close to 50% commonality (this may just mean between 5% and 95%, or anything else), and instead of a list of sets in which it appears, it will simply be an array of 1's and 0's with length equal to C. Then you would just add the array in its entirety to overlaps which would be faster.
Put all of the elements from the million sets into a Hashtable. The key will be the element, the value will be a set of indexes that point to a containing set.
HashSet<Element>[] AllSets = ...

// preprocess
Hashtable AllElements = new Hashtable(2000);
for (var index = 0; index < AllSets.Length; index++) {
    foreach (var elm in AllSets[index]) {
        if (!AllElements.ContainsKey(elm)) {
            AllElements.Add(elm, new HashSet<int>() { index });
        } else {
            ((HashSet<int>)AllElements[elm]).Add(index);
        }
    }
}

public List<HashSet<Element>> TopIntersect(HashSet<Element> set, int top = 1) {
    // <index, count>
    Dictionary<int, int> counts = new Dictionary<int, int>();
    foreach (var elm in set) {
        var setIndices = AllElements[elm] as HashSet<int>;
        if (setIndices != null) {
            foreach (var index in setIndices) {
                if (!counts.ContainsKey(index)) {
                    counts.Add(index, 1);
                } else {
                    counts[index]++;
                }
            }
        }
    }
    return counts.OrderByDescending(kv => kv.Value)
        .Take(top)
        .Select(kv => AllSets[kv.Key]).ToList();
}

Divide a group of people into two disjoint subgroups (of arbitrary size) and find some values

As we know from programming, sometimes a slight change in a problem can
significantly alter the form of its solution.
Firstly, I want to create a simple algorithm for solving
the following problem and classify it using bigtheta
notation:
Divide a group of people into two disjoint subgroups
(of arbitrary size) such that the
difference in the total ages of the members of
the two subgroups is as large as possible.
Now I need to change the problem so that the desired
difference is as small as possible and classify
my approach to the problem.
Well, first of all I need to create the initial algorithm.
For that, should I do some kind of sorting in order to separate the teams, and how am I supposed to continue?
EDIT: for the first problem, we have ruled out the possibility of a set being an empty set. So all we have to do is a linear search to find the min age and then put it in set B. Set A now has all the other ages except the age in set B, which is the min age. This gives the max difference of the total ages of the two sets, as high as possible.
The way you described the first problem, it is trivial in that it only requires you to find the minimum element (in case the subgroups must contain at least 1 member); otherwise it is already solved.
The second problem can be solved recursively; the pseudocode would be:
// compute the sum of all elements of the array and store it in sum
min = sum;
globalVec = baseVec;

fun generate(baseVec, generatedVec, position, total)
    if (abs(sum - 2*total) < min) { // check if this distribution is better
        min = abs(sum - 2*total);
        globalVec = generatedVec;
    }
    if (position >= baseVec.length()) return;
    else {
        // either put the element at position in the first group:
        generate(baseVec, generatedVec.pushback(baseVec[position]), position + 1, total + baseVec[position]);
        // or put the element at position in the second group:
        generate(baseVec, generatedVec, position + 1, total);
    }

And now just start the function with generate(baseVec, "", 0, 0), where "" stands for an empty vector.
The algorithm can be drastically improved by applying it to a sorted array and adding a test condition to stop branching early, but the idea stays the same.
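For reference, a compact C++ version of that recursion (tracking only the best difference rather than the actual groups, with names of my choosing) might look like:

#include <algorithm>
#include <cstdlib>
#include <numeric>
#include <vector>

// At each position, put the person in group 1 or group 2 and keep the smallest
// |sum1 - sum2| seen; "total" is the running sum of group 1.
static void go(const std::vector<int>& ages, std::size_t pos, int total, int sum, int& best) {
    best = std::min(best, std::abs(sum - 2 * total));
    if (pos == ages.size()) return;
    go(ages, pos + 1, total + ages[pos], sum, best);  // ages[pos] joins group 1
    go(ages, pos + 1, total, sum, best);              // ages[pos] joins group 2
}

int bestDifference(const std::vector<int>& ages) {
    int sum = std::accumulate(ages.begin(), ages.end(), 0);
    int best = sum;                                   // worst case: everybody in one group
    go(ages, 0, 0, sum, best);
    return best;
}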

Permutations with extra restrictions

I have a set of items, for example: {1,1,1,2,2,3,3,3}, and a restricting set of sets, for example {{3},{1,2},{1,2,3},{1,2,3},{1,2,3},{1,2,3},{2,3},{2,3}}. I am looking for permutations of items, but the first element must be 3, and the second must be 1 or 2, etc.
One such permutation that fits is:
{3,1,1,1,2,2,3,3}
Is there an algorithm to count all permutations for this problem in general? Is there a name for this type of problem?
For illustration, I know how to solve this problem for certain types of "restricting sets".
Set of items: {1,1,2,2,3}, Restrictions {{1,2},{1,2,3},{1,2,3},{1,2},{1,2}}. This is equal to 2!/(2-1)!/1! * 4!/2!/2!. Effectively permuting the 3 first, since it is the most restrictive and then permuting the remaining items where there is room.
Also... polynomial time. Is that possible?
UPDATE: This is discussed further at below links. The problem above is called "counting perfect matchings" and each permutation restriction above is represented by a {0,1} on a matrix of slots to occupants.
https://math.stackexchange.com/questions/519056/does-a-matrix-represent-a-bijection
https://math.stackexchange.com/questions/509563/counting-permutations-with-additional-restrictions
https://math.stackexchange.com/questions/800977/parking-cars-and-vans-into-car-van-and-car-van-parking-spots
All of the other solutions here are exponential time, even in cases where they don't need to be. This problem has overlapping subproblems, and so it should be solved with dynamic programming.
What you want to do is write a class that memoizes solutions to subproblems:
class Counter {
    struct Problem {
        unordered_multiset<int> s;
        vector<unordered_set<int>> v;
    };

    int Count(Problem const& p) {
        if (p.v.size() == 0)
            return 1;
        if (m.find(p) != m.end())
            return m[p];
        // Otherwise, attack the problem by choosing either an index 'i' (notes below)
        // or a number 'n'. This code only illustrates choosing an index 'i'.
        Problem smaller_p = p;
        smaller_p.v.erase(smaller_p.v.begin() + i);
        int retval = 0;
        unordered_set<int> tried;  // each distinct value should be tried only once per slot
        for (auto it = p.s.begin(); it != p.s.end(); ++it) {
            if (p.v[i].find(*it) == p.v[i].end())      // value not allowed in slot i
                continue;
            if (!tried.insert(*it).second)             // duplicate value already tried
                continue;
            smaller_p.s.erase(smaller_p.s.find(*it));  // remove a single copy from the pool
            retval += Count(smaller_p);
            smaller_p.s.insert(*it);
        }
        m[p] = retval;
        return retval;
    }

    unordered_map<Problem, int> m;
};
The code illustrates choosing an index 'i', which should be chosen at a place where v[i].size() is small. The other option is to choose a number 'n', which should be one for which there are few locations in v where it can be placed. I'd say the minimum of the two deciding factors should win.
Also, you'll have to define a hash function for Problem -- that shouldn't be too hard using boost's hash stuff.
This solution can be improved by replacing the vector with a set<>, and defining a < operator for unordered_set. This will collapse many more identical subproblems into a single map element, and further mitigate the exponential blow-up.
This solution can be further improved by making Problem instances that are the same except that the numbers are relabeled hash to the same value and compare equal.
You might consider a recursive solution that uses a pool of digits (in the example you provide, it would be initialized to {1,1,1,2,2,3,3,3}), and decides, at the index given as a parameter, which digit to place at this index (using, of course, the restrictions that you supply).
If you like, I can supply pseudo-code.
You could build a tree.
Level 0: Create a root node.
Level 1: Append each item from the first "restricting set" as children of the root.
Level 2: Append each item from the second restricting set as children of each of the Level 1 nodes.
Level 3: Append each item from the third restricting set as children of each of the Level 2 nodes.
...
The permutation count is then the number of leaf nodes of the final tree.
Edit
It's unclear what is meant by the "set of items" {1,1,1,2,2,3,3,3}. If that is meant to constrain how many times each value can be used ("1" can be used 3 times, "2" twice, etc.) then we need one more step:
Before appending a node to the tree, remove the values used on the current path from the set of items. If the value you want to append is still available (e.g. you want to append a "1", and "1" has only been used twice so far) then append it to the tree.
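As a sketch, the leaf count can be computed recursively without materializing the tree (names here are mine; remaining[v] is how many copies of value v are still unused, and restrictions[level] is the allowed set for that position):

#include <map>
#include <set>
#include <vector>

long long countLeaves(std::map<int, int>& remaining,
                      const std::vector<std::set<int>>& restrictions, std::size_t level) {
    if (level == restrictions.size()) return 1;        // a complete path = one permutation
    long long total = 0;
    for (int v : restrictions[level]) {
        auto it = remaining.find(v);
        if (it == remaining.end() || it->second == 0)  // value exhausted on this path
            continue;
        --it->second;                                  // use one copy of v at this level
        total += countLeaves(remaining, restrictions, level + 1);
        ++it->second;                                  // backtrack
    }
    return total;
}

Called with remaining = {1:3, 2:2, 3:3} and the eight restriction sets from the question, this counts exactly the permutations the tree would enumerate.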
To save space, you could build a directed graph instead of a tree.
Create a root node.
Create a node for each item in the first set, and link from the root to the new nodes.
Create a node for each item in the second set, and link from each first-set item to each second-set item.
...
The number of permutations is then the number of paths from the root node to the nodes of the final set.
