Algorithm / Data Structure for Finding Which of Many Sets are Subsets of another Set - performance

Abstract Description:
I have a set of strings, call it the "active set", and a set of sets of strings - call that the "possible set". When a new string is added to the active set, sets from the possible set may suddenly be subsets of the active set because the active set lacked only that string to be a superset of one of the possibles. I need an algorithm to efficiently find these when I add a new string to the active set. Bonus points if the same data structure allows me to efficiently find which of these possible sets are invalidated (no longer a subset) when a string is removed from the active set.
(The reason I framed the problem described below in terms of sets and subsets of strings in the abstract above is that the language I'm writing this in (Io) is dynamically typed. Objects do have a "type" field but it is a string with the name of the object type in it.)
Background:
In my game engine I have GameObjects which can have several types of Representation objects added to them. For instance if a GameObject has physical presence it might have a PhysicsRepresentation added to it (or not if it's not a solid object). It might have various kinds of GraphicsRepresentations added to it, such as a mesh or particle effect (and you can have more than one if you have multiple visual effects attached to the same game object).
The point of this is to separate subsystems, but you can't completely separate everything: for instance when a GameObject has both a PhysicsRepresentation and a GraphicsRepresentation, something needs to create a 3rd object which connects the position of the GraphicsRepresentation to the location of the PhysicsRepresentation. To serve this purpose while still keeping all the components separate, I have Interaction objects. The Interaction object encapsulates the cross-cutting knowledge about how two system components have to interact.
But in order to protect GameObject from having to know too much about Representations and Interactions, GameObject just provides a generic registry where Interaction prototype objects can register to be called when a particular combination of Representations is present in the GameObject. When a new Representation is added to the GameObject, GameObject should look in its registry and activate just those Interaction objects which are newly enabled by the presence of the new Representation plus the existing Representations.
I'm just stuck on what data structure should be used for this registry and how to search it.
Errata:
The sets of strings are not necessarily sorted, but I can choose to store them sorted.
Although an Interaction most commonly will be between two Representations, I do not want to limit it to that; I should be able to have Interactions that trigger with 3 or more different representations, or even interactions that trigger based on just 1 representation.
I want to optimize this for the case of making it as fast as possible to add/remove representations.
I will have many active sets (each game object has an active set), but I have only one possible set (the set of all registered interaction types). So I don't care how long it takes to build the data structure that represents the possible set, because it only needs to be done once provided the algorithm for comparing different active sets is non-destructive of the possible set data structure.

If your sets are really small, the best representation is using bit sets. First, you build a map from strings to consecutive integers 0..N-1, where N is the number of distinct strings. Then you build your sets by bitwise OR-ing 1<<k into a number. This lets you turn your set operations into bitwise operations, which are extremely fast (an intersection is an &, a union is an |, and so on).
Here is an example: Let's say you have two sets, A={quick, brown, fox} and B={brown, lazy, dog}. First, you build a string-to-number map, like this:
quick - 0
brown - 1
fox - 2
lazy - 3
dog - 4
Then your sets would become A=00111b and B=11010b. Their intersection is A&B = 00010b, and their union is A|B = 11111b. You know a set X is a subset of set Y if X == X&Y.
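For example, in Python (using arbitrary-precision integers as bit sets; the string-to-bit map below is just the one from this example):
# Map each distinct string to a bit position, then encode a set as an integer.
bit = {"quick": 0, "brown": 1, "fox": 2, "lazy": 3, "dog": 4}

def encode(strings):
    mask = 0
    for s in strings:
        mask |= 1 << bit[s]
    return mask

A = encode({"quick", "brown", "fox"})   # 0b00111
B = encode({"brown", "lazy", "dog"})    # 0b11010

print(bin(A & B))   # 0b10    -> intersection: just "brown"
print(bin(A | B))   # 0b11111 -> union: all five strings
print(A == A & B)   # False   -> A is not a subset of B

# Subset test from above: X is a subset of Y iff X == X & Y.
active = encode({"quick", "brown"})
print(active == active & A)   # True: {quick, brown} is a subset of A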

One way to do this would be to keep, for each possible set, a count of how many of its strings are not yet in the active set, plus a map from strings to lists of the possible sets containing that string. That way you can update the counts when you add a string to (or remove one from) the active set, and notice when a count drops to zero.
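A minimal sketch of that bookkeeping (Python; the class and method names are mine, and it assumes each string is added to or removed from the active set at most once):
from collections import defaultdict

class SubsetWatcher:
    def __init__(self, possible_sets):
        # possible_sets: name -> set of required strings (one entry per registered Interaction)
        self.required = {name: set(strings) for name, strings in possible_sets.items()}
        # missing[name]: how many required strings are not yet in the active set
        self.missing = {name: len(strings) for name, strings in self.required.items()}
        # index: string -> names of the possible sets that contain it
        self.index = defaultdict(list)
        for name, strings in self.required.items():
            for s in strings:
                self.index[s].append(name)

    def add(self, string):
        """Add a string to the active set; return the possible sets that just became subsets."""
        newly_satisfied = []
        for name in self.index[string]:
            self.missing[name] -= 1
            if self.missing[name] == 0:
                newly_satisfied.append(name)
        return newly_satisfied

    def remove(self, string):
        """Remove a string from the active set; return the possible sets that just got invalidated."""
        invalidated = []
        for name in self.index[string]:
            if self.missing[name] == 0:
                invalidated.append(name)
            self.missing[name] += 1
        return invalidated

# Example:
w = SubsetWatcher({"PhysicsGraphicsLink": {"PhysicsRepresentation", "GraphicsRepresentation"}})
w.add("PhysicsRepresentation")
print(w.add("GraphicsRepresentation"))     # ['PhysicsGraphicsLink'] is newly enabled
print(w.remove("GraphicsRepresentation"))  # ['PhysicsGraphicsLink'] is invalidated
One instance of this would be needed per active set (per GameObject); the string-to-sets index could be shared across them, but the sketch keeps everything in one object for brevity.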
This problem reminds me of firing rules in a rule-based system when a fact becomes true, which corresponds to a new string being added to the active set. Many of these systems use the Rete algorithm (http://en.wikipedia.org/wiki/Rete_algorithm). Drools Expert (http://www.jboss.org/drools/drools-expert.html) is an open-source rule-based system, although it looks like there is a lot of enterprise-system wrapping around it these days.

Related

AMPL: what's a good way to specify equality constraints for large list of pairs of variable-size sets?

I'm working on a problem that involves reconciling data that represents estimates of the same system under two different classification hierarchies. I want to enforce the requirement that equivalent classes or groups of classes have the same sum.
For example, say Classification A divides industries into: Agriculture (sheep/cattle), Agriculture (non-sheep/cattle), Mining, Manufacturing (textiles), Manufacturing (non-textiles), ...
Meanwhile, Classification B has a different breakdown: Agriculture, Mining (iron ore), Mining (non-iron-ore), Manufacturing (chemical), Manufacturing (non-chemical), ...
In this case, any total for A_Agric_SheepCattle + A_Agric_NonSheepCattle should match the equivalent total for B_Agric; A_Mining should match B_MiningIronOre + B_Mining_NonIronOre; and A_MFG_Textiles+A_MFG_NonTextiles should match B_MFG_Chemical+B_MFG_NonChemical.
For bonus complication, one category may be involved in multiple equivalencies, e.g. B_Mining_IronOre might be involved in an equivalency with both A_Mining and A_Mining_Metallic.
I will be working with multi-dimensional tables, with this sort of concordance applied to more than one dimension - e.g. I might be compiling data on Industry x Product, so each equivalency will be used in multiple constraints; hence I need an efficient way to define them once and invoke repeatedly, instead of just setting a direct constraint "A_Agric_SheepCattle + A_Agric_NonSheepCattle = B_Agric".
The most natural way to represent this sort of concordance would seem to be as a list of pairs of sets. The catch is that the set sizes will vary - sometimes we have a 1:1 equivalence, sometimes it's "these 5 categories equate to those 7 categories", etc.
I found this related question which offers two answers for dealing with variable-sized sets. One is to define all set members in a single ordered set with indices, then define the starting index for each set within that. However, this seems unwieldy for my problem; both classifications are likely to be long, so I'd need to be hopping between two loooong lists of industries and two looong lists of indices to see a single equivalency. This seems like it would be a nuisance to check, and hard to modify (since any change to membership for one of the early sets changes the index numbers for all following sets).
The other is to define pairs of long fixed-length sets, and then pad each set to the required length with null members.
This would be a much better option for my purposes since it lets me eyeball a single line and see the equivalence that it represents. But it would require a LOT of padding; most of the equivalence groups will be small but a few might be quite large, and everything has to be padded to the size of the largest expected length.
Is there a better approach?

MPI - communicate 1 element of a big type or more elements of a small type?

In the specific problem I'm dealing with, the processes arranged in a 3D topology have to exchange portions of a 3D array A(:,:,:) with each other. In particular, each one has to send a given number of slices of A to the processes in the six oriented directions (e.g. A(nx-1:nx,:,:) to the process in the positive 1st dimension, A(1:3,:,:) in the negative one, A(:,ny-3:ny,:) in the positive y-dimension, and so on).
In order to do so I'm going to define a set of subarray types (by means of MPI_TYPE_CREATE_SUBARRAY) to be used in communications (maybe MPI_NEIGHBOR_ALLTOALL, or its V or W extension). The question is about which is the better choice, in terms of performance, between:
define 3 subarray types (one for each dimension), each one actually being a 2D array, and then have the communications send a different number of these types along each dimension in the two directions, or
define 6 subarray types (one for each oriented direction), each one still being a 3D array, and then have the communications send one element of the two types along each dimension in the two directions?
Finally, to be more general, as in the title: is it better to define more "basic" MPI derived data types and use counts greater than 1 in the communications, or to define "bigger" types and use counts equal to 1 in the communications?
MPI derived datatypes are defined to provide the library a means of packing and unpacking the data you send.
For basic types (MPI_INT, MPI_DOUBLE, etc.) there's no problem since the data in memory is already contiguous: there are no holes in memory.
For more complex types, such as multidimensional arrays or structures, sending the data as-is may be inefficient because you would probably be sending useless data. For this reason, data is packed into a contiguous array of bytes, sent to the destination, and then unpacked again to restore its original shape.
That being said, you need to create a derived datatype for each different shape in memory. For example, A(1:3,:,:) and A(nx-2:nx,:,:) represent the same datatype, but A(nx-2:nx,:,:) and A(:,nx-2:nx,:) don't. If you correctly specify the stride (the gap between consecutive datatypes), you can even define a 2D derived datatype and then vary the count argument to get more flexibility in your program.
Finally, to answer your last question: this is probably worth benchmarking, although I think the difference will not be very noticeable, since it results in a single MPI message in both cases.
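For what it's worth, here is a minimal mpi4py sketch (my own illustration, not from the question) of the two descriptions for the simplest case, the fully contiguous x-faces of a C-ordered NumPy array: one "big" 3D subarray type used with count = 1, versus a "small" 2D slice type used with count = 2.
from mpi4py import MPI
import numpy as np

nx, ny, nz = 8, 6, 4
A = np.zeros((nx, ny, nz))   # C-ordered, double precision

# "Big type, count = 1": a 3D subarray covering the two slices A[nx-2:nx, :, :].
face_xp = MPI.DOUBLE.Create_subarray([nx, ny, nz], [2, ny, nz], [nx - 2, 0, 0])
face_xp.Commit()

# "Small type, count = 2": one 2D slice A[i, :, :], contiguous in C order;
# sending two of them starting at A[nx-2, 0, 0] describes exactly the same bytes.
slice_x = MPI.DOUBLE.Create_contiguous(ny * nz)
slice_x.Commit()

# Either description could then be used in point-to-point or neighborhood
# collectives; faces along the other dimensions need strided (vector/subarray) types.
face_xp.Free()
slice_x.Free()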

What is the optimal way to choose a set of features for excluding items based on a bitmask when matching against a large set?

Suppose I have a large, static set of objects, and I have an object that I want to match against all of them according to a complicated set of criteria that entails an expensive test.
Suppose also that it's possible to identify a large set of features that can be used to exclude potential matches, thereby avoiding the expensive test. If a feature is present in the object I am testing, then I can exclude any objects in the set that don't have this feature. In other words, the presence of the feature is necessary but not sufficient for the test to pass.
In that case, I can precompute a bitmask for each object in the set indicating whether each feature is present or absent in the object. I can also compute it for the object that I want to test, and then loop through the array like this (pseudo-code):
objectMask = computeObjectMask(myObject)
for(each testObject in objectSet)
{
if((testObject.mask & objectMask) != objectMask)
{
// early out, some features are in objectMask
// but not in testObject.mask, so the test can't pass
}
else if(doComplicatedTest(testObject, myObject))
{
// found a match!
}
}
So my question is: given a limited bitmask size, a large list of possible features, and a table of the frequency of each feature in the object set (plus access to the object set itself if you want to compute correlations between features and so on), what algorithm can I use to choose the optimal set of features to include in my bitmask, so as to maximize the number of early outs and minimize the number of tests?
If I just choose the top x most common features, then the chance of a feature being in both masks is higher, so it seems like the number of early outs would be reduced. However, if I choose the x least common features, then objectMask might frequently be zero, meaning no early outs are possible. It seems pretty easy to experiment and come up with a set of middling-frequency features that gives good performance, but I'm interested in whether there is a theoretical best way of doing it.
Note: the frequency of each feature is assumed to be the same in the set of possible myObjects as in the objectSet, although I'd be interested to know how to handle if it isn't. I'd also be interested to know if there is an algorithm for finding the best feature set given a large sample of candidate objects that are to be matched against the set.
Possible applications: matching an input string against a large number of regexes, matching a string against a large dictionary of words using a criteria such as "must contain the same letters in the same order, but possibly with extra characters inserted anywhere in the word", etc. Example features: "contains the literal character D", "contains the character F followed by the character G later in the string" etc. Obviously the set of possible features will be highly dependent on the specific application.
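As a back-of-the-envelope model (my own, not from the post): if feature i occurs with frequency p_i independently in both the query objects and the object set, it produces an early out with probability p_i(1-p_i), which is maximized at middling frequencies. A quick Python sketch of that heuristic:
def early_out_probability(freqs):
    """P(at least one chosen feature is set in the query mask but missing from the
    candidate's mask), assuming features are independent with the same frequency
    in queries and in the object set."""
    p_no_exclusion = 1.0
    for p in freqs:
        p_no_exclusion *= 1.0 - p * (1.0 - p)
    return 1.0 - p_no_exclusion

def pick_features(all_freqs, k):
    """Greedy choice under the same model: take the k features with the largest
    p*(1-p), i.e. those closest to 50% frequency (ignores correlations entirely)."""
    ranked = sorted(range(len(all_freqs)),
                    key=lambda i: all_freqs[i] * (1 - all_freqs[i]),
                    reverse=True)
    return ranked[:k]

freqs = [0.95, 0.50, 0.40, 0.05, 0.60]      # hypothetical feature frequencies
chosen = pick_features(freqs, 3)
print(chosen, early_out_probability([freqs[i] for i in chosen]))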
You can try the Aho-Corasick algorithm. It's the fastest multi-pattern matcher. Basically it's a finite state machine with failure links computed via a breadth-first search of the trie.

Why is the 'set' data structure said to be unordered?

Well, I think it's perfectly ordered, because the position of the elements depends on their hash function; thus, if the objects are immutable, then after you put them in the set their positions will remain unchanged. And every time you, say, print your set, you'll get exactly the same element order. Sure, you can't predict their positions until their hash values are calculated, but anyway.
Generic / abstract data type
The definition of a set from Aho, Hopcroft, Ullman: 'Data Structures and Algorithms', Addison-Wesley, 1987:
A set is a collection of members (or elements); [...] . All members of a set are different, [...]
The abstract data type set does not have the characteristic of ordered or unordered.
There are some methods defined which operate on a set - but none of them has something to do with ordering (e.g see Martin Richards: 'Data Structures and Algorithms').
Two sets are considered equal if each element of one set is also in the other, and there are no additional elements.
When you write down a set (and therefore all the elements of a set) you need to write them down in some order. Note that this is just a representation of the appropriate set.
Example: A set which contains the elements one, two and three can be written down as {1, 2, 3}, {1, 3, 2}, {3, 1, 2} and so on. These are only different representations of the same set.
Specific implementations
In different programming languages sets are implemented in different ways with different use cases in mind.
In some languages (like Python and Java) the standard set implementations do not expose ordering in their interfaces.
In some languages (like C++) the standard set implementation does expose ordering in its interface.
Example (C++):
Internally, the elements in a set are always sorted from lower to higher following a specific strict weak ordering criterion set on container construction.
template < class Key, class Compare = less<Key>,
           class Allocator = allocator<Key> > class set;
(see C++ set).
The hash function is not (usually) an externally visible or modifiable parameter of a set; moreover, even if you have an implementation where the hash function is known and well characterized, you can't specify the behavior when hash values collide.
The usual summary of this is that the implementation of the set may impose an order, but the interface does not.
To which definition of set are you referring? In my understanding, «set» is a name for a data structure that contains a number of unique elements and usually allows addition and deletion. Everything else is not guaranteed and may be subject to implementation.
It doesn't say there is no order, but there is no specific order for every valid implementation. The use of hashtables is common, but using any type of list or tree is also possible.
So the order might come from a hash function (already lots of possible implementations), or be related to the order of addition, or ...
After you add a new element to a HashSet, it gets modified, and the new element may end up positioned anywhere depending on the hash value calculation. Thus the previous order is not maintained.
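A tiny illustration in Python (not Java's HashSet, but the same idea; the exact output depends on hash values and can differ between runs and versions):
s = set()
for word in ["one", "two", "three", "four", "five"]:
    s.add(word)
print(list(s))   # typically not the insertion order

s.add("six")     # an insert can trigger a rehash, after which existing
print(list(s))   # elements may show up in different positions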

Complex Combinatorial Algorithms

So Wendy's advertises their sandwich as having 256 combinations - meaning there are 8 ingredients you can either have or not have (although I wonder why they would count the combination where you include nothing as valid, but I digress).
A generalized approach allows you to multiply the various states of each selection together, which allows more complex combinations. In this case Wendy's items can only be included or excluded. But some sandwiches might have the option of two kinds of mustard (but not both, to save costs).
These are fairly straightforward. You multiply the number of options together, so for Wendy's it's:
2*2*2*2*2*2*2*2 = 256
If they diversified their mustard selection as above it would be:
2*2*3*2*2*2*2*2 = 384
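(In code this is just a product over the per-item option counts; a one-liner sketch in Python:)
import math

print(math.prod([2] * 8))                    # 256: eight include/exclude ingredients
print(math.prod([2, 2, 3, 2, 2, 2, 2, 2]))   # 384: one ingredient now has three states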
Going further appears to be harder.
If you make sesame seeds a separate item, then they require the bun item. You can have the sesame seed only if you include the bun, and you can have the bun without sesame seeds, but you cannot have sesame seeds without the bun. This can be simplified to a single bun item with three states (none, bun with seeds, bun without) but there are situations where that cannot be done.
Dell's computer configurator, for instance, disallows certain combinations (maybe the slots are all full, items are incompatible when put into same system, etc).
What are the appropriate combinatorial approaches when dealing with significantly more complex systems where items can conflict?
What are good, generalized, approaches to storing such information without having to code for each product/combination/item to catch conflicts?
Is there a simple way to say, "there are X ways to configure your system/sandwich" when the system has to deal with complex conflicting combinations?
HP's high-end server manufacturing facility in California used a custom rule-based system for many years to do just this.
The factory shopfloor build-cycle process included up-front checks to ensure the order was buildable prior to releasing it to the builders and testers.
One of these checks determined whether the order's bill of materials (BOM) conformed to a list of rules specified by the process engineers. For example, if the customer orders processors, ensure they have also ordered sufficient dc-converter parts; or, if they have ordered a certain quantity of memory DIMMs, ensure they have also ordered a daughter-board to accommodate the additional capacity.
A computer science student with a background in compilers would have recognized the code. The code parsed the BOM, internally generating a threaded tree of parts grouped by type. It then applied the rules to the internal tree to make the determination of whether the order conformed.
As a side-effect, the system also generated build documentation for each order which workers pulled up as they built each system. It also generated expected test results for the post-build burn-in process so the testing bays could reference them and determine whether everything was built correctly.
Adam Davis: If I understand correctly, you intend to develop some sort of system that could in effect be used for shopping carts that assist users in purchasing compatible parts.
Problem definition
Well, this is a graph problem (aren't they all?): you have items that are compatible with other items. For example, a Pentium i3-2020 is compatible with any Socket 1155 motherboard; the Asrock H61M-VS is a Socket 1155 motherboard, which is compatible with 2xDDR3 (speed = 1066), and requires a PCI-Express GPU, DDR3 PC RAM {Total(size) <= 16GB}, 4-pin ATX 12V power, etc.
You need to be able to (a) identify whether each item in the basket is satisfied by another item in the basket (i.e. RAM Card has a compatible Motherboard), (b) assign the most appropriate items (i.e. assign USB Hub to Motherboard USB port and Printer to USB Hub if motherboard runs out of USB ports, rather than do it the other way around and leave the hub dry), and (c) provide the user with a function to find a list of satisfying components. Perhaps USB Hubs can always take precedence as they are extensions (but do be aware of it).
Data structures you will need
You will need a simple classification system, i.e. H61M-VS is-a Motherboard, H61M-VS has-a DDR3 memory slot (with speed properties for each slot).
Second to classification and composition, you will need to identify requirements, which is quite simple. Now the simple classification can allow a simple SQL query to find all items that fit a classification.
Testing for a satisfying basket
To test the basket, a configuration needs to be created, identifying which items are being matched with which (i.e. the motherboard's DDR3 slot matches with the 4GB RAM module, the SATA HDD cable connects to the motherboard's SATA port and the PSU's SATA power cable, while the PSU's 4-pin ATX 12V power cable connects to the motherboard).
The simplest thing is just to check whether another satisfying item exists.
Dell's Computer Configurator
You begin with one item, say a Processor. The processor requires a motherboard and a fan, so you can give them a choice of motherboard (adding the processor fan to list_of_things_to_be_satisfied). This continues until there are no more items held in list_of_things_to_be_satisfied. Of course, this all depends on your exact requirements and on knowing which problem(s) you will solve for the user.
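A minimal sketch of that "list_of_things_to_be_satisfied" loop (Python; the catalogue, part names, and requirement tags are all made up for illustration):
# Hypothetical catalogue: what each item provides and what it in turn requires.
PROVIDES = {
    "i3-2120":   {"cpu"},
    "H61M-VS":   {"socket-1155-motherboard"},
    "stock-fan": {"cpu-fan"},
}
REQUIRES = {
    "i3-2120":   ["socket-1155-motherboard", "cpu-fan"],
    "H61M-VS":   [],
    "stock-fan": [],
}

def configure(first_item, choose):
    """Greedy worklist configurator: choose(requirement) picks an item that provides
    the requirement (in a real system this is where the user decides)."""
    basket = [first_item]
    to_satisfy = list(REQUIRES[first_item])
    while to_satisfy:
        need = to_satisfy.pop()
        if any(need in PROVIDES[item] for item in basket):
            continue                       # already satisfied by something in the basket
        item = choose(need)                # e.g. present the user with all matching items
        basket.append(item)
        to_satisfy.extend(REQUIRES[item])  # the new item may bring its own requirements
    return basket

# Example: automatically pick the first catalogue item that provides the requirement.
pick_first = lambda need: next(i for i, p in PROVIDES.items() if need in p)
print(configure("i3-2120", pick_first))    # e.g. ['i3-2120', 'stock-fan', 'H61M-VS']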
There are many ways you can implement this in code, but here is, in my humble opinion, the best way to go about solving the problem before programming anything:
Define Parts & Products (Pre-code)
When defining all the "parts" it will be paramount to identify hierarchy and categorization for the parts. This is true because some rules may be exclusive to a unique part (ex. "brown mustard only"), some categorical (ex. "all mustards"), some by type (ex. "all condiments"), etc.
Build Rule Sets (Pre-code)
Define the rule sets (prerequisites, exclusions, etc.) for each unique part, category, type, and finished product.
It may sound silly, but a lot of care must be taken to ensure the rules are defined with an appropriate scope. For example, if the finished product is a Burger:
Unique Item Rule - "Mushrooms only available with Blue Cheese selected" prerequisite
Categorical Rule - "Only 1 mustard may be selected" exclusive
Type Rule - "Pickles are incompatible with Peppers" exclusive
After having spent so much time on unique/category/type rules for "parts", many designers will overlook rules that apply only to the finished product even when the parts have no conflict.
Product Rule - "Maximum 5 condiments" condition
Product Rule - "Burger must have a bun" prerequisite
This graph of rules can quickly grow very complex.
Suggestions for Building Data Structures (Code)
Ensure your structures accommodate hierarchy and categorization. For example: "brown mustard" and "dijon mustard" are individual objects, and they are both mustards, and both condiments.
Carefully select the right combination of inheritance modeling (base classes) and object attributes (ex. Category property, or HasCondiments flag) to make this work.
Make a private field for RuleSets at each hierarchic object level.
Make public properties for a HasConflicts flag, and a RuleViolations collection.
When a part is added to a product, check against all levels of rules (its own, category, type, and product) -- do this via a public function that can be called from the product. Or for better internalization, you can make an event handler on the part itself.
Write your algorithms (Code)
This is where I suck, and good thing as it is sort of beyond the scope of your question.
The trick with this step will be how to implement, in code, the rules that travel up the tree/graph -- for example, when a specific part has an issue with another part outside its scope, or how its validation gets run when another part is added. My thoughts:
Use a public function methodology for each part. Pass it the product's CurrentParts collection.
On the Product object, have handlers defined to handle OnPartAdded and OnPartRemoved, and have them enumerate the CurrentParts collection and call each part's validation function.
Example Bare-bones Prototype
using System.Collections.Generic;

interface IProduct
{
    void AddPart(Part p);
    void OnAddPart();
}

// base class for products
public class Product : IProduct
{
    // private or no setter. write functions as you like to add/remove parts.
    public ICollection<Part> CurrentParts { get; } = new List<Part>();

    // Add part function adds to collection and triggers a handler.
    public void AddPart(Part p)
    {
        CurrentParts.Add(p);
        OnAddPart();
    }

    // handler for adding a part should trigger part validations
    public void OnAddPart()
    {
        // validate part-scope rules, you'll want to return some message/exception
        foreach (var part in CurrentParts)
        {
            part.ValidateRules(CurrentParts);
        }
        ValidateRules(); // validate Product-scope rules.
    }

    // Product-scope rule validation (stub).
    private void ValidateRules() { }
}

interface IPart
{
    // "object" should be replaced with whatever way you implement your rules
    object RuleSet { get; }
    void ValidateRules(ICollection<Part> otherParts);
}

// base class for parts
public class Part : IPart
{
    public object RuleSet { get; set; } // see note in interface.

    public void ValidateRules(ICollection<Part> otherParts)
    {
        // insert your algorithms here for validating
        // the product parts against this part's rule set.
    }
}
Nice and clean.
As a programmer I would do the following (although I have never actually ever had to do this in real life):
Work out the total number of combinations; usually a straight multiplication of the options as stated in your question will suffice. There's no need to store all these combinations.
Then reduce your total by the exceptions. The exceptions can be stored as just a set of rules, effectively saying which combinations are not allowed.
To work out the total number of combinations allowable, you will have to run through the entire set of exception rules.
If you think of all your combinations as a set, then the exceptions just remove members of that set. But you don't need to store the entire set, just the exceptions, since you can calculate the size of the set quite easily.
"Generating Functions" comes to mind as one construct that can be used when solving this type of problem. I'd note that there are several different generating functions depending on what you want.
In North America, car license plates can be an interesting combinatorial problem: counting all the permutations where there are 36 possible values for each of the 6 or 7 places (how long a plate is depends on where one gets it). However, some combinations are disqualified because they contain swear words or racist words, which makes for a slightly harder problem. For example, there is an infamous N-word with at least a couple of different spellings that wouldn't be allowed on license plates, I'd think.
Another example would be determining all the different orderings of words over a given alphabet that contains some items repeated multiple times. For example, if one wanted all the different ways to arrange the letters of, say, the word "letter", it isn't just 6! (which would be the case for "abcdef") because there are 2 pairs of repeated letters; the count is 6!/(2!·2!) = 180.
L33t can be another way to bring in more complexity when identifying inappropriate words: while a-s-s gets censored, a$$ or #ss may not necessarily be treated the same way, even though it is basically the same term expressed in different ways. I'm not sure whether many special characters like $ or # would appear on license plates, but one could think of parental controls on web content as needing these kinds of algorithms to identify which terms to censor.
You'd probably want to create a data structure that represents an individual configuration uniquely. Then each compatibility rule should be defined in a way that lets it generate the set of all individual configurations that fail that rule. You would then take the union of all the sets generated by all the rules to get the set of all configurations that fail the rules, count the size of that set, and subtract it from the size of the set of all possible configurations.
The hard part is defining the data structure in a way that can be generated by your rules and can have the set operations work on it! That's an exercise for the reader, AKA I've got nothing.
The only thing I can think of right now: if you can build a tree that defines the dependencies between the parts, you have a simple solution.
sandwich
|
|__Bun(2)__sesame(1)
|
|__Mustard(3)
|
|__Mayo(2)
|
|__Ketchup(2)
|
|__Olives(3)
This simply says that you have 2 options for the bun (bun or no bun) and 1 for the sesame, available only if you have a bun - signifying the dependency (if you had a 7 here it would mean 7 sesame types that can exist only when there is a bun),
3 for the mustard, etc.
Then simply multiply the totals of all the branches together.
It is probably possible to formalize the problem as a k-SAT problem. In some cases the problem appears to be NP-complete, and you will have to enumerate all the possibilities to check whether or not they satisfy all the conditions. In some other cases the problem will be easily solvable (when few conditions are required, for instance). This is an active field of research; you will find relevant references on Google Scholar.
In the case of the mustard, you would add a binary entry "mustard_type" for the mustard type and introduce the condition not (not mustard and mustard_type), where mustard is the binary entry for mustard. This imposes the default choice mustard_type == 0 when you do not choose mustard.
For the sesame choice, this is more explicit: not (sesame and not bun).
It thus seems that the cases you propose fall into the 2-SAT family of problems.
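To make the encoding concrete, here is a small brute-force count (plain Python enumeration, not a SAT solver) of the sandwich configurations that satisfy the two conditions above:
from itertools import product

# Binary entries: bun, sesame, mustard, and the extra mustard_type bit.
VARS = ["bun", "sesame", "mustard", "mustard_type"]

def ok(a):
    if a["sesame"] and not a["bun"]:            # not (sesame and not bun)
        return False
    if a["mustard_type"] and not a["mustard"]:  # not (not mustard and mustard_type)
        return False
    return True

count = sum(ok(dict(zip(VARS, bits)))
            for bits in product([False, True], repeat=len(VARS)))
print(count)   # 9 of the 16 raw combinations are valid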
