Complex Combinatorial Algorithms - algorithm

So Wendy's advertises their sandwich as having 256 combinations, meaning there are 8 ingredients you can either have or not have (although I wonder why they would count the combination where you include nothing as valid, but I digress).
A generalized approach is to multiply the number of possible states of each selection together, which handles more complex cases. Wendy's items can only be included or excluded, but some sandwiches might offer a choice of two kinds of mustard (but not both, to save costs).
These cases are fairly straightforward: you multiply the number of options together, so for Wendy's it's:
2*2*2*2*2*2*2*2 = 256
If they diversified their mustard selection as above it would be:
2*2*3*2*2*2*2*2 = 384
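If every option really is independent, the count is just the product of the per-item state counts. A minimal sketch of that computation (the arrays simply mirror the figures above):

using System;
using System.Linq;

class IndependentOptions
{
    static void Main()
    {
        // 8 yes/no ingredients -> 2 states each
        int[] wendys = { 2, 2, 2, 2, 2, 2, 2, 2 };
        // same, but the mustard slot now has 3 states: none, yellow, brown
        int[] withMustardChoice = { 2, 2, 3, 2, 2, 2, 2, 2 };

        Console.WriteLine(wendys.Aggregate(1, (a, b) => a * b));            // 256
        Console.WriteLine(withMustardChoice.Aggregate(1, (a, b) => a * b)); // 384
    }
}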
Going further appears to be harder.
If you make sesame seeds a separate item, then they require the bun item. You can have the sesame seed only if you include the bun, and you can have the bun without sesame seeds, but you cannot have sesame seeds without the bun. This can be simplified to a single bun item with three states (none, bun with seeds, bun without) but there are situations where that cannot be done.
Dell's computer configurator, for instance, disallows certain combinations (maybe the slots are all full, items are incompatible when put into same system, etc).
What are the appropriate combinatorial approaches when dealing with significantly more complex systems where items can conflict?
What are good, generalized, approaches to storing such information without having to code for each product/combination/item to catch conflicts?
Is there a simple way to say, "there are X ways to configure your system/sandwich" when the system has to deal with complex conflicting combinations?

HP's high-end server manufacturing facility in California used a custom rule-based system for many years to do just this.
The factory shopfloor build-cycle process included up-front checks to ensure the order was buildable prior to releasing it to the builders and testers.
One of these checks determined whether the order's bill of materials (BOM) conformed to a list of rules specified by the process engineers. For example, if the customer orders processors, ensure they have also ordered sufficient dc-converter parts; or, if they have ordered a certain quantity of memory DIMMs, ensure they have also ordered a daughter-board to accommodate the additional capacity.
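A minimal sketch of how such conformance rules might be expressed as predicates over the BOM; the part names, thresholds, and helper types here are purely illustrative, not HP's actual rule language:

using System;
using System.Collections.Generic;
using System.Linq;

// A bill of materials is just a multiset of part numbers for this sketch.
class Bom
{
    public Dictionary<string, int> Parts = new Dictionary<string, int>();
    public int Count(string part) => Parts.TryGetValue(part, out var n) ? n : 0;
}

class BomRules
{
    // Each rule inspects the whole BOM and reports a violation message, or null if it passes.
    static readonly List<Func<Bom, string>> Rules = new List<Func<Bom, string>>
    {
        b => b.Count("CPU") > 0 && b.Count("DC-CONVERTER") < b.Count("CPU")
                ? "Each processor needs a dc-converter part." : null,
        b => b.Count("DIMM") > 8 && b.Count("MEM-DAUGHTERBOARD") == 0
                ? "More than 8 DIMMs requires a memory daughter-board." : null,
    };

    public static IEnumerable<string> Violations(Bom b) =>
        Rules.Select(r => r(b)).Where(msg => msg != null);
}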
A computer science student with a background in compilers would have recognized the code. The code parsed the BOM, internally generating a threaded tree of parts grouped by type. It then applied the rules to the internal tree to make the determination of whether the order conformed.
As a side-effect, the system also generated build documentation for each order which workers pulled up as they built each system. It also generated expected test results for the post-build burn-in process so the testing bays could reference them and determine whether everything was built correctly.

Adam Davis: If I understand correctly, you intend to develop some sort of system that could in effect be used for shopping carts that assist users in purchasing compatible parts.
Problem definition
Well, this is a graph problem (aren't they all?): you have items that are compatible with other items. For example, a Pentium i3-2020 is compatible with any Socket 1155 motherboard; the Asrock H61M-VS is a Socket 1155 motherboard, which is compatible with 2x DDR3 (speed = 1066) and requires a PCI-Express GPU, DDR3 PC RAM {Total(size) <= 16GB}, 4-pin ATX 12V power, etc.
You need to be able to (a) identify whether each item in the basket is satisfied by another item in the basket (i.e. RAM Card has a compatible Motherboard), (b) assign the most appropriate items (i.e. assign USB Hub to Motherboard USB port and Printer to USB Hub if motherboard runs out of USB ports, rather than do it the other way around and leave the hub dry), and (c) provide the user with a function to find a list of satisfying components. Perhaps USB Hubs can always take precedence as they are extensions (but do be aware of it).
Data structures you will need
You will need a simple classification system, i.e. H61M-VS is-a Motherboard, H61M-VS has-a DDR3 memory slot (with speed properties for each slot).
In addition to classification and composition, you will need to identify requirements, which is quite simple. The simple classification then allows a simple SQL query to find all items that fit a given classification.
Testing for a satisfying basket
To test the basket, a configuration needs to be created identifying which items are matched with which (i.e. the motherboard's DDR3 slot matches the 4GB RAM module, the SATA HDD's cable connects to the motherboard's SATA port and the PSU's SATA power cable, while the PSU's 4-pin ATX 12V power cable connects to the motherboard).
The simplest thing is just to check whether another satisfying item exists.
Dell's Computer Configurator
You begin with one item, say a processor. The processor requires a motherboard and a fan, so you can give the user a choice of motherboard (adding the processor fan to list_of_things_to_be_satisfied). This continues until there are no more items held in list_of_things_to_be_satisfied. Of course, this all depends on your exact requirements and knowing what problem(s) you will solve for the user.
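A minimal sketch of that worklist idea, assuming each catalog item simply declares what kinds of item it requires (the part names, classes, and catalog are hypothetical):

using System;
using System.Collections.Generic;
using System.Linq;

class ConfiguratorSketch
{
    // Each catalog item says what kind of thing it is and what kinds of thing it needs.
    class Item
    {
        public string Name;
        public string Kind;                       // e.g. "CPU", "Motherboard", "Fan"
        public List<string> Requires = new List<string>();
    }

    static void Main()
    {
        var catalog = new List<Item>
        {
            new Item { Name = "i3-2020",   Kind = "CPU",         Requires = { "Motherboard", "Fan" } },
            new Item { Name = "H61M-VS",   Kind = "Motherboard", Requires = { "PSU" } },
            new Item { Name = "Stock fan", Kind = "Fan" },
            new Item { Name = "450W PSU",  Kind = "PSU" },
        };

        var basket = new List<Item> { catalog[0] };
        var toSatisfy = new Queue<string>(basket[0].Requires);   // list_of_things_to_be_satisfied

        while (toSatisfy.Count > 0)
        {
            string need = toSatisfy.Dequeue();
            if (basket.Any(i => i.Kind == need)) continue;       // already satisfied

            // In a real configurator the user would choose; here we just take the first match.
            var choice = catalog.First(i => i.Kind == need);
            basket.Add(choice);
            foreach (var r in choice.Requires) toSatisfy.Enqueue(r);
        }

        Console.WriteLine(string.Join(", ", basket.Select(i => i.Name)));
    }
}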

There are many ways you can implement this in code, but here is, in my humble opinion, the best way to go about solving the problem before programming anything:
Define Parts & Products (Pre-code)
When defining all the "parts" it will be paramount to identify hierarchy and categorization for the parts. This is true because some rules may be exclusive to a unique part (ex. "brown mustard only"), some categorical (ex. "all mustards"), some by type (ex. "all condiments"), etc.
Build Rule Sets (Pre-code)
Define the rule sets (prerequisites, exclusions, etc.) for each unique part, category, type, and finished product.
It may sound silly, but a lot of care must be taken to ensure the rules are defined with an appropriate scope. For example, if the finished product is a Burger:
Unique Item Rule - "Mushrooms only available with Blue Cheese selected" prerequisite
Categorical Rule - "Only 1 mustard may be selected" exclusive
Type Rule - "Pickles are incompatible with Peppers" exclusive
After having spent so much time on unique/category/type rules for "parts", many designers will overlook rules that apply only to the finished product even when the parts have no conflict.
Product Rule - "Maximum 5 condiments" condition
Product Rule - "Burger must have a bun" prerequisite
This graph of rules can quickly grow very complex.
Suggestions for Building Data Structures (Code)
Ensure your structures accommodate hierarchy and categorization. For example: "brown mustard" and "dijon mustard" are individual objects, and they are both mustards, and both condiments.
Carefully select the right combination of inheritance modeling (base classes) and object attributes (ex. Category property, or HasCondiments flag) to make this work.
Make a private field for RuleSets at each hierarchic object level.
Make public properties for a HasConflicts flag, and a RuleViolations collection.
When a part is added to a product, check against all levels of rules (its own, category, type, and product) -- do this via a public function that can be called from the product. Or for better internalization, you can make an event handler on the part itself.
Write your algorithms (Code)
This is where I suck, and good thing as it is sort of beyond the scope of your question.
The trick with this step will be how to implement in code the rules that travel up the tree/graph -- for example, when a specific part has an issue with another part outside its scope, or how does its validation get run when another part is added? My thought:
Use a public function methodology for each part. Pass it the product's CurrentParts collection.
On the Product object, have handlers defined to handle OnPartAdded and OnPartRemoved, and have them enumerate the CurrentParts collection and call each part's validation function.
Example Bare-bones Prototype
using System.Collections.Generic;

// interface for products
interface IProduct
{
    void AddPart(Part p);
    void OnPartAdded();
}

// base class for products
public class Product : IProduct
{
    // private or no setter. write functions as you like to add/remove parts.
    public ICollection<Part> CurrentParts { get; } = new List<Part>();

    // Add-part function adds to the collection and triggers a handler.
    public void AddPart(Part p)
    {
        CurrentParts.Add(p);
        OnPartAdded();
    }

    // handler for adding a part triggers the part validations
    public void OnPartAdded()
    {
        // validate part-scope rules; you'll want to return some message/exception
        foreach (var part in CurrentParts)
        {
            part.ValidateRules(CurrentParts);
        }
        ValidateRules(); // validate Product-scope rules.
    }

    // Product-scope rules ("maximum 5 condiments", "must have a bun", ...)
    public void ValidateRules()
    {
    }
}

// interface for parts
interface IPart
{
    // "object" should be replaced with whatever way you implement your rules
    object RuleSet { get; }
    void ValidateRules(ICollection<Part> otherParts);
}

// base class for parts
public class Part : IPart
{
    public object RuleSet { get; set; } // see note in interface.

    public void ValidateRules(ICollection<Part> otherParts)
    {
        // insert your algorithms here for validating
        // the product parts against this part's rule set.
    }
}
Nice and clean.

As a programmer I would do the following (although I have never actually ever had to do this in real life):
Work out the total number of combinations; usually a straight multiplication of the options as stated in your question will suffice. There's no need to store all these combinations.
Then subtract the exceptions from your total. The exceptions can be stored as just a set of rules, effectively saying which combinations are not allowed.
To work out the total number of combinations allowable, you will have to run through the entire set of exception rules.
If you think of all your combinations as a set, then the exceptions just remove members of that set. But you don't need to store the entire set, just the exceptions, since you can calculate the size of the set quite easily.
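A minimal sketch of that approach for the sandwich case, treating each exception as a predicate over a configuration and counting allowable configurations by brute-force enumeration (fine at this scale; the single rule shown is just illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

class CountWithExceptions
{
    static void Main()
    {
        string[] items = { "bun", "sesame", "ketchup", "mustard", "mayo", "pickles", "onions", "lettuce" };

        // Exception rules: each one says which configurations are NOT allowed.
        var exceptions = new List<Func<HashSet<string>, bool>>
        {
            cfg => cfg.Contains("sesame") && !cfg.Contains("bun"),   // sesame requires the bun
        };

        long allowed = 0;
        for (int mask = 0; mask < (1 << items.Length); mask++)
        {
            var cfg = new HashSet<string>(
                items.Where((item, i) => (mask & (1 << i)) != 0));
            if (!exceptions.Any(rule => rule(cfg))) allowed++;
        }

        // 256 total, minus the 64 configurations that have sesame but no bun = 192
        Console.WriteLine(allowed);
    }
}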

"Generating Functions" comes to mind as one construct that can be used when solving this type of problem. I'd note that there are several different generating functions depending on what you want.
In North America, car license plates can be an interesting combinatorial counting problem: there are 36 possible values for each of the 6 or 7 positions, depending on where one gets a plate. However, some combinations are disqualified because they spell swear words or racist slurs, which makes for a slightly harder problem. For example, there is an infamous N-word with at least a couple of different spellings that I'd think wouldn't be allowed on license plates.
Another example would be determining all the different arrangements of the letters of a word that contains some letters repeated multiple times. For example, the number of ways to arrange the letters of "letter" isn't just 6! (which would be the answer for "abcdef"), because there are two pairs of repeated letters; the count is 6!/(2!·2!) = 180.
L33t can be another way to bring in more complexity when identifying inappropriate words: while a-s-s gets censored, a$$ or #ss may not be treated the same way even though they are basically the same term expressed in different ways. I'm not sure many special characters like $ or # would appear on license plates, but one could think of parental controls on web content as needing these kinds of algorithms to identify which terms to censor.

You'd probably want to create a data structure that represents an individual configuration uniquely. Then each compatibility rule should be defined in a way where it can generate a set containing all the individual configurations that fail that rule. Then you would take the union of all the sets generated by all the rules to get the set of all configurations that fail the rules. Then you count the size of that set and subtract it from the size of the set of all possible configurations.
The hard part is defining the data structure in a way that can be generated by your rules and can have the set operations work on it! That's an exercise for the reader, AKA I've got nothing.
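For what it's worth, here is one minimal way it could look when every option is binary and a whole configuration fits in an integer, so each rule can literally generate its failing set (the rule and bit layout are assumptions):

using System;
using System.Collections.Generic;
using System.Linq;

class FailingSetUnion
{
    const int Items = 8;                 // 8 yes/no ingredients, bit i = ingredient i present
    const int Bun = 0, Sesame = 1;       // assumed bit positions

    // A rule generates every configuration that violates it.
    static IEnumerable<int> SesameRequiresBun() =>
        Enumerable.Range(0, 1 << Items)
                  .Where(cfg => (cfg & (1 << Sesame)) != 0 && (cfg & (1 << Bun)) == 0);

    static void Main()
    {
        var failing = new HashSet<int>();
        failing.UnionWith(SesameRequiresBun());
        // failing.UnionWith(OtherRule()); ... one UnionWith call per rule

        int total = 1 << Items;
        Console.WriteLine(total - failing.Count);   // 256 - 64 = 192
    }
}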

The only thing I can think of right now is that if you can build a tree that defines the dependencies between the parts, you have a simple solution.
sandwich
|
|__Bun(2)__sesame(1)
|
|__Mustard(3)
|
|__Mayo(2)
|
|__Ketchup(2)
|
|__Olives(3)
This simply says that you have 2 options for the bun (bun or no bun) and 1 extra option for the sesame, available only if you have a bun - signifying the dependency (a 7 here would mean 7 seed types that can exist only if you have a bun);
3 for the mustard, etc.
Then, for each branch, add up the configurations it can contribute (the bun branch gives 3: no bun, bun without sesame, bun with sesame) and multiply the branch totals together.
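A minimal recursive sketch of that counting; note that here a node's Options counts only its "present" variants and the +1 supplies the "absent" case, so the numbers differ slightly from the ones drawn in the tree (names and counts are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

class DependencyTreeCount
{
    class Node
    {
        public string Name;
        public int Options;                         // ways to include this item, given its parent is present
        public List<Node> Children = new List<Node>();
    }

    // Configurations contributed by a node: either absent (1 way),
    // or present in one of its Options forms times its children's configurations.
    static long Count(Node n) =>
        1 + n.Options * n.Children.Aggregate(1L, (acc, c) => acc * Count(c));

    static void Main()
    {
        var root = new List<Node>
        {
            new Node { Name = "Bun",     Options = 1, Children = { new Node { Name = "Sesame", Options = 1 } } },
            new Node { Name = "Mustard", Options = 2 },   // e.g. yellow or brown (or none)
            new Node { Name = "Mayo",    Options = 1 },
            new Node { Name = "Ketchup", Options = 1 },
            new Node { Name = "Olives",  Options = 2 },   // e.g. green or black (or none)
        };

        // Bun: 1 + 1*(1+1) = 3, Mustard: 3, Mayo: 2, Ketchup: 2, Olives: 3 -> 108
        Console.WriteLine(root.Aggregate(1L, (acc, n) => acc * Count(n)));
    }
}

In this model the same recursion handles deeper dependency chains (a seed item that itself has dependent options) without changing the formula.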

It is probably possible to formalize the problem as a k-SAT problem. In some cases, the problem appears to be NP-complete and you will have to enumerate all the possibilities to check whether they satisfy all the conditions. In other cases, the problem will be easily solvable (when few conditions are required, for instance). This is an active field of research; you will find relevant references on Google Scholar.
In the case of the mustard, you would add a binary entry "mustard_type" for the mustard type and introduce the condition: not (not mustard and mustard_type), where mustard is the binary entry for mustard. It imposes the default choice mustard_type == 0 when you choose no mustard.
For the sesame choice, this is more explicit: not (sesame and not bun).
It thus seems that the cases you propose fall into the 2-SAT family of problems.
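Written out as CNF clauses (just a restatement of the two conditions above), each constraint has at most two literals per clause, which is what makes this 2-SAT:

not (not mustard and mustard_type)  ==  (mustard OR not mustard_type)   i.e. mustard_type implies mustard
not (sesame and not bun)            ==  (bun OR not sesame)             i.e. sesame implies bun

2-SAT instances can be checked for satisfiability in linear time via their implication graph, though counting the number of satisfying configurations is, in general, a harder (#SAT-style) question.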

Related

Logoot CRDT: interleaving of data on concurrent edits to the same spot?

I want to implement Logoot for eventually-convergent P2P text editing and I've run into a bit of a problem.
My understanding of Logoot is that the intervals between objects (lines of text in the original paper, but could be characters or words) can be divided infinitely on account of an unbounded identifier. This means that the position of an object is determined not by its neighbors as in WOOT (which would require tombstones) but by a fixed numerical point along the length of the string. Combined with a unique site identifier, this also gives us a total order and enables eventual convergence.
However... doesn't this cause a problem when concurrent edits are made to the same spot? If two disconnected clients start writing new sentences at the same cursor position and then merge, their sentences have a good chance of interleaving.
Below is a whiteboard example of what I'm talking about:
As you can see, both site B and site C divide the interval between "I" and "conquered" according to the rules of Logoot, giving us random points between the positions of (20,A) and (25,A). But nothing orders these points relative to each other, causing them to mix when merged. Meanwhile, neighbor-based algorithms can account for this issue since the causality chain of each object is preserved.
The above is a baby example, but in the more general case, imagine if two users wanted to insert a different sentence between two existing sentences. If one of the users happened to be offline, they shouldn't come back to a garbled mess! Clearly, to preserve intent, one sentence should follow the other.
Am I missing something in my reading of the paper, or is this an inherent downside to Logoot?
(Also, why is there a recorded clock value that's seemingly unused in the algorithm? The paper even points out that each object's identifier is necessarily unique without the clock.)
You're correct, this is a real anomaly in Logoot and LSEQ. Whether or not it constitutes an intention violation depends on what your definition of intention is. An extension to the definition requiring that contiguous sequences remain contiguous unless they are split by a causally subsequent operation would make intuitive sense.
The clock is unnecessary. Most likely the authors used the (site, clock) pair or Lamport timestamp as their UUIDs out of convention. One site can never create two identical positions, so clocks will never need to be compared. (Assuming messages are received from a site in order, which is required for other aspects of Logoot/LSEQ too.)
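A minimal sketch that reproduces the anomaly with single-level Logoot-style identifiers, i.e. (digit, site) pairs ordered lexicographically; the particular digits are hand-picked here, standing in for whatever random or LSEQ-style allocation the sites would actually use:

using System;
using System.Collections.Generic;

class LogootInterleaving
{
    // A (digit, site) identifier, compared by digit first, then site.
    record Id(int Digit, char Site) : IComparable<Id>
    {
        public int CompareTo(Id other) =>
            Digit != other.Digit ? Digit.CompareTo(other.Digit) : Site.CompareTo(other.Site);
    }

    static void Main()
    {
        var doc = new SortedDictionary<Id, string>
        {
            [new Id(20, 'A')] = "I",
            [new Id(25, 'A')] = "conquered",
        };

        // Site B and site C both insert between (20,A) and (25,A) while disconnected.
        doc[new Id(21, 'B')] = "came,";
        doc[new Id(23, 'B')] = "saw,";
        doc[new Id(22, 'C')] = "ate,";
        doc[new Id(24, 'C')] = "left,";

        // After merging, the total order interleaves B's and C's sentences:
        // I came, ate, saw, left, conquered
        Console.WriteLine(string.Join(" ", doc.Values));
    }
}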

What is the optimal way to choose a set of features for excluding items based on a bitmask when matching against a large set?

Suppose I have a large, static set of objects, and I have an object that I want to match against all of them according to a complicated set of criteria that entails an expensive test.
Suppose also that it's possible to identify a large set of features that can be used to exclude potential matches, thereby avoiding the expensive test. If a feature is present in the object I am testing, then I can exclude any objects in the set that don't have this feature. In other words, the presence of the feature is necessary but not sufficient for the test to pass.
In that case, I can precompute a bitmask for each object in the set indicating whether each feature is present or absent in the object. I can also compute it for the object that I want to test, and then loop through the array like this (pseudo-code):
objectMask = computeObjectMask(myObject)
for (each testObject in objectSet)
{
    if ((testObject.mask & objectMask) != objectMask)
    {
        // early out: some features are in objectMask
        // but not in testObject.mask, so the test can't pass
    }
    else if (doComplicatedTest(testObject, myObject))
    {
        // found a match!
    }
}
So my question is, given a limited bitmask size, and a large list of possible features, and a table of the frequencies of each feature in object set (plus access to the object set if you want to compute correlations between features and so on), what algorithm can I use to choose the optimal set of features for inclusion in my bitmask to maximize the number of early outs and minimize the number of tests?
If I just choose the top x most common features, then the chance of a feature being in both masks is higher, so it seems like the number of early outs would be reduced. However, if I choose the x least common features, then objectMask might frequently be zero, meaning no early outs are possible. It seems pretty easy to experiment and come up with a set of middling-frequency features that gives good performance, but I'm interested in whether there is a theoretically best way of doing it.
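One way to see why middling frequencies help, under the simplifying assumption that features occur independently with frequency p in both the tested object and the object set: a single feature rejects a random pair only when it is present in objectMask and absent from testObject.mask, which happens with probability p(1 - p), and that is maximized at p = 1/2. Choosing a whole set of features to maximize the chance that at least one of them rejects (while accounting for correlations between features) is essentially a maximum-coverage style selection problem, for which greedily picking the feature that rejects the most not-yet-rejected pairs in a sample is the usual heuristic.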
Note: the frequency of each feature is assumed to be the same in the set of possible myObjects as in the objectSet, although I'd be interested to know how to handle if it isn't. I'd also be interested to know if there is an algorithm for finding the best feature set given a large sample of candidate objects that are to be matched against the set.
Possible applications: matching an input string against a large number of regexes, matching a string against a large dictionary of words using a criteria such as "must contain the same letters in the same order, but possibly with extra characters inserted anywhere in the word", etc. Example features: "contains the literal character D", "contains the character F followed by the character G later in the string" etc. Obviously the set of possible features will be highly dependent on the specific application.
You can try the Aho-Corasick algorithm. It's the fastest multi-pattern matcher. Basically it's a finite state machine with failure links computed via a breadth-first search of the trie.

Algorithm / Data Structure for Finding Which of Many Sets are Subsets of another Set

Abstract Description:
I have a set of strings, call it the "active set", and a set of sets of strings - call that the "possible set". When a new string is added to the active set, sets from the possible set may suddenly be subsets of the active set because the active set lacked only that string to be a superset of one of the possibles. I need an algorithm to efficiently find these when I add a new string to the active set. Bonus points if the same data structure allows me to efficiently find which of these possible sets are invalidated (no longer a subset) when a string is removed from the active set.
(The reason I framed the problem described below in terms of sets and subsets of strings in the abstract above is that the language I'm writing this in (Io) is dynamically typed. Objects do have a "type" field but it is a string with the name of the object type in it.)
Background:
In my game engine I have GameObjects which can have several types of Representation objects added to them. For instance if a GameObject has physical presence it might have a PhysicsRepresentation added to it (or not if it's not a solid object). It might have various kinds of GraphicsRepresentations added to it, such as a mesh or particle effect (and you can have more than one if you have multiple visual effects attached to the same game object).
The point of this is to separate subsystems, but you can't completely separate everything: for instance when a GameObject has both a PhysicsRepresentation and a GraphicsRepresentation, something needs to create a 3rd object which connects the position of the GraphicsRepresentation to the location of the PhysicsRepresentation. To serve this purpose while still keeping all the components separate, I have Interaction objects. The Interaction object encapsulates the cross-cutting knowledge about how two system components have to interact.
But in order to protect GameObject from having to know too much about Representations and Interactions, GameObject just provides a generic registry where Interaction prototype objects can register to be called when a particular combination of Representations is present in the GameObject. When a new Representation is added to the GameObject, GameObject should look in its registry and activate just those Interaction objects which are newly enabled by the presence of the new Representation, plus the existing Representations.
I'm just stuck on what data structure should be used for this registry and how to search it.
Errata:
The sets of strings are not necessarily sorted, but I can choose to store them sorted.
Although an Interaction most commonly will be between two Representations, I do not want to limit it to that; I should be able to have Interactions that trigger with 3 or more different representations, or even interactions that trigger based on just 1 representation.
I want to optimize this for the case of making it as fast as possible to add/remove representations.
I will have many active sets (each game object has an active set), but I have only one possible set (the set of all registered interaction types). So I don't care how long it takes to build the data structure that represents the possible set, because it only needs to be done once provided the algorithm for comparing different active sets is non-destructive of the possible set data structure.
If your sets are really small, the best representation is using bit sets. First, you build a map from strings to consecutive integers 0..N, where N is the number of distinct strings. Then you build your sets by bitwise OR-ing of 1<<k into a number. This lets you turn your set operations into bitwise operations, which are extremely fast (an intersection is an &; a union is an |, and so on).
Here is an example: Let's say you have two sets, A={quick, brown, fox} and B={brown, lazy, dog}. First, you build a string-to-number map, like this:
quick - 0
brown - 1
fox - 2
lazy - 3
dog - 4
Then your sets would become A=00111b and B=11010b. Their intersection is A&B = 00010b, and their union is A|B = 11111b. You know a set X is a subset of set Y if X == X&Y.
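A minimal sketch of that representation, assuming the number of distinct strings fits in a single 64-bit mask (a larger universe would need a BitArray or an array of ulongs):

using System;
using System.Collections.Generic;
using System.Linq;

class BitSetSubsets
{
    static void Main()
    {
        // map each distinct string to a bit index, assigned on first sight
        var index = new Dictionary<string, int>();
        ulong ToMask(IEnumerable<string> strings) =>
            strings.Aggregate(0UL, (mask, s) =>
            {
                if (!index.TryGetValue(s, out int bit)) index[s] = bit = index.Count;
                return mask | (1UL << bit);
            });

        ulong a = ToMask(new[] { "quick", "brown", "fox" });
        ulong b = ToMask(new[] { "brown", "lazy", "dog" });

        Console.WriteLine(Convert.ToString((long)(a & b), 2)); // intersection: 10 (just "brown")
        Console.WriteLine(Convert.ToString((long)(a | b), 2)); // union: 11111
        Console.WriteLine((a & b) == a);                       // is A a subset of B? False
    }
}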
One way to do this would be to keep, for each possible set, a count of how many of its strings are not yet in the active set, and a map from strings to lists of the possible sets containing that string. You can then update the counts whenever you add or remove a string from the active set, and notice when a count drops to zero.
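A minimal sketch of that bookkeeping; the registry below assumes possible sets are registered before any strings are added, and just prints when a set becomes (or stops being) satisfied. The names come from the game-engine example in the question:

using System;
using System.Collections.Generic;

class SubsetRegistry
{
    class Possible
    {
        public string Name;
        public HashSet<string> Required;
        public int Missing;                       // how many required strings are not in the active set
    }

    readonly Dictionary<string, List<Possible>> byString = new Dictionary<string, List<Possible>>();
    readonly HashSet<string> active = new HashSet<string>();

    // assumes Register is called before any Add/Remove
    public void Register(string name, IEnumerable<string> required)
    {
        var p = new Possible { Name = name, Required = new HashSet<string>(required) };
        p.Missing = p.Required.Count;
        foreach (var s in p.Required)
        {
            if (!byString.TryGetValue(s, out var list)) byString[s] = list = new List<Possible>();
            list.Add(p);
        }
    }

    public void Add(string s)
    {
        if (!active.Add(s) || !byString.TryGetValue(s, out var affected)) return;
        foreach (var p in affected)
            if (--p.Missing == 0)
                Console.WriteLine($"{p.Name} is now satisfied");   // activate the Interaction here
    }

    public void Remove(string s)
    {
        if (!active.Remove(s) || !byString.TryGetValue(s, out var affected)) return;
        foreach (var p in affected)
            if (p.Missing++ == 0)
                Console.WriteLine($"{p.Name} is no longer satisfied");
    }

    static void Main()
    {
        var r = new SubsetRegistry();
        r.Register("PhysicsGraphicsInteraction", new[] { "PhysicsRepresentation", "GraphicsRepresentation" });
        r.Add("GraphicsRepresentation");
        r.Add("PhysicsRepresentation");   // prints: PhysicsGraphicsInteraction is now satisfied
    }
}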
This problem reminds me of firing rules in a rule-based system when a fact becomes true, which corresponds to a new string being added to the active set. Many of these systems use the Rete algorithm (http://en.wikipedia.org/wiki/Rete_algorithm). Drools Expert (http://www.jboss.org/drools/drools-expert.html) is an open-source rule-based system, although it looks like there is a lot of enterprise system wrapping around it these days.

Machine Learning: Good way to represent word features

Not quite sure if this is the right place or not..
But here is my question.
So for features which are numeric in nature, it is quite natural to represent them, plot them, etc., but what about words?
How do you deal with data where you have words as features? So let's say I have a dataset with following features:
InventoryVal, Number of Units, Avg Price, Category of Event and so on..
InventoryVal is a number
Number of Units is a number
Avg Price is a number
Category of Event is a word that is assigned by humans.
Even if I replace a category (for example, "books") by an id (say 1), that is still something I have assigned, and it's not something intrinsic to the data.
What is a good metric to represent that a product belongs to category "art" without artificially assigning anything?
Eghh.. too vague or loosely worded question?/
So as you might have guessed, there are entire ML libraries directed at this problem, but if you just want to get started, the simplest (and perhaps most common) approach is word frequency. In other words, you represent each word as a feature whose value is a function of the number of times that word occurs in each document.
But the most common words (a, and, the, this, etc.) are the most frequently occurring in ordinary text documents (e.g., email messages) yet are hardly the most important, so it is common to weight a word feature by the inverse of its frequency across documents (the idea behind TF-IDF).
So again, this is the simplest methodology (bag of words is how it's usually referred to); more sophisticated analyses (which are not always required) pre-process the individual words to categorize them, e.g., by part-of-speech.
If you like Python, I recommend NLTK (Natural Language Tool Kit), a mature and well-documented Python library. There are quite a few "getting started" tutorials; perhaps begin with the ones created by the NLTK contributors, which are referenced on the NLTK homepage. These tutorials usually rely on corpora (data sets) included in the base NLTK install.
If you are using an existing machine learning package, or a packaged machine learning algorithm, there may be a way to tell it that a particular field holds e.g. integers which are to be treated as identifiers, in which only comparisons for equality and inequality make sense. If not, if there are only a small number of distinct categories, it might make sense to replace a category field with 10 values with 10 binary fields, holding 1 if the object is in that particular category, or 0 if not (or 9 fields, with the object in the 10th category if all of them are 0).
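A minimal sketch of that binary-indicator (one-hot) encoding for a categorical field; the category list is just illustrative:

using System;
using System.Linq;

class OneHotEncoding
{
    static void Main()
    {
        string[] categories = { "books", "art", "music", "sports" };

        // Encode a categorical value as a vector of 0/1 indicator features,
        // so no artificial numeric ordering is imposed on the categories.
        double[] Encode(string value) =>
            categories.Select(c => c == value ? 1.0 : 0.0).ToArray();

        Console.WriteLine(string.Join(", ", Encode("art")));   // 0, 1, 0, 0
    }
}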

Categorizing Words and Category Values

We were set an algorithm problem in class today, as an "if you figure out a solution you don't have to do this subject" challenge. So of course, we all thought we would give it a go.
Basically, we were provided a DB of 100 words and 10 categories. There is no pre-existing mapping between the words and the categories; it's just a list of 100 words and a list of 10 categories.
We have to "place" the words into the correct category - that is, we have to "figure out" how to put the words into the correct category. Thus, we must "understand" the word, and then put it in the most appropriate category algorithmically.
i.e. one of the words is "fishing" and one of the categories is "sport" --> so "fishing" would go into that category. There is some overlap between words and categories such that some words could go into more than one category.
If we figure it out, we have to increase the sample size and the person with the "best" matching % wins.
Does anyone have ANY idea how to start something like this? Or any resources ? Preferably in C#?
Even a keyword DB or something might be helpful ? Anyone know of any free ones?
First of all you need sample text to analyze, to get the relationship of words.
A categorization with latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.
A different approach would be naive Bayes text categorization. Sample texts with assigned categories are needed. In a learning step the program learns the different categories and the likelihood that a word occurs in a text assigned to a category; see Bayesian spam filtering. I don't know how well that works with single words.
Really poor answer (demonstrates no "understanding") - but as a crazy stab you could hit google (through code) for (for example) "+Fishing +Sport", "+Fishing +Cooking" etc (i.e. cross join each word and category) - and let the google fight win! i.e. the combination with the most "hits" gets chosen...
For example (results first):
weather: fish
sport: ball
weather: hat
fashion: trousers
weather: snowball
weather: tornado
With code (TODO: add threading ;-p):
using System;
using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

static void Main() {
    string[] words = { "fish", "ball", "hat", "trousers", "snowball", "tornado" };
    string[] categories = { "sport", "fashion", "weather" };
    using (WebClient client = new WebClient()) {
        foreach (string word in words) {
            var bestCategory = categories.OrderByDescending(
                cat => Rank(client, word, cat)).First();
            Console.WriteLine("{0}: {1}", bestCategory, word);
        }
    }
}

static int Rank(WebClient client, string word, string category) {
    string s = client.DownloadString("http://www.google.com/search?q=%2B" +
        Uri.EscapeDataString(word) + "+%2B" +
        Uri.EscapeDataString(category));
    var match = Regex.Match(s, @"of about \<b\>([0-9,]+)\</b\>");
    int rank = match.Success ? int.Parse(match.Groups[1].Value, NumberStyles.Any) : 0;
    Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));
    return rank;
}
Maybe you are all making this too hard.
Obviously, you need an external reference of some sort to rank the probability that X is in category Y. Is it possible that he's testing your "out of the box" thinking and that YOU could be the external reference? That is, the algorithm is a simple matter of running through each category and each word and asking YOU (or whoever sits at the terminal) whether word X is in the displayed category Y. There are a few simple variations on this theme but they all involve blowing past the Gordian knot by simply cutting it.
Or not...depends on the teacher.
So it seems you have a couple options here, but for the most part I think if you want accurate data you are going to need to use some outside help. Two options that I can think of would be to make use of a dictionary search, or crowd sourcing.
In regards to a dictionary search, you could just go through the database, query it and parse the results to see if one of the category names is displayed on the page. For example, if you search "red" you will find "color" on the page and likewise, searching for "fishing" returns "sport" on the page.
Another, slightly more outside the box option would be to make use of crowd sourcing, consider the following:
Start by more or less randomly assigning name-value pairs.
Output the results.
Load the results up on Amazon Mechanical Turk (AMT) to get feedback from humans on how well the pairs work.
Input the results of the AMT evaluation back into the system along with the random assignments.
If everything was approved, then we are done.
Otherwise, retain the correct hits and process them to see if any pattern can be established, generate a new set of name-value pairs.
Return to step 3.
Granted this would entail some financial outlay, but it might also be one of the simplest and most accurate versions of the data you are going to get on a fairly easy basis.
You could do a custom algorithm to work specifically on that data, for instance words ending in 'ing' are verbs (present participle) and could be sports.
Create a set of categorization rules like the one above and see how high an accuracy you get.
EDIT:
Steal the wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category, and the category with the highest 'keyword density' of that word (e.g. fishing) wins.
This sounds like you could use some sort of Bayesian classification as it is used in spam filtering. But this would still require "external data" in the form of some sort of text base that provides context.
Without that, the problem is impossible to solve. It's not an algorithm problem, it's an AI problem. But even AI (and natural intelligence as well, for that matter) needs some sort of input to learn from.
I suspect that the professor is giving you an impossible problem to make you understand at what different levels you can think about a problem.
The key question here is: who decides what a "correct" classification is? What is this decision based on? How could this decision be reproduced programmatically, and what input data would it need?
I am assuming that the problem allows using external data, because otherwise I cannot conceive of a way to deduce the meaning from words algorithmically.
Maybe something could be done with a thesaurus database, and looking for minimal distances between 'word' words and 'category' words?
Fire this teacher.
The only solution to this problem is to already have the solution to the problem. Ie. you need a table of keywords and categories to build your code that puts keywords into categories.
Unless, as you suggest, you add a system which "understands" english. This is the person sitting in front of the computer, or an expert system.
If you're building an expert system and don't even know it, the teacher is not good at setting problems.
Google is forbidden, but they have almost a perfect solution - Google Sets.
Because you need to understand the semantics of the words, you need external data sources. You could try using WordNet. Or you could maybe try using Wikipedia - find the page for every word (or maybe only for the categories) and look for other words appearing on the page or on linked pages.
Yeah I'd go for the wordnet approach.
Check this tutorial on WordNet-based semantic similarity measurement. You can query Wordnet online at princeton.edu (google it) so it should be relatively easy to code a solution for your problem.
Hope this helps,
X.
Interesting problem. What you're looking at is word classification. While you can learn and use traditional information retrieval methods like LSA and categorization based on such - I'm not sure if that is your intent (if it is, then do so by all means! :)
Since you say you can use external data, I would suggest using wordnet and its link between words. For instance, using wordnet,
S: (n) fishing, sportfishing (the act of someone who fishes as a diversion)
    direct hypernym / inherited hypernym / sister term
    S: (n) outdoor sport, field sport (a sport that is played outdoors)
        direct hypernym / inherited hypernym / sister term
        S: (n) sport, athletics (an active diversion requiring physical exertion and competition)
What we see here is a list of relationships between words. The term fishing relates to outdoor sport, which relates to sport.
Now, if you get the drift - it is possible to use this relationship to compute a probability of classifying "fishing" under "sport" - say, based on the linear distance of the word-chain, or the number of occurrences, et al. (It should be trivial to find resources on how to construct similarity measures using WordNet. When the prof says "not to use Google", I assume he means programmatically and not as a means to get information to read up on!)
As for C# with wordnet - how about http://opensource.ebswift.com/WordNet.Net/
My first thought would be to leverage external data. Write a program that google-searches each word, and takes the 'category' that appears first/highest in the search results :)
That might be considered cheating, though.
Well, you can't use Google, but you CAN use Yahoo, Ask, Bing, Ding, Dong, Kong...
I would do a few passes. First query the 100 words against 2-3 search engines, grab the first y resulting articles (y being a threshold to experiment with; 5 is a good start I think) and scan the text. In particular I'll search for the 10 categories. If a category appears more than x times (x again being some threshold you need to experiment with), it's a match.
Based on that x threshold (i.e. how many times a category appears in the text) and how many of the top y pages it appears in, you can assign a weight to a word-category pair.
For better accuracy you can then do another pass with those non-Google search engines with the word-category pair (with an AND relationship) and apply the number of resulting pages to the weight of that pair. Then simply assume the word-category pair with the highest weight is the right one (assuming you'll even have more than one option). You can also assign a word to multiple categories if the weights are close enough (a z threshold maybe).
Based on that you can introduce any number of words and any number of categories. And you'll win your challenge.
I also think this method is good to evaluate the weight of potential adwords in advertising. but that's another topic....
Good luck
Harel
Use (either online, or download) WordNet, and find the number of relationships you have to follow between words and each category.
Use an existing categorized large data set such as RCV1 to train your system of choice. You could do worse than to start reading existing research and benchmarks.
Apart from Google there exist other 'encyclopedic' datasets you can build off, some of them hosted as public data sets on Amazon Web Services, such as a complete snapshot of the English-language Wikipedia.
Be creative. There is other data out there besides Google.
My attempt would be to use the toolset of CRM114 to provide a way to analyze a big corpus of text. Then you can utilize the matchings from it to give a guess.
My naive approach:
Create a huge text file like this (read the article for inspiration)
For every word, scan the text and, whenever you match that word, count the 'categories' that appear within N (a maximum, i.e. a radius) positions to the left and right of it.
The word is likely to belong in the category with the greatest counter.
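A minimal sketch of that windowed co-occurrence count, assuming the corpus is already tokenized in memory (the tiny corpus and the radius are just illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

class WindowCooccurrence
{
    // For one word, count how often each category name appears within `radius`
    // tokens of an occurrence of that word, and return the best-scoring category.
    static string Classify(string[] tokens, string word, string[] categories, int radius)
    {
        var counts = categories.ToDictionary(c => c, c => 0);
        for (int i = 0; i < tokens.Length; i++)
        {
            if (!tokens[i].Equals(word, StringComparison.OrdinalIgnoreCase)) continue;
            int lo = Math.Max(0, i - radius), hi = Math.Min(tokens.Length - 1, i + radius);
            for (int j = lo; j <= hi; j++)
                foreach (var c in categories)
                    if (tokens[j].Equals(c, StringComparison.OrdinalIgnoreCase)) counts[c]++;
        }
        return counts.OrderByDescending(kv => kv.Value).First().Key;
    }

    static void Main()
    {
        string corpus = "fishing is a popular sport and many people enjoy the sport of fishing each summer";
        string[] tokens = corpus.Split(' ');
        Console.WriteLine(Classify(tokens, "fishing", new[] { "sport", "fashion", "weather" }, 5));  // sport
    }
}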
Scrape delicious.com and search for each word, looking at collective tag counts, etc.
Not much more I can say about that, but delicious is old, huge, incredibly-heavily tagged and contains a wealth of current relevant semantic information to draw from. It would be very easy to build a semantics database this way, using your word list as a basis from scraping.
The knowledge is in the tags.
As you don't need to attend the subject if you solve this 'riddle', it's not supposed to be easy, I think.
Nevertheless I would do something like this (told in a very simplistic way):
Build up a neural network to which you give some input (an (e)book, or several (e)books)
=> no Google needed
This network classifies words (neural networks are great for 'unsure' classification). I think you may simply know which word belongs to which category because of the occurrences in the text ('fishing' is likely to be mentioned near 'sports').
After some training of the neural network it should "link" you the words to the categories.
You might be able to put use the WordNet database, create some metric to determine how closely linked two words (the word and the category) are and then choose the best category to put the word in.
You could implement a learning algorithm to do this using a monte carlo method and human feedback. Have the system randomly categorize words, then ask you to vote them as "match" or "not match." If it matches, the word is categorized and can be eliminated. If not, the system excludes it from that category in future iterations since it knows it doesn't belong there. This will get very accurate results.
This will work for the 100 word problem fairly easily. For the larger problem, you could combine this with educated guessing to make the process work faster. Here, as many people above have mentioned, you will need external sources. The google method would probably work the best, since google's already done a ton of work on it, but barring that you could, for example, pull data from your facebook account using the facebook apis and try to figure out which words are statistically more likely to appear with previously categorized words.
Either way, though, this cannot be done without some kind of external input that at some point came from a human. Unless you want to be cheeky and, for example, define the categories by some serialized value contained in the ascii text for the name :P
