Toy example: I have an algorithm that generates cake recipes by weight, but in order to make them look clean on a website the ingredients need to be rounded to the nearest 50g. The total weight of the cake mix is 1kg.
For example, a simple algorithm might be to sort the ingredients by their residual difference from the nearest quantized level, round each in turn, and then set the last ingredient to make up the difference to the constraint:
cake_1 = {flour: 322g, egg: 100g, sugar: 213g, milk: 189g, butter: 164g, cocoa: 12g}
cake_1_rounded = {egg: 100g, milk: 200g, cocoa: 0g, sugar: 200g, butter: 150g, flour: 350g}
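In code, the simple scheme above might look something like this (a minimal Python sketch; the dict layout and the 50g/1kg figures come from the example, the function name and everything else are assumptions):

def round_recipe(recipe, step=50, total=1000):
    """Round all but one ingredient to the nearest step; the last one absorbs the residual."""
    def residual(item):
        _, grams = item
        return abs(grams - step * round(grams / step))
    # round the ingredients with the smallest residuals first
    ordered = sorted(recipe.items(), key=residual)
    rounded = {name: step * round(grams / step) for name, grams in ordered[:-1]}
    last_name = ordered[-1][0]
    rounded[last_name] = total - sum(rounded.values())  # makes up the difference to the constraint
    return rounded

cake_1 = {"flour": 322, "egg": 100, "sugar": 213, "milk": 189, "butter": 164, "cocoa": 12}
print(round_recipe(cake_1))  # reproduces cake_1_rounded, with flour absorbing the rounding error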
However, unless the residual distribution is mean zero, this could dramatically affect the value of the last ingredient, so we may wish to distribute the additional weight in a more optimal manner.
Is this a standard problem? If so, what is its name? Are there objectively better approaches than the simple scheme I described here?
I've been looking into the k nearest neighbors algorithm as I might be developing an application that matches fighters (boxers) in the near future.
The reason for my question is to figure out which would be the best approach/algorithm to use when matching fighters based on multiple parameters and constraints depending on the rule-set.
The relevant properties of each fighter are the following:
Age (fighters will be assigned to an age group: 15, 17, 19, elite)
Weight
Number of fights
Now there are some rulesets for what can be allowed when matching fighters:
A maximum of 2 years in between the fighters (unless it's elite)
A maximum of 3 kilos difference in weight
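For concreteness, those two rules could be sketched as a compatibility check before any matching is attempted (a minimal Python sketch; the field names and the elite handling are assumptions on my part):

from collections import namedtuple

Fighter = namedtuple("Fighter", ["age", "age_group", "weight", "fights"])

def can_match(a, b):
    """True if two fighters may be paired under the rules above."""
    # age rule: at most 2 years apart, unless both are in the elite group
    if not (a.age_group == "elite" and b.age_group == "elite"):
        if abs(a.age - b.age) > 2:
            return False
    # weight rule: at most 3 kilos apart
    return abs(a.weight - b.weight) <= 3

print(can_match(Fighter(16, "17", 61.0, 4), Fighter(17, "17", 63.5, 6)))  # True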
Now obviously the perfect outcome would be one where all the attendees get matched with another boxer that fits within the ruleset.
And the main priority is to match as many fighters with each other as possible.
Is K-nn the way to go or is there a better approach?
If so which?
This is too long for a comment.
For best results with K-nn, I would suggest principal components. These allow you to use many more dimensions and do a pretty good job of spreading the data through the space, to get a good neighborhood.
As for incorporating existing rules, you have two choices. Probably, the best way is to build them into your distance function. Alternatively, you can take a large neighborhood and build them into the combination function.
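A rough sketch of that suggestion (assuming scikit-learn; the fighter rows are made up): project the fighter features onto principal components, then run the neighbour search on the projected data.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# rows: fighters as [age, weight, number_of_fights] (made-up data)
X = np.array([[17, 60.0, 4],
              [18, 62.5, 6],
              [16, 59.0, 2],
              [17, 61.0, 5]], dtype=float)

# spread the data along its principal components before the neighbourhood search
Z = PCA(n_components=2).fit_transform(X)

nn = NearestNeighbors(n_neighbors=2).fit(Z)
distances, indices = nn.kneighbors(Z)
print(indices)  # for each fighter: itself plus its closest candidate opponent

A rule-aware distance (as discussed above) could then replace the default Euclidean metric via the metric parameter of NearestNeighbors.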
I would go with k-Nearest Neighbor search. Since your dataset is in a low-dimensional space (i.e. 3 dimensions), I would use CGAL to perform the task.
Now, the only thing you have to do is to create a distance function like this:
#include <cmath>    // std::abs for floating point
#include <cstdlib>  // std::abs for integers
#include <limits>   // std::numeric_limits
float boxers_dist(Boxer a, Boxer b) {
    // disallow pairings more than 2 years or more than 3 kg apart (per the rules above)
    if (std::abs(a.year - b.year) > 2 || std::abs(a.weight - b.weight) > 3)
        return std::numeric_limits<float>::infinity();
    // think how you should use the 3 dimensions you have, to compute distance
    return std::abs(a.weight - b.weight);  // placeholder: weight difference only
}
And you are done...now go fight!
Data Structure:
User has many Profiles
(Limit - no more than one of each profile type per user, no duplicates)
Profiles has many Attribute Values
(A user can have as many or few attribute values as they like)
Attributes belong to a category
(No overlap. This controls which attribute values a profile can have)
Example/Context:
I believe with Stack Exchange you can have many profiles for one user, as they differ per exchange site? In this problem:
Profile: Video, so a Video profile only contains Attributes of the Video category
Attribute: an Attribute in the Video category may be Genre
Attribute Values: e.g. Comedy, Action, Thriller are all Attribute Values of Genre
Profiles and Attributes are just ways of grouping Attribute Values on two levels.
Without grouping (which is needed for weighting in 2. onwards), the relationship is just User hasMany Attribute Values.
Problem:
Give each user a similarity rating against each other user, based on all Attribute Values associated with that user.
1. Flat/one level
Unequal number of attribute values between two users
An attribute value can only be selected once per user, so no duplicates
Therefore, binary string/boolean array with Cosine Similarity?
2. (1) + Weight Profiles
Give each profile a weight (totaling 1?)
Work out profile similarity, then multiply by weight, and sum?
3. (1) + Weight Attribute Categories and Profiles
As an attribute belongs to a category, categories can be weighted
Similarity per category, weighted sum, then the same by profile?
Or merge profile and category weights
4. (3) + Distance between every attribute value
Table of similarity distances for every possible value vs. value
Rather than similarity by value === value
'Close' attribute values contribute to overall similarity
No idea how to do this one
Fancy code and useful functions are great, but I'm really looking to fully understand how to achieve these tasks, so I think generic pseudocode is best.
Thanks!
First of all, you should remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and to reason about, so I would try to omit it as much as possible. This means that it's highly preferable to keep a direct relation between users and attributes. So, basically your users should be represented as vectors, where each variable (vector element) represents a single attribute.
If you choose such representation, make sure all attributes make sense and have appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algos) will treat them incorrectly (e.g. multiply thriller, represented as 2, with comedy, represented as 5, which makes no sense actually).
It's OK to use distance between attributes when applicable, though I can hardly come up with an example in your setting.
At this point you should stop reading and try it out: simple representation of users as vector of attributes and cosine similarity. If it works well, leave it as is - overcomplicating a model is never good.
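A minimal sketch of that baseline (plain Python; the attribute vocabulary and the user attribute sets are made-up examples):

import math

ALL_ATTRIBUTES = ["comedy", "action", "thriller", "documentary"]  # hypothetical vocabulary

def user_vector(user_attribute_values):
    """One 0/1 entry per known attribute value."""
    return [1.0 if a in user_attribute_values else 0.0 for a in ALL_ATTRIBUTES]

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0

u1 = user_vector({"comedy", "action"})
u2 = user_vector({"comedy", "thriller"})
print(cosine(u1, u2))  # 0.5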
And if the model performs badly, try to understand why. Do you have enough relevant attributes? Or are there too many noisy variables that only make it worse? Or should some attributes really have larger importance than others? Depending on these questions, you may want to:
Run feature selection to avoid noisy variables.
Transform your variables, representing them in some other "coordinate system". For example, instead of using N variables for N video genres, you may use M other variables to represent closeness to specific social group. Say, 1 for "comedy" variable becomes 0.8 for "children" variable, 0.6 for "housewife" and 0.9 for "old_people". Or anything else. Any kind of translation that seems more "correct" is ok.
Use weights. Not weights for categories or profiles, but weights for distinct attributes. But don't set these weights yourself; instead, run a linear regression to find them.
Let me describe the last point in a bit more detail. Instead of the simple cosine similarity, whose core (ignoring normalization by the vector norms) looks like this:
cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]
you may use a weighted version:
cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[n]*x[n]*y[n]
The standard way to find such weights is to use some kind of regression (linear regression is the most popular). Normally, you collect a dataset (X, y) where X is a matrix with your data vectors as rows (e.g. details of houses being sold) and y is some kind of "correct answer" (e.g. the actual price each house was sold for). However, in your case there's no correct answer for user vectors; in fact, you can define a correct answer for their similarity only. So why not? Just make each row of X a combination of 2 user vectors, and the corresponding element of y the similarity between them (which you assign yourself for a training dataset). E.g.:
X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75 // or whatever you assign to it
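A minimal sketch of that regression step (assuming NumPy and scikit-learn; the user vectors and the hand-assigned similarities are made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# user vectors (one row per user) and hand-labelled similarities for a few pairs
users = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1]], dtype=float)
pairs = [(0, 1), (0, 2), (1, 2)]
y = np.array([0.75, 0.40, 0.20])  # similarities you assign yourself

# each training row is the element-wise product of the two user vectors
X = np.array([users[i] * users[j] for i, j in pairs])

reg = LinearRegression(fit_intercept=False).fit(X, y)
w = reg.coef_  # per-attribute weights

def weighted_similarity(u, v):
    return float(np.dot(w, u * v))

print(weighted_similarity(users[0], users[1]))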
HTH
Suppose we have buyers and sellers that are trying to find each other in a market. Buyers can tag their needs with keywords; sellers can do the same for what they are selling. I'm interested in finding algorithm(s) that rank-order sellers in terms of their relevance for a particular buyer on the basis of their two keyword sets.
Here is an example:
buyer_keywords = {"furry", "four legs", "likes catnip", "has claws"}
and then we have two potential sellers that we need to rank order in terms of their relevance:
seller_keywords[1] = {"furry", "four legs", "arctic circle", "white"}
seller_keywords[2] = {"likes catnip", "furry",
"hates mice", "yarn-lover", "whiskers"}
If we just use the intersection of keywords, we do not get much discrimination: both intersect on 2 keywords. If we divide the intersection count by the size of the set union, seller 2 actually does worse because of the greater number of keywords. This would seem to introduce an automatic penalty for any method that does not correct for keyword set size (and we definitely don't want to penalize adding keywords).
To put a bit more structure on the problem, suppose we have some truthful measure of intensity of keyword attributes (which have to sum to 1 for each seller), e.g.,:
seller_keywords[1] = {"furry":.05,
"four legs":.05,
"arctic circle":.8,
"white":.1}
seller_keywords[2] = {"likes catnip":.5,
"furry":.4,
"hates mice":.02,
"yarn-lover":.02,
"whiskers":.06}
Now we could sum up the value of hits: so now Seller 1 only gets a score of .1, while Seller 2 gets a score of .9. So far, so good, but now we might get a third seller with a very limited, non-descriptive keyword set:
seller_keywords[3] = {"furry":1}
This catapults them to the top for any hit on their sole keyword, which isn't good.
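A minimal sketch of those scoring variants (plain Python, using the example data above; it reproduces the .1, .9 and 1.0 intensity scores and the Jaccard penalty for seller 2):

buyer = {"furry", "four legs", "likes catnip", "has claws"}
sellers = {
    1: {"furry": .05, "four legs": .05, "arctic circle": .8, "white": .1},
    2: {"likes catnip": .5, "furry": .4, "hates mice": .02, "yarn-lover": .02, "whiskers": .06},
    3: {"furry": 1.0},
}

def jaccard(buyer_kws, seller_kws):
    keys = set(seller_kws)
    return len(buyer_kws & keys) / len(buyer_kws | keys)

def intensity_score(buyer_kws, seller_kws):
    return sum(v for k, v in seller_kws.items() if k in buyer_kws)

for sid, kws in sellers.items():
    print(sid, round(jaccard(buyer, kws), 3), round(intensity_score(buyer, kws), 3))
# seller 3 tops the intensity score on the strength of a single, uninformative keyword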
Anyway, my guess (and hope) is that this is a fairly general problem and that there exist different algorithmic solutions with known strengths and limitations. This is probably something covered in CS101, so I think a good answer to this question might just be a link to the relevant references.
I think you're looking to use cosine similarity; it's a basic technique that gets you pretty far as a first hack. Intuitively, you create a vector where every tag you know about has a particular index:
terms[0] --> aardvark
terms[1] --> anteater
...
terms[N] --> zuckerberg
Then you create vectors in this space for each person:
person1[0] = 0 # this person doesn't care about aardvarks
person1[1] = 0.05 # this person cares a bit about anteaters
...
person1[N] = 0
Each person is now a vector in this N-dimensional space. You can then use cosine similarity to calculate similarity between pairs of them. Computationally, this is basically the same as asking for the angle between the two vectors. You want a cosine close to 1, which means that the vectors are roughly collinear -- that they have similar values for most of the dimensions.
To improve this metric, you may want to use tf-idf weighting on the elements in your vector. Tf-idf will downplay the importance of popular terms (e.g, 'iPhone') and promote the importance of unpopular terms that this person seems particularly associated with.
Combining tf-idf weighting and cosine similarity does well for most applications like this.
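A minimal sketch of that combination (assuming scikit-learn, treating each party's keyword set as a short "document"; the tag strings are adapted from the earlier example):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

buyer = "furry four-legs likes-catnip has-claws"
sellers = [
    "furry four-legs arctic-circle white",
    "likes-catnip furry hates-mice yarn-lover whiskers",
]

# tf-idf weights each tag by how rare it is across all parties
matrix = TfidfVectorizer().fit_transform([buyer] + sellers)

scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
ranking = sorted(zip(scores, range(1, len(sellers) + 1)), reverse=True)
print(ranking)  # sellers ordered by relevance to the buyer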
What you are looking for is called taxonomy: tagging content and ordering it by relevance.
You may not find a ready-to-go algorithm, but you can start with a practical case: the Drupal documentation for taxonomy provides some guidelines, and check the sources of its search module.
Basically, the ranking is based on term frequency. If a product is defined with a small number of tags, each of them will carry more weight. A tag which only appears on a few products' pages means that it is very specific. You shouldn't have your words' intensity defined in a static way; examine them in their context.
Regards
I have a bit of a problem with my project for the university.
I have to implement document classification using genetic algorithm.
I've had a look at this example and (let's say) understood the principles of genetic algorithms, but I'm not sure how they can be applied to document classification. I can't figure out the fitness function.
Here is what I've managed to think of so far (it's probably totally wrong...):
Accept that I have the categories and each category is described by some keywords.
Split the file to words.
Create the first population from arrays (100 arrays, for example, but it will depend on the size of the file) filled with random words from the file.
1:
Choose the best category for each child in the population (by counting the keywords in it).
Cross over each pair of children in the population (a new array containing half of each child) - "crossover"
Fill the rest of the children left over from the crossover with random unused words from the file - "evolution??"
Replace random words in a random child from the new population with a random word from the file (used or not) - "mutation"
Copy the best results to the new population.
Go to 1 until some population limit is reached or some category is found enough times
I'm not sure if this is correct and would be happy to have some advice, guys.
Much appreciate it!
Ivane, in order to properly apply GAs to document classification:
You have to reduce the problem to a system of components that can be evolved.
You can't do GA training for document classification on a single document.
So the steps that you've described are on the right track, but I'll give you some improvements:
Have a sufficient amount of training data: you need a set of documents which are already classified and are diverse enough to cover the range of documents which you're likely to encounter.
Train your GA to correctly classify a subset of those documents, aka the Training Data Set.
At each generation, test your best specimen against a Validation Data Set and stop training if the validation accuracy starts to decrease.
So what you want to do is:
prevValidationFitness = default;
currentValidationFitness = default;
bestGA = default;
// Randomly generate the initial population of GAs
population[] = randomlyGenerateGAs();
do
{
    prevValidationFitness = currentValidationFitness;
    // Train your population on the training data set
    bestGA = Train(population);
    // Get the validation fitness of the best GA
    currentValidationFitness = Validate(bestGA);
    // Make your selection (e.g. half of the population, roulette wheel selection, or random selection)
    selection[] = makeSelection(population);
    // Mate the specimens in the selection (each mating involves a crossover and possibly a mutation)
    population = mate(selection);
} while( currentValidationFitness.IsBetterThan( prevValidationFitness ) );
Whenever you get a new document (one which has not been classified before), you can now classify it with your best GA:
category = bestGA.Classify(document);
So this is not the end-all-be-all solution, but it should give you a decent start.
Regards,
Kiril
You might find Learning Classifier Systems useful/interesting. An LCS is a type of evolutionary algorithm intended for classification problems. There is a chapter about them in Eiben & Smith's Introduction to Evolutionary Computing.
I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an algorithm that would give me a score ranking which of the categories each requirement is likely to fall into. The results would be used as a starting point to further categorize the requirements.
As an example, suppose I have the requirement:
The system shall apply deposits to a customer's specified account.
And categories/keywords:
Customer Transactions: deposits, deposit, customer, account, accounts
Balance Accounts: account, accounts, debits, credits
Other Category: foo, bar
I would want the algorithm to score the requirement highest in category 1, lower in category 2, and not at all in category 3. The scoring mechanism is mostly irrelevant to me, but needs to convey how much more likely category 1 applies than category 2.
I'm new to NLP, so I'm kind of at a loss. I've been reading Natural Language Processing in Python and was hoping to apply some of the concepts, but haven't seen anything that quite fits. I don't think a simple frequency distribution would work, since the text I'm processing is so small (a single sentence.)
You might want to look at the category of "similarity measures" or "distance measures" (which is different, in data-mining lingo, from "classification").
Basically, a similarity measure is a way in math you can:
Take two sets of data (in your case, words)
Do some computation/equation/algorithm
The result being that you have some number which tells you how "similar" that data is.
With similarity measures, this number is a number between 0 and 1, where "0" means "nothing matches at all" and "1" means "identical"
So you can actually think of your sentence as a vector - and each word in your sentence represents an element of that vector. Likewise for each category's list of keywords.
And then you can do something very simple: take the "cosine similarity" or "Jaccard index" (depending on how you structure your data.)
What both of these metrics do is they take both vectors (your input sentence, and your "keyword" list) and give you a number. If you do this across all of your categories, you can rank those numbers in order to see which match has the greatest similarity coefficient.
As an example:
From your question:
Customer Transactions: deposits,
deposit, customer, account, accounts
So you could construct a vector with 5 elements: (1, 1, 1, 1, 1). This means that, for the "Customer Transactions" category, you have 5 keywords, and (this will sound obvious, but) each of those words is present in its own keyword list. Bear with me.
So now you take your sentence:
The system shall apply deposits to a
customer's specified account.
This has 3 words from the "Customer Transactions" set: {deposits, customer, account}
(actually, this illustrates another nuance: you actually have "customer's". Is this equivalent to "customer"?)
The vector for your sentence might be (1, 0, 1, 1, 0)
The 1's in this vector are in the same position as the 1's in the first vector - because those words are the same.
So we could say: how many times do these vectors differ? Let's compare:
(1,1,1,1,1)
(1,0,1,1,0)
Hm. They have the same "bit" 3 times - in the 1st, 3rd, and 4th positions. They only differ by 2 bits. So let's say that when we compare these two vectors, we have a "distance" of 2. Congrats, we just computed the Hamming distance! The lower your Hamming distance, the more "similar" the data.
(The difference between a "similarity" measure and a "distance" measure is that the former is normalized - it gives you a value between 0 and 1. A distance is just any number, so it only gives you a relative value.)
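A minimal sketch of that comparison (plain Python, using the two example vectors above):

category = [1, 1, 1, 1, 1]   # the "Customer Transactions" keyword vector
sentence = [1, 0, 1, 1, 0]   # which of those keywords the requirement contains

# Hamming distance: number of positions where the vectors differ
hamming = sum(1 for c, s in zip(category, sentence) if c != s)

# Jaccard index: shared 1s divided by positions where either vector has a 1
shared = sum(1 for c, s in zip(category, sentence) if c == 1 and s == 1)
either = sum(1 for c, s in zip(category, sentence) if c == 1 or s == 1)
jaccard = shared / either

print(hamming, jaccard)  # 2 and 0.6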
Anyway, this might not be the best way to do natural language processing, but for your purposes it is the simplest and might actually work pretty well for your application, or at least as a starting point.
(PS: "classification" - as you have in your title - would be answering the question "If you take my sentence, which category is it most likely to fall into?" Which is a bit different than saying "how much more similar is my sentence to category 1 than category 2?" which seems to be what you're after.)
good luck!
The main characteristics of the problem are:
Externally defined categorization criteria (keyword list)
Items to be classified (lines from the requirements document) are made of a relatively small number of attribute values, from effectively a single dimension: "keyword".
As defined, no feedback/calibration (although it may be appropriate to suggest some of that)
These characteristics bring both good and bad news: the implementation should be relatively straightforward, but a consistent level of accuracy of the categorization process may be hard to achieve. Also, the small size of the various quantities (number of possible categories, max/average number of words in an item, etc.) should give us room to select solutions that may be CPU- and/or space-intensive, if need be.
Yet, even with this license to "go fancy", I suggest starting with (and staying close to) a simple algorithm, and expanding on this basis with a few additions and considerations, while remaining vigilant of the ever-present danger called overfitting.
Basic algorithm (conceptual, i.e. no focus on performance tricks at this time)
Parameters =
CatKWs = an array/hash of lists of strings. The list contains the possible
keywords, for a given category.
usage: CatKWs[CustTx] = ('deposits', 'deposit', 'customer' ...)
NbCats = integer number of pre-defined categories
Variables:
CatAccu = an array/hash of numeric values with one entry per each of the
possible categories. usage: CatAccu[3] = 4 (if array) or
CatAccu['CustTx'] += 1 (hash)
TotalKwOccurences = counts the total number of keywords matches (counts
multiple when a word is found in several pre-defined categories)
Pseudo code: (for categorizing one input item)
1. for x in 1 to NbCats
CatAccu[x] = 0 // reset the accumulators
2. for each word W in Item
for each x in 1 to NbCats
if W found in CatKWs[x]
TotalKwOccurences++
CatAccu[x]++
3. for each x in 1 to NbCats
CatAccu[x] = CatAccu[x] / TotalKwOccurences // calculate rating (skip if TotalKwOccurences is 0)
4. Sort CatAccu by value
5. Return the ordered list of (CategoryID, rating)
for all corresponding CatAccu[x] values above a given threshold.
Simple but plausible: we favor the categories that have the most matches, but we divide by the overall number of matches, as a way of lessening the confidence rating when many words were found. Note that this division does not affect the relative ranking of categories for a given item, but it may be significant when comparing the ratings of different items.
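A minimal sketch of this basic algorithm in Python (the category/keyword data comes from the question; the crude word normalization and everything else are assumptions):

CAT_KWS = {
    "Customer Transactions": {"deposits", "deposit", "customer", "account", "accounts"},
    "Balance Accounts": {"account", "accounts", "debits", "credits"},
    "Other Category": {"foo", "bar"},
}

def categorize(item, threshold=0.0):
    """Return (category, rating) pairs sorted by rating, per the pseudocode above."""
    cat_accu = {cat: 0 for cat in CAT_KWS}
    total_kw_occurrences = 0
    for raw in item.lower().replace("'s", "").split():
        word = raw.strip(".,;:")          # crude normalization
        for cat, kws in CAT_KWS.items():
            if word in kws:
                total_kw_occurrences += 1
                cat_accu[cat] += 1
    if total_kw_occurrences:
        cat_accu = {cat: n / total_kw_occurrences for cat, n in cat_accu.items()}
    ranked = sorted(cat_accu.items(), key=lambda kv: kv[1], reverse=True)
    return [(cat, rating) for cat, rating in ranked if rating > threshold]

print(categorize("The system shall apply deposits to a customer's specified account."))
# [('Customer Transactions', 0.75), ('Balance Accounts', 0.25)]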
Now, several simple improvements come to mind: (I'd seriously consider the first two, and give thought to the other ones; deciding on each of these is very much tied to the scope of the project, the statistical profile of the data to be categorized and other factors...)
We should normalize the keywords read from the input items and/or match them in a fashion that is tolerant of misspellings. Since we have so few words to work with, we need to ensure we do not lose a significant one because of a silly typo.
We should give more importance to words found less frequently in CatKWs. For example, the word 'account' should count less than the word 'foo' or 'credit'.
We could (but maybe that won't be useful or even helpful) give more weight to the ratings of items that have fewer [non-noise] words.
We could also include consideration based on digrams (two consecutive words), since with natural languages (and requirements documents are not quite natural :-) ) word proximity is often a stronger indicator than the words themselves.
We could add a tiny bit of importance to the category assigned to the preceding (or even following, with look-ahead logic) item. Items will likely come in related series, and we can benefit from this regularity.
Also, aside from the calculation of the rating per se, we should also consider:
some metrics that would be used to rate the algorithm outcome itself (tbd)
some logic to collect the list of words associated with an assigned category and to eventually run statistics on these. This may allow the identification of words representative of a category and not initially listed in CatKWs.
The question of metrics should be considered early, but this would also require a reference set of input items: a "training set" of sorts, even though we are working off a pre-defined category-keyword dictionary (typically, training sets are used to determine this very list of category keywords, along with a weight factor). Of course, such a reference/training set should be both statistically significant and statistically representative [of the whole set].
To summarize: stick to simple approaches; the context doesn't leave room to be very fancy anyway. Consider introducing a way of measuring the efficiency of particular algorithms (or of particular parameters within a given algorithm), but beware that such metrics may be flawed and prompt you to specialize the solution for a given set to the detriment of the other items (overfitting).
I was also facing the same issue of creating a classifier based only on keywords. I had a class-keyword mapper file, which contained the class variable and the list of keywords occurring in each class. I came up with the following algorithm, and it is working really well.
# predictor algorithm
# docKywrdmppr holds one row per class: 'Document Type' and a 'Keywords' string;
# removeStopWords() is a helper that splits a text and drops stop words.
predictedDoc = []
for doc in readContent:
    catAccum = [0] * len(docKywrdmppr)   # one accumulator per class
    for i in range(len(docKywrdmppr)):
        keywords = removeStopWords(docKywrdmppr['Keywords'][i].casefold())
        for word in removeStopWords(doc):
            if word.casefold() in keywords:
                catAccum[i] += 1         # count each keyword hit
    ind = catAccum.index(max(catAccum))  # class with the most hits
    predictedDoc.append(docKywrdmppr['Document Type'][ind])
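For context, a minimal sketch of the data shapes the snippet above assumes (pandas-style mapper; all names and contents here are hypothetical):

import pandas as pd

# one row per class: its label and a space-separated keyword string
docKywrdmppr = pd.DataFrame({
    'Document Type': ['Customer Transactions', 'Balance Accounts'],
    'Keywords': ['deposits deposit customer account', 'account debits credits'],
})

readContent = ["The system shall apply deposits to a customer account"]

def removeStopWords(text):
    stop = {'the', 'to', 'a', 'shall'}
    return [w for w in text.split() if w.lower() not in stop]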