Finding most distinct elements in a set - algorithm

Say we have a perfume shop that has 100 different perfumes.
Let's say 10,000 customers come in and rate each perfume from one to five stars.
Let's say the question is: how best to construct a pack of 5 perfumes so that 95% of customers will give a 4+ star rating to at least one of them?
How to do this algorithmically?
NOTE: I can see that even the question isn't properly formed; there's no guarantee that such a construction even exists. There is a trade-off between the two parameters (the rating threshold and the fraction of customers covered).
NOTE: Also (and this makes the perfume analogy slightly artificial), it doesn't matter whether we get one good match or three good matches. So {4.3, 0, 0, 0, 0} would be equivalent to {4.3, 4.2, 4.2, 4.2, 4.2} -- in both cases the score is 4.3.
Let's say, for the sake of argument, that perfumes 0-19 are sweet, perfumes 20-39 are sour, etc. (similarly salty, bitter, umami).
So there would be a very high cross-correlation among perfumes 0-19.
If you modelled this with 100 points in space, then points 0-19 would all attract each other very strongly and form a cluster.
Similarly you would get 4 other clusters for the other four tastes.
So from just one metric, we have separated out 5 distinct flavours.
But does this technique extend?
PS: just giving the names of related techniques would be very helpful, as this would allow me to Google for further information. So any answer that just restates the question in industry-accepted terminology would be useful!

This algorithm should find a solution to the problem:
Order the perfumes by the number of customers giving a 4+ rating
Choose the first perfume not considered yet from the list.
Delete the ratings from the customers now satisfied.
Repeat the process for perfumes 2 - 5 in the pack.
Backtrack when necessary to obtain a selection satisfying the criterion.

The true problem is NP-hard, but you can make use of a greedy algorithm:
1. Let C be the whole set of your customers.
2. Assign to each perfume a coverage: the number of customers in C that gave that perfume a 4+ rating.
3. Sort by descending coverage and choose the top perfume. If C is empty or all coverages are zero, choose a perfume at random (actually, if C is nonempty but smaller than 5% of the original, your requirement is already met).
4. Remove from C all customers (not just ratings) satisfied by the perfume just chosen.
5. Repeat from step 2 until you have 5 perfumes.
This automatically takes care of taste clustering: a customer giving high marks to sweet perfumes will be satisfied by the most voted sweet perfume, and he will then be struck out from C, all his further ratings ignored, and the algorithm will proceed to satisfy other customers.
Also, you should notice that even if you can't satisfy the requirement (95%, 4+) with five perfumes, perfume similarity will ensure that this algorithm maximizes both the coverage and the marks - so you might end up with, say, (93%, 3.9).
Also, suppose that 10% of users do not give any marks above 3. Then there's no way you can 4-satisfy 95% of customers, since 10% of the total are at most 3-satisfiable. You might want to build C only from customers who actually gave at least one 4+ rating.
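For concreteness, here is a minimal Python sketch of the greedy loop above (my own illustration; the ratings structure - a dict mapping each customer to a dict of perfume -> stars - is an assumption):

def greedy_pack(ratings, pack_size=5, threshold=4):
    """Greedy maximum-coverage sketch: repeatedly pick the perfume that
    satisfies the most still-uncovered customers."""
    uncovered = set(ratings)  # C: customers not yet satisfied
    perfumes = {p for stars in ratings.values() for p in stars}
    pack = []
    for _ in range(pack_size):
        # coverage = number of uncovered customers rating p at threshold+
        best = max(perfumes - set(pack),
                   key=lambda p: sum(ratings[c].get(p, 0) >= threshold
                                     for c in uncovered))
        pack.append(best)
        uncovered = {c for c in uncovered
                     if ratings[c].get(best, 0) < threshold}
    return pack, uncovered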
Or you could change the algorithm: instead of the criterion in your question, use a knapsack-style objective and take home the highest cumulative rating. This also raises the likelihood of a customer being satisfied by the overall package (as it stands, he is almost guaranteed to very much like one perfume, but he might strongly dislike the other four).

Related

How to give players a score on a ranking/prediction task?

I have a website built with php/mysql, and I am looking for help in communicating to a Programmer what I want him to do with a Poll/Prediction game that I am trying to create.
For purposes of discussion, assume a game where perhaps 100 players try to predict the top 5 finishers in a Golf Tournament of perhaps 9 Golfers.
I am looking for help in how to create and assign a score based upon the accuracy of prediction.
The players provide a rank ordering using a drag-and-drop function to order the golfers from 1 through 5. This ordering has already been coded, and the ranks are stored somehow in the DB (I do not know how).
My initial thinking is to ask the coder to create a script which will assign a score from 1 to 5 for each Golfer that the player nominated to be in the Top 5.
So, a player who predicted perfectly would be awarded a perfect score of 12345.
His first golfer received a 1 for finishing first, second a 2 for finishing second, third golfer receives a 3 for finishing third, and so on.
Anybody less than perfect would have a score higher than 12345.
Players who got the first four positions correct would have to be differentiated on the basis of the finish of their fifth Golfer.
So, one might score 12347 and the other 12348 and the player with the highest score (12348) would be the loser in a matchup of the two players.
A player who did poorly, might have a score of 53419.
Question:
Is this a viable way of creating a score which the players of my game can be ranked upon?
Is it possible to instead simply have something like a Spearman rank-order correlation calculated, comparing the actual finish positions with the predicted finish positions for each player, and then rank players on the basis of their correlation coefficients?
Thanks for any help in clarifying how to conceptualize this before approaching a programmer who gets annoyed when I don't really know what I want him to do ahead of time.
It's quite an interesting problem.
It seems that there are three components that need to be considered in the scoring: the number of correct predictions, the order of correct predictions, and the weight of correct predictions.
For example, assume the truth is:
1,5,10,15,20
Here are some predictions:
1,6,7,8,9 : only predicted first one
2,1,10,21,30 : 1 and 10, but the order of 1 is incorrect
20,15,1,5,30 : hit four in the top 5, but the orders are incorrect
It depends on what you value most. You might first check how many of the top 5 the user has predicted, add a value for each, and then penalize wrong orders. The weight for each position should also be different, so that
1,5,10,15,20 ranks higher than 1,5,10,20,15, which ranks higher than 1,10,5,20,15.
Spearman might work, but I feel it could be too coarse for your purpose.
This is actually a problem very similar to the one search engines face. E.g., in search engine evaluation, the reference outcomes are preferred results provided by humans, and the predicted outcomes are the results delivered by the search engine. In both your task and for search engines, I'd guess you care a lot more about the accuracy of the winner than the accuracy of the 5th-place finisher. If that is the case, then the mean average precision is probably a good measure.
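If mean average precision sounds right, the per-player computation is small; a sketch (my own illustration, not tied to any library):

def average_precision(predicted, actual):
    """Average precision: correct picks near the top of the list count more."""
    actual = set(actual)
    hits, score = 0, 0.0
    for i, golfer in enumerate(predicted, start=1):
        if golfer in actual:
            hits += 1
            score += hits / i  # precision at this cut-off
    return score / len(actual)

# A perfect prediction scores 1.0; a single correct pick in first place
# scores 0.2 here, while the same pick in fifth place scores only 0.04.
print(average_precision([1, 5, 10, 15, 20], [1, 5, 10, 15, 20]))  # 1.0

Averaging this value over all players gives the mean average precision.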

Rating Algorithm

I'm trying to develop a rating system for an application I'm working on. Basically, the app allows you to rate an object from 1 to 5 (represented by stars). But I of course know that keeping every individual rating and re-adding the numbers each time is not feasible.
So the first thing that came to mind was dividing each received rating by the total number of ratings given. For example, if the object received a rating of 2 from a user and has been rated 100 times, maybe add 2/100. However, I believe this method is not good enough, since 1) it is a naive approach, and 2) in order to get the number of times the object has been rated I have to do a DB lookup, which might end up having time complexity O(n).
So I was wondering what alternative and possibly better ways to approach this problem?
You can keep 2 additional values in the DB - the number of times the object was rated and the total sum of all ratings. This way, to update the object's rating you only need to:
Add the new rating to the total sum.
Divide the total sum by the total number of times it was rated.
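A minimal sketch of that idea (the class and field names are mine); both operations are O(1), with no rescan of past ratings:

class RatingTracker:
    """Store only two numbers per object: rating count and rating sum."""
    def __init__(self, count=0, total=0):
        self.count = count   # number of times rated
        self.total = total   # sum of all ratings given
    def add_rating(self, stars):
        self.count += 1
        self.total += stars
    def average(self):
        return self.total / self.count if self.count else 0.0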
There are many approaches to this, but before choosing one, check:
Whether all feedback givers are treated as equal or some have more weight than others (like a panel review, etc.)
Whether the objective is to provide only an average, or some score band or similar. Consider a scenario like this website, which shows a total reputation score.
And yes - if an average is to be computed, you need the total and the count of feedback and then compute it - that's plain maths. But if you need any other method, be prepared for more compute cycles. Balance database hits against compute cycles, but that's the next stage of design. First get your requirements and your approach to a solution in place.
I think you should keep separate counters for 1 star, 2 stars, and so on. To calculate the rating, you'd compute:
rating = (1*numOneStars + 2*numTwoStars + 3*numThreeStars + 4*numFourStars + 5*numFiveStars) / (numOneStars + numTwoStars + numThreeStars + numFourStars + numFiveStars)
This way you can, like Amazon, also show how many people voted 1 star and how many voted 5 stars...
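As a quick illustration of that formula (the tallies below are made up):

star_counts = {1: 12, 2: 5, 3: 20, 4: 40, 5: 23}  # hypothetical vote tallies
total_votes = sum(star_counts.values())            # 100
rating = sum(stars * n for stars, n in star_counts.items()) / total_votes
print(rating)  # 3.57 - and star_counts itself drives the Amazon-style bars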
Have you considered a vote up/down mechanism over numbers of stars? It doesn't directly solve your problem but it's worth noting that other sites such as YouTube, Facebook, StackOverflow etc all use +/- voting as it is often much more effective than star based ratings.

What class of algorithms can be used to solve this?

EDIT: Just to make sure someone is not breaking their head on the problem... I am not looking for the best optimal algorithm. Some heuristic that makes sense is fine.
I made a previous attempt at formulating this and realized I did not do a great job at it so I removed that question. I have taken another shot at formulating my problem. Please feel free to provide any constructive criticism that can help me improve this.
Input:
N people
k announcements that I can make
Distance that my voice can be heard (say 5 meters) i.e. I may decide to announce or not depending on the number of people within these 5 meters
Goal:
Maximize the total number of people who have heard my k announcements and (optionally) minimize the time in which I can finish announcing all k announcements
Constraints:
Once a person hears my announcement, he is removed from the total, i.e. if he heard my first announcement, I do not count him even if he hears my second announcement.
I may see the same person, as well as the same set of people, within my proximity more than once.
Example:
Let us consider 10 people numbered from 1 to 10 and the following pattern of arrival:
Time slot 1: 1 (payoff = 1)
Time slot 2: 2 3 4 5 (payoff = 4)
Time slot 3: 5 6 7 8 (payoff = 4 if no announcement was made previously in time slot 2, 3 if an announcement was made in time slot 2)
Time slot 4: 9 10 (payoff = 2)
and I am given 2 announcements to make. Now if I were an oracle, I would choose time slots 2 and 3, because then 7 people would have heard an announcement (since 5 already heard my announcement in time slot 2, I do not count him again). I am looking for an online algorithm that will help me decide whether or not to make an announcement at each point, and if so, based on what factors. Does anyone have any ideas on what algorithms can be used to solve this, or a simpler version of this problem?
There should be an approach relying upon a max-flow algorithm. In essence, you're trying to push the maximum number of messages from start to end. Though it would be multidimensional, you could have a super-source which connects to each value of t, then have each value of t connect to the people you can reach at that time, and finally have those people connect to a super-sink. This way, you simply have to compute a max-flow (with the added constraint of no more than k shouts, which should be solvable with a bit of dynamic programming). It's a terrifically dirty way to solve it, but it should get the job done deterministically and without the use of heuristics.
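A rough sketch of that network on the example data, using networkx (note this ignores the "at most k announcements" constraint, which, as said, would need extra machinery such as DP over the slots):

import networkx as nx

# time slot -> people present (the example from the question)
slots = {1: [1], 2: [2, 3, 4, 5], 3: [5, 6, 7, 8], 4: [9, 10]}

G = nx.DiGraph()
for t, people in slots.items():
    G.add_edge("source", f"slot{t}", capacity=len(people))
    for p in people:
        G.add_edge(f"slot{t}", f"person{p}", capacity=1)
        G.add_edge(f"person{p}", "sink", capacity=1)  # each person counted once

flow_value, _ = nx.maximum_flow(G, "source", "sink")
print(flow_value)  # 10: everyone is reachable if you could announce at every slot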
I don't know that there is really a way to solve this or an algorithm to do it the way you have formulated it.
It seems like basically you are trying to reach the maximum number of people with exactly 2 announcements. But without knowing any information about the groups of people in advance, you can't really make any kind of intelligent decision about whether or not to use your first announcement. Your second one at least has the benefit of knowing when not to be used (i.e. if the group has no new members, then you know it's not worth wasting the announcement). But it still has basically the same problem.
The only real way to solve this is to use knowledge about the type of data or the desired outcome to make guesses. If you know that groups average 100 people with a standard deviation of 10, then you could just refuse to announce if less than 90 people are present. Or, if you know you need to reach at least 100 people with two announcements, you could choose never to announce to less than 50 at once. Obviously those approaches risk never announcing at all if the actual data does not meet what you would expect. But that's always going to be a risk, since you could get 1 person in the first group and then 0 in all of the rest, no matter what you do.
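A tiny sketch of that threshold rule (the numbers and names are hypothetical):

def should_announce(new_people_present, announcements_left,
                    expected_group=100, std_dev=10):
    """Announce only when the current group is within one standard deviation
    of the expected group size - a pure threshold heuristic, with the risk
    discussed above of never announcing at all."""
    return (announcements_left > 0
            and new_people_present >= expected_group - std_dev)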
Or, you could try more clearly defining the problem; as stated, I have a hard time figuring out how to relate this to computers.
Let's start by trying to solve the simplest possible variant of the problem: assume N people and K timeslots, but only one possible announcement. Let's also assume that each person will only ever stay for one timeslot, and that each person who hasn't yet shown up has an equally probable chance of showing up at any future timeslot.
Given these simplifications, at each timeslot you look at the payoff of announcing now and compare it to the chance of a future timeslot having a higher payoff. E.g., assume 4 people and 3 timeslots:
Timeslot 1: person 1 shows up, so you know you could get a payoff of 1 by announcing; but 3 people have yet to show up in the 2 remaining timeslots, so at least one of those timeslots is guaranteed to have 2 people - so don't announce.
So at each timeslot, you can calculate the chance that a later timeslot will have a higher payoff than the current one by treating the remaining N people and K timeslots as N independent random numbers, each drawn from 1..K, and calculating the chance of at least one slot being hit at least current-payoff times. (Similar to the birthday problem, but for more than one collision.) Then you need to decide how much to discount based on expected variances (a bird in the hand, etc.).
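Rather than deriving that collision probability in closed form, it can be estimated by simulation under the same assumptions; a minimal sketch:

import random

def prob_better_later(n_people, k_slots, current_payoff, trials=10_000):
    """Monte Carlo estimate: chance that some future slot gathers at least
    current_payoff people, assuming each remaining person picks one of the
    k remaining slots uniformly at random (the simplification above)."""
    better = 0
    for _ in range(trials):
        counts = [0] * k_slots
        for _ in range(n_people):
            counts[random.randrange(k_slots)] += 1
        if max(counts) >= current_payoff:
            better += 1
    return better / trials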
Generalization of this solution to the original problem is left as an exercise for the reader.

Algorithm to Rate Objects with Numerous Comparisons

Let's say I have a list of 500 objects. I need to rate each one out of 10.
At random I select two and present them to a friend. I then ask the friend which they prefer. I then use this comparison (ie OBJECT1 is better than OBJECT2) to alter the two objects' rating out of ten.
I then repeat this random selection and comparison thousands of times with a group of friends until I have a list of 500 objects with a reliable rating out of ten.
I need to figure out an algorithm which takes the two objects current ratings, and alters them depending on which is thought to be better...
Each object's rating could be (number of victories)/(number of contests entered) * 10. So the rating of the winner goes up a bit and the rating of the loser goes down a bit, according to how many contests they've previously entered.
For something more complicated and less sensitive to the luck of the draw with smaller numbers of trials, I'd suggest the Elo rating system (http://en.wikipedia.org/wiki/Elo_rating_system), but it's not out of 10. You could rescale everyone's scores so that the top score becomes 10, but then a match could affect everyone's rating, not just the ratings of the two objects involved.
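For reference, the core Elo update fits in a few lines (a sketch; the K-factor of 32 is a common choice, not mandated):

def elo_update(r_winner, r_loser, k=32):
    """Standard Elo: the winner gains more when the win is more surprising."""
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))  # P(winner wins)
    return r_winner + k * (1 - expected), r_loser - k * (1 - expected)

print(elo_update(1500, 1500))  # evenly matched: (1516.0, 1484.0)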
It all sort of depends what "reliable" means. Different friends' judgements will not be consistent with respect to each other, and possibly not even consistent over time for the same person, so there's no "real" sorted order for you to sanity-check the rankings against.
On a more abstruse point, Arrow's Impossibility Theorem states some nice properties that you'd like to have in a system that takes individual preferences and combines them to form an aggregated group preference. It then proceeds to prove that they're mutually inconsistent - you can't have them all. Any intuitive idea of a "good" overall rating runs a real risk of being unachievable.

Classifying Text Based on Groups of Keywords?

I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an algorithm that would give me a score ranking which of the categories each requirement is likely to fall into. The results would be used as a starting point to further categorize the requirements.
As an example, suppose I have the requirement:
The system shall apply deposits to a customer's specified account.
And categories/keywords:
Customer Transactions: deposits, deposit, customer, account, accounts
Balance Accounts: account, accounts, debits, credits
Other Category: foo, bar
I would want the algorithm to score the requirement highest in category 1, lower in category 2, and not at all in category 3. The scoring mechanism is mostly irrelevant to me, but needs to convey how much more likely category 1 applies than category 2.
I'm new to NLP, so I'm kind of at a loss. I've been reading Natural Language Processing in Python and was hoping to apply some of the concepts, but haven't seen anything that quite fits. I don't think a simple frequency distribution would work, since the text I'm processing is so small (a single sentence.)
You might want to look at the category of "similarity measures" or "distance measures" (which is different, in data-mining lingo, from "classification").
Basically, a similarity measure is a mathematical procedure by which you:
Take two sets of data (in your case, words)
Do some computation/equation/algorithm
The result being that you have some number which tells you how "similar" that data is.
With similarity measures, this number is a number between 0 and 1, where "0" means "nothing matches at all" and "1" means "identical"
So you can actually think of your sentence as a vector - and each word in your sentence represents an element of that vector. Likewise for each category's list of keywords.
And then you can do something very simple: take the "cosine similarity" or "Jaccard index" (depending on how you structure your data.)
What both of these metrics do is they take both vectors (your input sentence, and your "keyword" list) and give you a number. If you do this across all of your categories, you can rank those numbers in order to see which match has the greatest similarity coefficient.
As an example:
From your question:
Customer Transactions: deposits,
deposit, customer, account, accounts
So you could construct a vector with 5 elements: (1, 1, 1, 1, 1). This means that the "Customer Transactions" keyword list has 5 words, and (this will sound obvious, but) each of those words is present in that list. Stay with me.
So now you take your sentence:
The system shall apply deposits to a
customer's specified account.
This has 3 words from the "Customer Transactions" set: {deposits, customer, account}.
(actually, this illustrates another nuance: you actually have "customer's". Is this equivalent to "customer"?)
The vector for your sentence might be (1, 0, 1, 1, 0)
The 1's in this vector are in the same position as the 1's in the first vector - because those words are the same.
So we could ask: in how many places do these vectors differ? Let's compare:
(1,1,1,1,1)
(1,0,1,1,0)
Hm. They have the same "bit" 3 times - in the 1st, 3rd, and 4th positions. They differ by only 2 bits. So let's say that when we compare these two vectors, we have a "distance" of 2. Congrats, we just computed the Hamming distance! The lower your Hamming distance, the more "similar" the data.
(The difference between a "similarity" measure and a "distance" measure is that the former is normalized - it gives you a value between 0 and 1. A distance is just any number, so it only gives you a relative value.)
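For a normalized version of the same comparison, the Jaccard index takes only a few lines (a sketch; the tokenization is deliberately naive, so "customer's" does not match "customer" here):

def jaccard(a, b):
    """Jaccard index of two word sets: |A intersect B| / |A union B|, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 0.0

sentence = "the system shall apply deposits to a customer's specified account".split()
category = ["deposits", "deposit", "customer", "account", "accounts"]
print(jaccard(sentence, category))  # ~0.15: {deposits, account} over 13 distinct words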
Anyway, this might not be the best way to do natural language processing, but for your purposes it is the simplest and might actually work pretty well for your application, or at least as a starting point.
(PS: "classification" - as you have in your title - would be answering the question "If you take my sentence, which category is it most likely to fall into?" Which is a bit different than saying "how much more similar is my sentence to category 1 than category 2?" which seems to be what you're after.)
good luck!
The main characteristics of the problem are:
Externally defined categorization criteria (keyword list)
Items to be classified (lines from the requirements document) are made of a relatively small number of attribute values, along effectively a single dimension: "keyword".
As defined, there is no feedback/calibration (although it may be appropriate to suggest some).
These characteristics bring both good and bad news: the implementation should be relatively straightforward, but a consistent level of accuracy in the categorization process may be hard to achieve. Also, the small size of the various quantities (number of possible categories, max/average number of words in an item, etc.) should give us room to select solutions that are CPU- and/or space-intensive, if need be.
Yet, even with this license to "go fancy", I suggest starting with (and staying close to) a simple algorithm, expanding on this basis with a few additions and considerations, while remaining vigilant of the ever-present danger called overfitting.
Basic algorithm (Conceptual, i.e. no focus on performance trick at this time)
Parameters =
CatKWs = an array/hash of lists of strings. The list contains the possible
keywords, for a given category.
usage: CatKWs[CustTx] = ('deposits', 'deposit', 'customer' ...)
NbCats = integer number of pre-defined categories
Variables:
CatAccu = an array/hash of numeric values with one entry per each of the
possible categories. usage: CatAccu[3] = 4 (if array) or
CatAccu['CustTx'] += 1 (hash)
TotalKwOccurences = counts the total number of keyword matches (counted
multiple times when a word is found in several pre-defined categories)
Pseudo code: (for categorizing one input item)
1. for x in 1 to NbCats
CatAccu[x] = 0 // reset the accumulators
2. for each word W in Item
for each x in 1 to NbCats
if W found in CatKWs[x]
TotalKwOccurences++
CatAccu[x]++
3. for each x in 1 to NbCats
CatAccu[x] = CatAccu[x] / TotalKwOccurences // calculate rating
4. Sort CatAccu by value
5. Return the ordered list of (CategoryID, rating)
for all corresponding CatAccu[x] values above a given threshold.
Simple but plausible: we favor the categories that have the most matches, but we divide by the overall number of matches as a way of lessening the confidence rating when many words were found. Note that this division does not affect the relative ranking of categories for a given item, but it may be significant when comparing the ratings of different items.
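A direct Python transcription of the pseudocode above, for anyone who wants to experiment (identifiers kept close to the original; unoptimized, as intended):

def categorize_item(item_words, cat_kws, threshold=0.0):
    """Rate one item against every category (steps 1-5 above)."""
    cat_accu = {cat: 0 for cat in cat_kws}          # step 1: reset accumulators
    total_kw_occurrences = 0
    for w in item_words:                            # step 2: count matches
        for cat, kws in cat_kws.items():
            if w in kws:
                total_kw_occurrences += 1
                cat_accu[cat] += 1
    if total_kw_occurrences:                        # step 3: compute ratings
        cat_accu = {c: n / total_kw_occurrences for c, n in cat_accu.items()}
    # steps 4-5: sort and keep categories above the threshold
    return sorted(((c, r) for c, r in cat_accu.items() if r > threshold),
                  key=lambda cr: cr[1], reverse=True)

cat_kws = {"CustTx": {"deposits", "deposit", "customer", "account", "accounts"},
           "BalAcct": {"account", "accounts", "debits", "credits"}}
print(categorize_item("the system shall apply deposits to a "
                      "customer's specified account".split(), cat_kws))
# [('CustTx', 0.666...), ('BalAcct', 0.333...)] - category 1 ranks highest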
Now, several simple improvements come to mind (I'd seriously consider the first two, and give thought to the others; deciding on each of these is very much tied to the scope of the project, the statistical profile of the data to be categorized, and other factors):
We should normalize the keywords read from the input items and/or match them in a fashion that is tolerant of misspellings. Since we have so few words to work with, we need to ensure we do not lose a significant one because of a silly typo.
We should give more importance to words found in fewer of the CatKWs lists. For example, the word 'Account' should count less than the word 'foo' or 'credit' (see the sketch after this list).
We could (but maybe that won't be useful or even helpful) give more weight to the ratings of items that have fewer [non-noise] words.
We could also include consideration of bigrams (two consecutive words), since with natural languages (and requirements documents are not quite natural :-) ) word proximity is often a stronger indicator than the words themselves.
We could add a tiny bit of importance to the category assigned to the preceding (or even following, in look-ahead logic) item. Items will likely come in related series, and we can benefit from this regularity.
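The second improvement is essentially inverse document frequency applied to category lists; a sketch of one way to derive such weights (the exact formula is my choice, not the only option):

import math

def keyword_weights(cat_kws):
    """Weight each keyword by how few categories list it: with the question's
    three categories, 'account' (in two lists) gets ~1.41 while 'foo' or
    'credits' (in one list each) get ~2.10."""
    n_cats = len(cat_kws)
    df = {}                                  # category frequency per keyword
    for kws in cat_kws.values():
        for w in set(kws):
            df[w] = df.get(w, 0) + 1
    return {w: 1 + math.log(n_cats / n) for w, n in df.items()}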
Also, aside from the calculation of the rating per-se, we should also consider:
some metrics that would be used to rate the algorithm outcome itself (tbd)
some logic to collect the list of words associated with an assigned category, and to eventually run statistics on these. This may allow the identification of words representative of a category that were not initially listed in CatKWs.
The question of metrics should be considered early, but this would also require a reference set of input items: a "training set" of sorts, even though we are working off a pre-defined category-keyword dictionary (typically, training sets are used to determine this very list of category-keywords, along with weight factors). Of course, such a reference/training set should be both statistically significant and statistically representative [of the whole set].
To summarize: stick to simple approaches; the context doesn't leave room to be very fancy anyway. Consider introducing a way of measuring the efficiency of particular algorithms (or of particular parameters within a given algorithm), but beware that such metrics may be flawed and may prompt you to specialize the solution for a given set to the detriment of the other items (overfitting).
I was also facing the same issue of creating a classifier based only on keywords. I had a class-keyword mapper file, which contained a class variable and the list of keywords occurring in that particular class. I came up with the following algorithm, and it works really well.
# predictor algorithm
# assumes docKywrdmppr has 'Keywords' and 'Document Type' columns, and
# removeStopWords() returns a list of tokens with stop words removed
predictedDoc = []
for doc in readContent:
    catAccum = [0] * len(docKywrdmppr)  # one match counter per class
    for i in range(len(docKywrdmppr)):
        keywords = removeStopWords(docKywrdmppr['Keywords'][i].casefold())
        for word in removeStopWords(doc):
            if word.casefold() in keywords:
                catAccum[i] += 1  # count each keyword hit
    # assign the class with the most keyword hits
    ind = catAccum.index(max(catAccum))
    predictedDoc.append(docKywrdmppr['Document Type'][ind])
