K Nearest Neighbour Algorithm doubt - algorithm

I am new to Artificial Intelligence. I understand the K nearest neighbour algorithm and how to implement it. However, how do you calculate the distance or weight of things that aren't on a numeric scale?
For example, the distance between ages is easy to calculate, but how do you calculate how near red is to blue? Maybe colour is a bad example because you could still use the frequency. How about a burger to pizza to fries, for example?
I got a feeling there's a clever way to do this.
Thank you in advance for your kind attention.
EDIT: Thank you all for the very nice answers. They really helped and I appreciate it. But I am thinking there must be a way out.
Can I do it this way? Let's say I am using my KNN algorithm to predict whether a person will eat at my restaurant, which serves all three of the above foods. Of course there are other factors, but to keep it simple, for the field of favourite food, out of 300 people, 150 love burgers, 100 love pizza, and 50 love fries. Common sense tells me favourite food affects people's decision on whether to eat there or not.
So now a person enters his/her favourite food as burger and I am going to predict whether he/she is going to eat at my restaurant. Ignoring other factors, and based on my previous (training) knowledge base, common sense tells me there's a higher chance the k nearest neighbours' distance for this particular field (favourite food) is smaller than if he had entered pizza or fries.
The only problem with that is that I used probability, and I might be wrong because I don't know, and probably can't calculate, the actual distance. I also worry about this field putting too much or too little weight on my prediction, because the distance probably isn't to scale with the other factors (price, time of day, whether the restaurant is full, etc. that I can easily quantify), but I guess I might be able to get around that with some parameter tuning.
Oh, everyone put up a great answer, but I can only accept one. In that case, I'll just accept the one with highest votes tomorrow. Thank you all once again.

Represent all food for which you collect data as a "dimension" (or a column in a table).
Record "likes" for every person on whom you can collect data, and place the results in a table:
         | Burger | Pizza | Fries | Burritos | Likes my food
person1  |   1    |   0   |   1   |    1     |      1
person2  |   0    |   0   |   1   |    0     |      0
person3  |   1    |   1   |   0   |    1     |      1
person4  |   0    |   1   |   1   |    1     |      0
Now, given a new person, with information about some of the foods he likes, you can measure similarity to other people using a simple measure such as the Pearson Correlation Coefficient, or the Cosine Similarity, etc.
Now you have a way to find K nearest neighbors and make a decision.
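For illustration, here's a rough sketch of that idea in Python, using cosine similarity over the binary like-vectors from the table above (the data, the choice of k, and the majority-vote rule are just assumptions for the example):

import numpy as np

# Hypothetical like-vectors from the table: columns are Burger, Pizza, Fries, Burritos
people = np.array([
    [1, 0, 1, 1],   # person1
    [0, 0, 1, 0],   # person2
    [1, 1, 0, 1],   # person3
    [0, 1, 1, 1],   # person4
])
likes_my_food = np.array([1, 0, 1, 0])  # the "Likes my food" column

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_predict(new_person, k=3):
    # Similarity of the new person to everyone already in the table
    sims = np.array([cosine_similarity(new_person, p) for p in people])
    nearest = np.argsort(sims)[::-1][:k]                # indices of the k most similar people
    return int(likes_my_food[nearest].mean() >= 0.5)    # majority vote among the neighbours

print(knn_predict(np.array([1, 0, 1, 0])))  # a new person who likes burgers and fries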
For more advanced information on this, look up "collaborative filtering" (but I'll warn you, it gets math-y).

Well, 'nearest' implies that you have some metric on which things can be more or less 'distant'. Quantification of 'burger', 'pizza', and 'fries' isn't so much a KNN problem as it's about fundamental system modeling. If you have a system where you're doing analysis where 'burger', 'pizza', and 'fries' are terms, the reason for the system to exist is going to determine how they're quantified -- like if you're trying to figure out how to get the best taste and least calories for a given amount of money, then ta-da, you know what your metrics are. (Of course, 'best taste' is subjective, but that's another set of issues.)
It's not up to these terms to have inherent quantifiability and thereby to tell you how to design your system of analysis; it's up to you to decide what you're trying to accomplish and design metrics from there.

This is one of the problems of knowledge representation in AI. Subjectivity plays a big part. Would you and I agree, for example, on the "closeness" of a burger, pizza, and fries?
You'd probably need a look up matrix containing the items to be compared. You may be able to reduce this matrix if you can assume transitivity, but I think even that would be uncertain in your example.
The key may be to try and determine the feature that you are trying to compare on. For example, if you were comparing your food items on health, you may be able to get at something more objective.

If you look at "Collective Intelligence", you'll see that they assign a scale and a value. That's how Netflix is comparing movie rankings and such.
You'll have to define "nearness" by coming up with that scale and assigning values for each.

I would actually present pairs of these attribute values to users and ask them to define their proximity. You would present them with a scale ranging from [synonym..very foreign] or similar. Having many people do this, you will end up with a widely accepted proximity function for the non-numeric attribute values.

There is no "best" way to do this. Ultimately, you need to come up with an arbitrary scale.

Good answers. You could just make up a metric or, as malach suggests, ask some people. To really do it right, it sounds like you need Bayesian analysis.

Related

How to give players a score on a ranking/prediction task?

I have a website built with php/mysql, and I am looking for help in communicating to a Programmer what I want him to do with a Poll/Prediction game that I am trying to create.
For purposes of discussion, assume a game where perhaps 100 players try to predict the top 5 finishers in a Golf Tournament of perhaps 9 Golfers.
I am looking for help in how to create and assign a score based upon the accuracy of prediction.
The players provide a rank ordering using a drag-and-drop function to order the golfers from 1 through 5. This ordering has already been coded, and the ranks are stored somehow in the DB (I do not know how).
My initial thinking is to ask the coder to create a script which will assign a score from 1 to 5 for each Golfer that the player nominated to be in the Top 5.
So, a player who predicted perfectly would be awarded a perfect score of 12345.
Their first golfer receives a 1 for finishing first, their second a 2 for finishing second, their third a 3 for finishing third, and so on.
Anybody less than perfect would have a score higher than 12345.
Players who got the first four positions correct would have to be differentiated on the basis of the finish of their fifth Golfer.
So, one might score 12347 and the other 12348 and the player with the highest score (12348) would be the loser in a matchup of the two players.
A player who did poorly, might have a score of 53419.
Question:
Is this a viable way of creating a score which the players of my game can be ranked upon?
Is it possible to instead simply have something like a Spearman rank-order correlation calculated, comparing the actual finish positions with the predicted finish positions for each player, and then rank players on the basis of the correlation coefficients for their rankings?
Thanks for any help in clarifying how to conceptualize this before approaching a programmer who gets annoyed when I don't really know what I want him to do ahead of time.
It's a quite interesting problem.
It seems that there are three components that need to be considered in the scoring: the number of correct predictions, the order of correct predictions, and the weight of correct predictions.
For example, assume the truth is:
1,5,10,15,20
Here are some predictions:
1,6,7,8,9 : only predicted first one
2,1,10,21,30 : 1 and 10, but the order of 1 is incorrect
20,15,1,5,30 : hit four in the top 5, but the orders are incorrect
It depends on what you value most. You may first check how many of the top 5 the user has predicted and add a value for each hit, then penalize wrong orders. The weight for each position should also be different, so that
1,5,10,15,20 will rank higher than 1,5,10,20,15 and higher than 1,10,5,20,15
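A minimal sketch of that kind of scoring (the weights and the exact-position bonus are arbitrary choices, just to show the shape of the calculation):

# Hypothetical position weights: getting the winner right matters most
WEIGHTS = [5, 4, 3, 2, 1]

def score_prediction(predicted, actual):
    # Reward correct picks, with a bonus when the position also matches
    actual_pos = {golfer: i for i, golfer in enumerate(actual)}
    score = 0
    for i, golfer in enumerate(predicted):
        if golfer in actual_pos:
            score += WEIGHTS[actual_pos[golfer]]   # credit for picking a top-5 golfer
            if actual_pos[golfer] == i:
                score += WEIGHTS[i]                # bonus for the exact position
    return score

actual = [1, 5, 10, 15, 20]
print(score_prediction([1, 5, 10, 15, 20], actual))  # perfect prediction: highest score
print(score_prediction([1, 5, 10, 20, 15], actual))  # last two swapped: a bit lower
print(score_prediction([1, 10, 5, 20, 15], actual))  # more out of order: lower still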
Spearman may work, but I feel it could be too coarse for your purpose.
This is actually a very similar problem to the one search engines have. E.g., in search engine evaluation, the actual outcomes are preferred results provided by humans, and the predicted outcomes are the results delivered by the search engine. In both your task and for search engines, I'd guess you care a lot more about the accuracy of the winner than the accuracy of the 5th-place finisher. If that is the case, then the mean average precision is probably a good measure.
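For what it's worth, average precision for a single player's prediction could be sketched like this (mean average precision is just this value averaged over all players):

def average_precision(predicted, actual):
    # Average of precision@i at every position where the prediction hits an actual top-5 golfer
    actual_set = set(actual)
    hits, precisions = 0, []
    for i, golfer in enumerate(predicted, start=1):
        if golfer in actual_set:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(actual) if precisions else 0.0

actual = [1, 5, 10, 15, 20]
print(average_precision([1, 6, 7, 8, 9], actual))     # only the winner right -> 0.2
print(average_precision([20, 15, 1, 5, 30], actual))  # four of the top 5 right -> 0.8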

Simulating amazon.com best seller for books

I was just going through amazon.com, and an interesting thing that caught my eye is how they calculate best sellers in books.
I was thinking of writing a sample program to calculate this. Suppose I am calculating best sellers for the month: do I just sum the sales counts of the individual books and show the top 10? Is that OK, or am I missing something?
EDIT
One more interesting thing can happen: suppose one book (id1) sold 10 copies on the first day but nothing after that, while another book (id2) keeps selling 1 or 2 copies regularly. How would that affect the best-seller calculation? Thanks.
Sounds about right. Depends on how exactly you want to define it.
"best sellers" is the number of units sold.
Another way to do it, if you don't want to fix it to one month, is to use some decay function (like inverse-square decay, 1/t^2) and add the sale counts weighted by that function.
This way, even though you don't have a fixed time window, you look at both newcomers and old books. Your function should look like this:
from datetime import datetime

for a_book in books:
    score = 0
    for a_sale in sales[a_book]:
        age_days = (datetime.now() - a_sale.time()).days   # age of the sale in days
        score += 1 / ((age_days + 1) ** 2)                  # inverse-square decay; +1 avoids dividing by zero for today's sales
I think you get the idea. You can try different functions, like dividing by exp(days) instead of days squared, or using different powers. Experiment and see what makes sense for you.

Converting preferences to ratings

Suppose I have a list of (e.g.) restaurants. A lot of users get a list of pairs of restaurants, and select the one of the two they prefer (a la hotornot).
I would like to convert these results into absolute ratings: For each restaurant, 1-5 stars (rating can be non-integer, if necessary).
What are the general ways to go with this problem?
Thanks
I would consider each pairwise decision as a vote in favor of one of the restaurants, and each non-preferred partner as receiving a downvote. Count the votes across all users and restaurants, then sort the restaurants and split them into five equally sized clusters (so that each star level "weighs" a fixed number of votes).
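A rough sketch of that counting-and-bucketing approach (the sample choices and the five-way split are made up for illustration):

from collections import Counter

# Hypothetical pairwise choices: (preferred restaurant, rejected restaurant)
choices = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("A", "B")]

net_votes = Counter()
for preferred, rejected in choices:
    net_votes[preferred] += 1   # a vote in favour of the chosen restaurant
    net_votes[rejected] -= 1    # a downvote for the non-preferred partner

# Sort by net votes and bucket into five roughly equal clusters (stars 1-5)
ranked = sorted(net_votes, key=net_votes.get)
stars = {r: 1 + (5 * i) // len(ranked) for i, r in enumerate(ranked)}
print(stars)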
Elo ratings come to mind. It's how the chess world computes a rating from your win/loss/draw record. Losing a matchup against an already-high-scoring restaurant gets penalized less than against a low-scoring one, a little like how PageRank cares more about a link from a website it also ranks highly. There's no upper bound to your possible score; you'd have to renormalize somehow for a 1-5 star system.
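A minimal Elo-style sketch, assuming the standard update rule with an arbitrary K-factor of 32 and a simple min-max rescaling onto 1-5 stars at the end:

def update_elo(winner, loser, ratings, k=32):
    # Standard Elo update: an upset win moves the ratings more than an expected win
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected_win)
    ratings[loser] -= k * (1 - expected_win)

# Hypothetical pairwise preferences: (preferred restaurant, rejected restaurant)
comparisons = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]
ratings = {name: 1500.0 for pair in comparisons for name in pair}

for preferred, rejected in comparisons:
    update_elo(preferred, rejected, ratings)

# Rescale the unbounded Elo scores onto a 1-5 star range
lo, hi = min(ratings.values()), max(ratings.values())
stars = {name: 1 + 4 * (score - lo) / (hi - lo) for name, score in ratings.items()}
print(stars)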

Voting algorithm: how to calculate rank?

I am trying to figure out a way to calculate rank. Right now it simply takes the ratio of wins/losses for each individual entry, so e.g. if one won 99 times out of 100, it has a 99% winning rank. BUT if an entry won 1 out of a total of 1 votes, it will have a 100% winning rank, yet it definitely can't rank higher than the one that won 99 times. What would be a better way to do this?
Try something like this:
votes = wins + losses
score = votes * ( wins / votes )
That way, something with 50% wins, but a million votes would still be ahead of something with 100% wins but only one vote.
You can add in an extra weight based on age (in days in this example), too, something like
if age < 5:
    score = score + ((highest real score on site) * ((5 - age) / 5))
This will put brand new entries right at the top of the first page, and then they will move slowly down the list over the course of the next 5 days (I'm assuming age is a fractional number, not just an integer). After the 5 days are up, they will be put in the list based solely on the score from the previous bit of pseudo-code.
Depending on how complicated you want to make it, the Elo system chess uses (or something similar) may be what you want: http://en.wikipedia.org/wiki/Elo_rating_system
Even if a person has won 1/1 matches, his rating would be far below someone who has won/lost hundreds of matches against tough opponents, for instance.
You could always use a point system rather than a win/loss ratio. Winning would always give points, and then you could play around with either removing points for losing, not awarding points at all for losing, or awarding fewer points for losing. It all depends on exactly how you want people to be ranked. For example, you may want to give 2 points for winning and 1 point for losing if you want to favor people who participate over those who do not (which sounds kind of like what you were talking about in your example of the person playing 100 games vs 1 game). The NHL uses a similar technique for rankings (2 points for a win, 1 point for an overtime loss, 0 points for a regular loss). That might give you some more flexibility.
If I understand the question correctly, then whoever gets more votes has the higher rank.
Would it make sense to add more rank to winning entry if losing entry originally had a much higher rank, e.g. much stronger competitor?

When calculating trends, how do you account for low sample size?

I'm doing some work processing some statistics for home approvals in a given month. I'd like to be able to show trends - that is, which areas have seen a large relative increase or decrease since the last month(s).
My first naive approach was to just calculate the percentage change between two months, but that has problems when the counts are very low: any change at all is magnified:
// diff = (new - old) / old
Area | June | July | Diff |
--------------|--------|--------|--------|
South Sydney | 427 | 530 | +24% |
North Sydney | 167 | 143 | -14% |
Dubbo | 1 | 3 | +200% |
I don't want to just ignore any area or value as an outlier, but I don't want Dubbo's increase of 2 per month to outshine the increase of 103 in South Sydney. Is there a better equation I could use to show more useful trend information?
This data is eventually being plotted on Google Maps. In this first attempt, I'm just converting the difference to a "heatmap colour" (blue - decrease, green - no change, red - increase). Perhaps using some other metric to alter the view of each area might be a solution; for example, changing the alpha channel based on the total number of approvals or something similar. In this case, Dubbo would be bright red but quite transparent, whereas South Sydney would be closer to yellow but quite opaque.
Any ideas on the best way to show this data?
Look into measures of statistical significance. It could be as simple as assuming counting statistics.
In a very simple-minded version, the thing you plot is
(A_2 - A_1) / sqrt(A_2 + A_1)
i.e. the change measured in units of one sigma under simple counting statistics.
Which makes the above chart look like:
Area          | Reduced difference
--------------|-------------------
South Sydney  |       +3.3
North Sydney  |       -1.3
Dubbo         |       +1.0
which is interpreted as meaning that South Sydney has experienced a significant (i.e. important, and possibly related to a real underlying cause) increase, while North Sydney and Dubbo felt relatively minor changes that may or may not point to a trend. Rule of thumb:
1 sigma changes are just noise
3 sigma changes probably point to an underlying cause (and therefore the expectation of a trend)
5 sigma changes almost certainly point to a trend
Areas with very low rates (like Dubbo) will still be volatile, but they won't overwhelm the display.
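For reference, a quick sketch of that calculation (rounding may differ slightly from the table above):

from math import sqrt

approvals = {"South Sydney": (427, 530), "North Sydney": (167, 143), "Dubbo": (1, 3)}

for area, (old, new) in approvals.items():
    reduced_diff = (new - old) / sqrt(new + old)   # the change measured in counting-statistics sigmas
    print(f"{area:>13}: {reduced_diff:+.1f}")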
This is really a statistics question. I'm not a statistician, but I suspect the answer is along the lines of: well, you have no data, so what do you expect?
Perhaps you could merge Dubbo with a nearby region? You've sliced your data small enough that your signal has fallen below noise.
You could also just not show Dubbo, or make a color for not enough data.
I kinda like your transparency idea -- the data you're confident about is opaque and the data you're not confident about is transparent. It's easy for the user to understand, but it will look cluttered.
My take: don't use a heatmap. It's for continuous data, while yours is discrete. Use dots. Color represents the increase/decrease in the surrounding region, and the size of each dot is proportional to the raw volume.
Now, how does the user know which region a dot represents? Where does South Sydney end and North Sydney begin? The best approach would be to add Voronoi-like guiding lines between the dots, but smartly placed rectangles will do too.
If you happen to have the area of each region in units such as sq. km, you can normalize your data by calculating home approvals per km^2 to get a home-approval density, and use that in your equation rather than the raw count of home approvals. This would fix the problem if Dubbo has fewer home approvals than other regions due to its size. You could also normalize by population, if you have it, to get the number of home approvals per person.
Maybe you could use the totals. Add all old and new values, which gives old = 595, new = 676, diff = +13.6%. Then calculate the changes based on the old total, which gives you +17.3% / -4.0% / +0.3% for the three places.
With a heat map you are generally attempting to show easily assimilated information. Anything too complex would probably be counter-productive.
In the case of Dubbo, the reality is that you don't have the data to draw any firm conclusions about it, so I'd color it white, say. You could possibly label it with the difference/current value too.
I think this would be preferable to possibly misleading the users.
I would highly recommend going with a hierarchical model (i.e., partial pooling). Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill is an excellent resource on the topic.
You can use an exact test like Fisher's exact test (http://en.wikipedia.org/wiki/Fisher%27s_exact_test), or Student's t-test (http://en.wikipedia.org/wiki/Student%27s_t-test), both of which are designed for low sample sizes.
As a note, the t-test is pretty much the same as a z-test, but in the t-test you don't have to know the standard deviation, nor do you have to approximate it as you would for a z-test.
You can apply a z or t test without any justification in 99.99% of cases because of the Central Limit Theorem (http://en.wikipedia.org/wiki/Central_limit_theorem); formally, you only need the underlying distribution X to have finite variance. You don't need justification for the Fisher test either; it's exact and does not make any assumptions.
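As a hedged illustration, Fisher's exact test on the Dubbo numbers might look like this, treating Dubbo vs. all other areas over the two months as a 2x2 contingency table (one possible framing, not the only one):

from scipy.stats import fisher_exact

#                June  July
table = [[  1,     3],        # Dubbo
         [594,   673]]        # all other areas combined (595 - 1, 676 - 3)

odds_ratio, p_value = fisher_exact(table)
print(p_value)   # a large p-value means Dubbo's jump is consistent with pure chance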
