A/B testing on a news site to improve relevance - algorithm

If you were running a news site that created a list of 10 top news stories, and you wanted to make tweaks to your algorithm and see if people liked the new top story mix better, how would you approach this?
Simple Click logging in the DB associated with the post entry?
A/B testing where you would show one version of the algorithm to group A and another to group B and measure the clicks?
What sort of characteristics would you base your decision on as to whether the changes were better?

A/B test seems a good start, and randomize the participants. You'll have to remember them so they never see both.
You could treat it like a behavioral psychology experiment, do a T-Test etc...
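One way to "remember" participants without storing an assignment table is to derive the group from a hash of a stable user identifier. This is only a minimal sketch, assuming each visitor has a stable ID (for example from a login or a long-lived cookie); the function name `assign_group` and the experiment label are made up for illustration.

```python
import hashlib


def assign_group(user_id: str, experiment: str = "top-stories-v2") -> str:
    """Deterministically map a user to group "A" or "B".

    The same (experiment, user_id) pair always lands in the same group,
    so a returning visitor never sees both versions of the list.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the hash as an integer and split users roughly 50/50.
    return "A" if int(digest, 16) % 2 == 0 else "B"


group = assign_group("user-12345")  # hypothetical user ID
```

Including the experiment label in the hash means a later experiment reshuffles users instead of reusing the same split.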

In addition to monitoring number of clicks, it might also be helpful to monitor how long they look at the story they clicked on. It's more complicated data, but provides another level of information. You would then not only be seeing if the stories you picked out grab the user's attention, but also that the stories are able to keep it.
You could do statistical analysis (e.g. a t-test like Tim suggested), but you probably won't get a low enough standard deviation on either measure to prove significance. It may not really matter, though: all you need is for one of the algorithms to have a higher average number of clicks and/or time spent. No need to fool around with hypothesis testing, hopefully.
Of course, there is always the option of simply asking the user if the recommendations were relevant, but that may not be feasible for your situation.
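If you do want to check whether a difference in average clicks (or time spent) is more than noise, Welch's unequal-variance t-test is one common choice. The sketch below uses only the standard library and returns the t statistic and the degrees of freedom; the click counts are invented, and turning the result into a p-value still needs a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances
    se2 = va / na + vb / nb                          # squared standard error
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical clicks per user on the top-10 list for each group.
clicks_a = [2, 3, 1, 4, 2, 3, 2, 1]
clicks_b = [3, 4, 2, 5, 3, 4, 3, 4]
t, df = welch_t(clicks_a, clicks_b)
```

A negative `t` here means group B clicked more on average; how large it has to be before you trust it depends on `df` and the significance level you choose.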


What is Pinterest's popular pins algorithm?

How are they counting whether a pin is popular or not? I had something like this in mind:
They probably won't tell anyone outside...
But from my experience, popularity algorithms start off like the one you suggested, but tend to be refined and get more complex over time. Every similar algorithm I have developed so far kept evolving for months even after go-live. As people try to trick the algorithm into making their own stuff more popular than other stuff, additional constraints are added to avoid this.
Most probably there will be more factors involved, e.g.
score depending on user score of the users who took action (reposted, commented, ...), i.e. commented twice by "VIP" users is more popular than commented twice by newbies
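As a rough illustration of weighting actions by the score of the user who performed them (this is not Pinterest's actual algorithm, which is not public), here is a minimal sketch. The action names, point values and user weights are all made up.

```python
def popularity(actions, user_weight, default_weight=1.0):
    """Sum action points, each scaled by the acting user's score.

    actions: iterable of (user, action_kind) pairs.
    user_weight: dict mapping user -> weight; unknown users get default_weight.
    """
    # Made-up point values per action kind.
    action_points = {"repin": 3.0, "comment": 2.0, "like": 1.0}
    return sum(
        action_points.get(kind, 0.0) * user_weight.get(user, default_weight)
        for user, kind in actions
    )


weights = {"vip": 5.0}
by_vip = popularity([("vip", "comment"), ("vip", "comment")], weights)
by_newbies = popularity([("n1", "comment"), ("n2", "comment")], weights)
```

With these numbers two comments from the "VIP" user outweigh two comments from newcomers, which is the effect described above.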

How would a rating system be done for users in a web application?

I am implementing a web application that has many users, and I would like to give the users a rating based on their activities and on other users liking their activities. How would I implement such an algorithm? I am looking for an elegant and smart algorithm that could help.
You are basically looking for scoring algorithms. These articles might help:
How not to sort by average rating
Rank hotness with Newton's Law of Cooling
How Reddit Ranking Algorithms work
Hope this helps.
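The first article in that list is, as far as I recall, the one that recommends ranking by the lower bound of the Wilson score confidence interval instead of by the raw average. A minimal sketch of that formula, assuming each item only has positive and negative ratings and using z = 1.96 for roughly 95% confidence:

```python
from math import sqrt


def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a positive-rating share."""
    if total == 0:
        return 0.0
    p = positive / total
    denominator = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / denominator


many_votes = wilson_lower_bound(90, 100)  # 90% positive, lots of evidence
few_votes = wilson_lower_bound(9, 10)     # 90% positive, little evidence
one_vote = wilson_lower_bound(1, 1)       # 100% positive, a single vote
```

The point of the lower bound is that an item with 90 of 100 positive ratings ranks above one with 9 of 10, and both rank above a single perfect vote.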
Maybe your answer is staring right at you next to your username on this site :-) Stackoverflow.com's scoring system and badges are here to promote certain behaviors on the site. The algorithm is simple and the feedback is immediate so that everybody can see the consequences of certain actions.
What are the ratings used for? If you want to use the ratings as incentives for your users to encourage a specific behavior, then I believe you need to look at disciplines like behavioral psychology to figure out what behaviors you want to measure and reward.
If you already have a user base that reflects the typical user base you're trying to address, you might want to try simple trial and error. Pick some actions, e.g. receiving a like on a post, and add points to the user's score whenever that happens. Watch the user community's reaction when you introduce the scoring system and see if it helps motivate the behavior you want. If not, try to change some other parameters and repeat.
Depending on your system, some users might try to game the system, so you could find yourself locked into an eternal cat and mouse game once you introduce a rating system (example: Google page ranking).

Estimating a project with many unknowns

I'm working on a project with many unknowns like moving the app from one platform to another.
My original estimations are way off and there is no way I can really know for sure when this will end.
How can I deal with the inability to estimate such a project? It's not that I'm adding a button to a screen or designing a web site, or creating an app or even fixing bugs. These are not methods with bugs; these are assumptions made in the overall code which are no longer correct. They are found step by step, and each one is analyzed and mitigated, revealing many more unknowns.
I happened to write a master thesis about software-estimation and there are lessons I've learned:
-1st count, 2nd compute, 3rd judge: first try to identify items in your work which are countable, e.g. files, classes, LOCs, UIs, etc. Then use this data to calculate the effort (in person-days). Use judgement as the last resort.
-Document your estimation! Show numbers. This minimizes your risk, because you present results not as your opinion but as more or less objective figures. (In general, the more paper the cleaner the backside)
-Estimation is not a commitment. A commitment is one number; an estimate is always a range, so give your estimate as a range (use the cone of uncertainty to select the range properly: http://www.construx.com/Page.aspx?hid=1648)
-Divide: use a WBS, divide your work into small pieces and estimate them separately. The granularity depends on the entire length, but a work package shouldn't be bigger than 10% of the entire effort.
-Estimate effort first, then schedule, then costs.
-Consider estimation as support for planning, and re-estimate in each project phase (see the cone of uncertainty).
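The "divide" and "range" points in the list above can be sketched in a few lines. This is only an illustration: the package names and day counts are invented, and summing all best cases and all worst cases is the simplest (and widest) way to combine ranges.

```python
def summarize(packages):
    """packages: dict of name -> (best_case_days, worst_case_days).

    Returns (total_best, total_worst, oversized), where oversized lists the
    packages whose worst case exceeds 10% of the total worst-case effort.
    """
    total_best = sum(best for best, _ in packages.values())
    total_worst = sum(worst for _, worst in packages.values())
    oversized = [
        name for name, (_, worst) in packages.items()
        if worst > 0.10 * total_worst
    ]
    return total_best, total_worst, oversized


# Hypothetical work breakdown: eleven small packages and one large one.
wbs = {f"module-{i}": (1, 2) for i in range(11)}
wbs["port-storage-layer"] = (5, 10)
best, worst, oversized = summarize(wbs)
```

Anything that shows up in `oversized` is a candidate for being split further before you report the range.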
I would suggest the book http://www.stevemcconnell.com/est.htm which deals with all these points, in particular how to deal with bosses who try to pull a commitment from you.
There's no really right answer for coming up with an accurate estimation, because there's no way to know it.
As for estimating the work itself, think about how each step can be divided into separate sub-steps, and break those down even smaller, until you get a fair picture of as much of the work as you can, with chunks small and discrete enough to give sound estimates for. If you can, come up with both an expected time and a worst-case time, to get a range of where you could land.
Another way to approach this is to ignore the old system. It sounds like a headache. Make an estimate of scrapping the old system and implementing a new one from scratch, or integrating a 3rd party, off-the-shelf solution. If there's a case to be made for this, it is worth at least investigating it.
Tell him more or less what you told us. The project is too volatile to give an accurate estimate, and the best you can do is give an estimate for a given task. As long as the number of tasks is unknown, so will be the estimate. If he is at all worth his salary he would rather hear this than some made-up number. This is not uncommon when dealing with a large legacy code base.
That is a real problem. You can not estimate what you don't have experience in. The only thing you can do is pad your estimate until you think it is a reasonable amount of time. The more unknowns you think there are the more you pad. The less you know about it the more you pad.
I read the below book and it spoke at length about accuracy vs precision. Basically you can be accurate but have a very large range. For instance you can be certain the task will be between 1 day and 1 year to complete. That is not very precise but it is really accurate.
Software Estimation Demystifying...
Some tips for estimating

algorithms to evaluate user responses

I'm working on a web application which will be used for classifying photos of automobiles. The users will be presented with photos of various vehicles, and will be asked to answer a series of questions about what they see. The results will be recorded to a database, averaged, and displayed.
I'm looking for algorithms to help me identify users which frequently don't vote with the group, indicating that they're probably either not paying attention to the photos, or that they're lying about what they see. I then want to exclude these users, and recalculate the results, such that I can say, with a known amount of confidence, that this particular photo shows a vehicle that is this and that.
This question goes out to all you computer science guys: where can I find such algorithms, or the theoretical background to design them myself? I'm assuming I'm going to have to learn some probability and statistics, maybe some data mining. Some book recommendations would be great. Thanks!
P.S. These are multiple choice questions.
Read The Elements of Statistical Learning, it is a great compendium on data mining.
You may be especially interested in unsupervised algorithms, for example clustering. Assuming that most people do not lie, the biggest cluster is right and the rest is wrong. Mark people accordingly, then apply some Bayesian statistics and you'll be done.
Of course, most data mining techniques are pretty experimental, so don't count on them always being right... or even being right in most cases.
I believe what you described is solved using outlier/anomaly detection.
A number of techniques exist:
statistical-based methods
distance-based methods
model-based methods
I suggest you take a look at these slides from the excellent book Introduction to Data Mining
If you know what answers you are expecting, why do you ask people to vote? By excluding some values you basically turn the vote into something that you like. Automobiles make different impressions on different individuals. If 100 people loved a car and then someone comes and says that he/she doesn't like it, do you exclude the vote?
But anyway, considering that you still want to do this, first of all you will need a large set of data from "trusted" voters. This will give you an idea of a "good" answer, and from this point you can choose the exclusion threshold.
Without an initial set of data you cannot apply any algorithm, because you will get false results. Suppose the first vote is 100 on a scale from 0 to 100 and the second vote is 1. Then you will exclude the second vote because it is too far away from the average.
I think a pretty simple algorithm could accomplish this for you. You could try and get fancier by calculating the standard deviations and such, but I wouldn't bother.
Here's a simple approach that should be sufficient:
For each of your users, calculate the number of questions they answered and the number of times they selected the most popular answer for the question. The users with the lowest ratio of popular answers to total answers are the ones you can guess are providing bogus data.
You probably would not want to throw out the data from users where they've only answered a small number of questions because they likely have just disagreed on a few versus putting in bogus data.
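The simple approach above might look like the following sketch. The tuple layout, the function name and the `min_answers` cut-off of 5 are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def agreement_ratios(votes, min_answers=5):
    """votes: iterable of (user, question, answer) tuples.

    Returns {user: share of their answers that matched the most popular
    answer}, skipping users with fewer than min_answers answers.
    """
    votes = list(votes)
    tallies = defaultdict(Counter)
    for _, question, answer in votes:
        tallies[question][answer] += 1
    # Most popular answer per question (ties go to the answer seen first).
    popular = {q: c.most_common(1)[0][0] for q, c in tallies.items()}

    answered, agreed = Counter(), Counter()
    for user, question, answer in votes:
        answered[user] += 1
        if answer == popular[question]:
            agreed[user] += 1
    return {u: agreed[u] / n for u, n in answered.items() if n >= min_answers}


# Hypothetical data: a, b and c agree; d always disagrees; e answered once.
votes = [(u, q, "sedan") for u in "abc" for q in range(5)]
votes += [("d", q, "truck") for q in range(5)]
votes += [("e", 0, "truck")]
ratios = agreement_ratios(votes)
```

Users at the bottom of `ratios` are the ones to look at more closely; user `e` is left out entirely because one answer says very little.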
What kind of questions are they (Yes/No, or 1 to 10?).
You may be able to get away with not discarding anything by using a median instead of an average (mean). With averages, extreme outliers in the responses can skew the result, but if you use the median you may get a better answer. So for example if you had 5 answers, order them and pick the middle one.
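A tiny example of how one extreme response moves the average but not the median; the five ratings are invented.

```python
from statistics import mean, median

# Five hypothetical ratings on a 1-10 scale; the single 1 is the outlier.
ratings = [7, 8, 7, 8, 1]

average = mean(ratings)   # pulled down by the outlier
middle = median(ratings)  # sort the values and take the middle one

print(average, middle)
```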
I think what you are saying is that you are concerned that certain people are "outliers", and they are adding noise to your data, making the categorizations less reliable. So, if you have a Chevy Camaro, and most people say it is either a pony car, a muscle car, or a sports car, but you have some goofball who says it's a family sedan, you would want to minimize the impact of his vote.
One thing you could do is provide a Stack Overflow-like reputation score for users:
The more a user is "in agreement" with other users, the better his or her score would be. For a given user (User X), this could be determined by a simple calculation of what percentage of users who responded to a question chose the same category as User X, then averaging this value over all questions answered.
You may want to multiply this value by the total number of questions answered to encourage people to answer as many questions as possible. (Note: if you choose to do this, it would be equivalent to just summing the percentage agreement scores rather than averaging them.)
You could present the final reputation score to users, making sure to explain that they will be rewarded for how well their responses agree with those of other users. This will encourage people to answer more questions but also to take care in their answers.
Finally, you could calculate a certainty score for a given categorization by adding up the total reputation score of all people who chose a given category.
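The reputation and certainty scores described above can be sketched as follows. The data layout and function names are assumptions, and the agreement share here counts the user's own vote among the respondents, which is one possible reading of the description.

```python
from collections import Counter, defaultdict


def reputation(votes):
    """votes: list of (user, question, category).

    A user's reputation is the average, over the questions they answered,
    of the share of respondents who chose the same category.
    """
    tallies = defaultdict(Counter)
    for _, question, category in votes:
        tallies[question][category] += 1
    shares = defaultdict(list)
    for user, question, category in votes:
        respondents = sum(tallies[question].values())
        shares[user].append(tallies[question][category] / respondents)
    return {user: sum(s) / len(s) for user, s in shares.items()}


def certainty(votes, question, rep):
    """Total reputation behind each category chosen for one question."""
    totals = defaultdict(float)
    for user, q, category in votes:
        if q == question:
            totals[category] += rep[user]
    return dict(totals)


# Hypothetical votes on one photo.
votes = [("a", "p1", "muscle car"), ("b", "p1", "muscle car"),
         ("c", "p1", "family sedan")]
rep = reputation(votes)
scores = certainty(votes, "p1", rep)
```

To get the "sum instead of average" variant, return `sum(s)` rather than `sum(s) / len(s)` in `reputation`.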
Some of these ideas may need some refinement, especially since I don't know your exact situation. Certainly, if people can see what other people chose before they vote, it would be way too easy to game the system.
If you were to collect votes like "on a scale from 1 to 10, how would you rate this car", you could probably use simple average and standard deviation: the smaller the standard deviation, the more unanimous the general consensus is among your voters, and you can flag users who are e.g. 3 standard deviations from the average.
For multiple choice, you need to be more careful. Simply discarding all but the most-voted option will do nothing but disgruntle the voters. You need to establish a measure of how significant the winner is w.r.t. the other options, e.g. flag users who voted for options with less than 1/3 of the winning option's count.
Note that I wrote "flag users", not discard votes. If you discard votes, you can't tell how confident you are about the result ("91% voted this to be a Ford Mustang"). If a user has more than a certain percentage of his votes flagged - well, that's up to you.
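For the multiple-choice case, flagging (rather than discarding) could be sketched like this. The 1/3 ratio comes from the example above; the function name and the vote data are made up.

```python
from collections import Counter


def flag_votes(votes, ratio=1 / 3):
    """votes: list of (user, option) for a single photo.

    Returns the users who voted for an option that received fewer than
    ratio * (votes for the winning option). Nothing is discarded.
    """
    if not votes:
        return []
    counts = Counter(option for _, option in votes)
    threshold = ratio * max(counts.values())
    return [user for user, option in votes if counts[option] < threshold]


# Hypothetical tally: 9 x Mustang, 4 x Camaro, 2 x Civic.
votes = [(f"m{i}", "Mustang") for i in range(9)]
votes += [(f"c{i}", "Camaro") for i in range(4)]
votes += [(f"x{i}", "Civic") for i in range(2)]
flagged = flag_votes(votes)
```

Here only the two Civic voters are flagged, since 2 is below a third of 9 while 4 is not. All votes stay in the tally, so the reported percentages remain honest.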
Your trickiest problem, however, will probably be to collect sufficient votes. Depending on how easy the multiple choice problem is, you probably need several times the number of options as votes, per photo. Otherwise the statistics are meaningless.

What should be considered when building a Recommendation Engine?

I've read the book Programming Collective Intelligence and found it fascinating. I'd recently heard about a challenge Amazon had posted to the world to come up with a better recommendation engine for their system.
The winner apparently produced the best algorithm by limiting the amount of information that was being fed to it.
As a first rule of thumb I guess... "More information is not necessarily better when it comes to fuzzy algorithms."
I know it's subjective, but ultimately it's a measurable thing (clicks in response to recommendations).
Since most of us are dealing with the web these days and search can be considered a form of recommendation... I suspect I'm not the only one who'd appreciate other people's ideas on this.
In a nutshell, "What is the best way to build a recommendation engine?"
You don't want to use "overall popularity" unless you have no information about the user. Instead, you want to align this user with similar users and weight accordingly.
This is exactly what Bayesian Inference does. In English, it means adjusting the overall probability you'll like something (the average rating) with ratings from other people who generally vote your way as well.
Another piece of advice, but this time ad hoc: I find that there are people where if they like something I will almost assuredly not like it. I don't know if this effect is real or imagined, but it might be fun to build in a kind of "negative effect" instead of just clumping people by similarity.
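A simple way to "align this user with similar users and weight accordingly" is a similarity-weighted average in the style of user-based collaborative filtering. Note that this is a swapped-in simplification, not a full Bayesian model. Using Pearson correlation as the weight also gives the "negative effect" for free, because users with opposite taste get a negative weight. All names and ratings below are invented.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over items both users rated; 0.0 if undefined."""
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0
    xs, ys = [a[i] for i in common], [b[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def predict(ratings, user, item):
    """User's mean rating plus similarity-weighted deviations of others."""
    own = ratings[user]
    base = sum(own.values()) / len(own)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        weight = pearson(own, theirs)  # negative for opposite taste
        other_mean = sum(theirs.values()) / len(theirs)
        num += weight * (theirs[item] - other_mean)
        den += abs(weight)
    return base + num / den if den else base


ratings = {
    "me": {"a": 5, "b": 1, "c": 4},
    "twin": {"a": 5, "b": 1, "c": 4, "d": 5},
    "opposite": {"a": 1, "b": 5, "c": 2, "d": 1},
}
guess = predict(ratings, "me", "d")
```

In this toy data the "opposite" user disliking item `d` pushes the prediction up just as much as the "twin" liking it.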
Finally there's a company specializing in exactly this called SenseArray. The owner (Ian Clarke of freenet fame) is very approachable. You can use my name if you call him up.
There is an entire research area in computer science devoted to this subject. I'd suggest reading some articles.
Agree with @Ricardo. This question is too broad, like asking "What's the best way to optimize a system?"
One common feature to nearly all existing recommendation engines is that making the final recommendation boils down to multiplying some number of matrices and vectors. For example multiply a matrix containing proximity weights between users by a vector of item ratings.
(Of course you have to be ready for most of your vectors to be super sparse!)
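As a toy version of the matrix-times-vector step, here is a sparse product using nested dicts so that missing entries simply count as zero. The proximity weights and ratings are made up, and a real system would likely use a dedicated sparse-matrix library instead.

```python
def sparse_matvec(matrix, vector):
    """Multiply a sparse matrix by a sparse vector.

    matrix: {row: {col: weight}}, vector: {col: value}.
    Entries that are absent from either structure are treated as zero.
    """
    return {
        row: sum(w * vector[col] for col, w in cols.items() if col in vector)
        for row, cols in matrix.items()
    }


# Hypothetical user-to-user proximity weights.
proximity = {
    "ann": {"bob": 0.9, "cat": 0.1},
    "bob": {"ann": 0.9},
}
# Ratings that users gave to a single item.
item_ratings = {"bob": 4.0, "cat": 2.0}

scores = sparse_matvec(proximity, item_ratings)
```

`scores["ann"]` combines Bob's and Cat's ratings weighted by how close they are to Ann, while Bob gets zero because none of his neighbours rated the item.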
(I design recommendation engines professionally.)
@Lao Tzu, I agree with you.
In my view, recommendation engines are made up of:
Context input fed from context-aware systems (logging all your data)
Logical reasoning to filter the most obvious
Expert systems that improve your subjective data over time based on context inputs, and
Probabilistic reasoning to do close-to-proximity decision-making based on a weighted sum of previous actions (beliefs, desires, and intentions).
I have built such a recommendation engine.
