Users of my application (it's a game actually) answer questions to get points. Questions are supplied by other users. Due to volume, I cannot check everything myself, so I decided to crowd-source the filtering process to the users (players). The rules are simple:
each user is shown a question to rate as good/bad/unsure
when question is rated 5 times as "bad" it is removed from the pool
when question is rated 5 times as "good" it is removed from the poll and flagged to be played by other players who have not seen it
If everyone could see everything, this would be easy. However, later in the game phase, users shouldn't get questions they have already seen. This means that users should not see all the questions, and exactly those they don't see would they get to play (answer) later in the game.
Total number of questions is much larger than number of players, new questions are added daily and new players come all the time, so I cannot just distribute in advance.
I'm looking for some algorithm that would maximize the number of rated playable (i.e. unseen) questions for all players.
I tried to google, but I'm not even sure which terms to put in the search box, and using stuff like "distribution", "voting", "collaborative filtering" gives very interesting but unusable results.
Ratio of good vs bad questions is 1:3, ie. 25% of questions are rated good. Number of already submitted unrated questions is over 10000. Number of active users with privilege to vote is around 150.
I'm currently considering splitting the question pool and user base into 2 parts. One part of the user base would check the question for the other part and vice versa. Splitting the questions is easy (odd vs even for example). However, I'm still not sure how to divide the user base. I thought about using odd/even position in "top question checkers" list, however the positions on list changes daily as new questions are checked.
Update: I just asked a sequel to this question - I need to periodically remove a fixed number of questions from the pool.
I'm unaware if there is a specific, well known algorithm for this. However this would be my line of thinking:
"maximize the number of rated playable (i.e. unseen) questions for all players" means both maximising the number of questions with +5 and the number of not-seen questions from each player.
Whatever the algorithm will be, its effectiveness is tied to both the quality of the questions submitted by the contributors and the willingness to rate by other players (unless you force them to rate questions).
The goal of your system should not be that of making all players to have the same amount of "unseen questions" [this is in fact irrelevant], but rather that of always having for each player a "reserve" of unseen questions that allows him/her to play at its normal gamerate. For example: say you have two users A and B who play regularly on your site. A normally answers 80 quizzes per day, while B only 40. If your system in average get 100 new approved questions daily, in principle you would like player A to never see more than 20 of those every day, while player B could safely see 60 of them.
The ratio between submitted question and approved question is also important: if every second submitted question is not good, then users A and B from above could rate 40 and 120 questions daily.
So my approach to the final algorithm would be:
Keep track of the following:
Number of submitted new question per day (F = Flow)
Ratio between good/total submitted questions (Q = quality)
Number of questions used (for playing, not for rating) by each player per day (GR = Game Rate)
Number of questions rated by each player on a given day (RC = Review Counter)
Establish a priority queue of questions to be rated. The goal here is to have approved questions as fast as possible. Give a bonus priority to both:
questions that have collected upvotes already
questions submitted by users who have a history of other questions having already been accepted.
When a player is involved in rating, show him/her the first question in the queue.
Repeat step 3 as much as you want making sure this condition is never met: Q * (F - RC) < GR
[The above can be further tweaked, considering the fact that when the user first register, there will be already a pool of approved but unseen questions on the site]
Of course you can heavily influence the behaviour of users by giving incentives for meritorious activity (badges and reputation points on SO are a self-explanatory example).
EDIT/ADDENDUM: The discussion in the comments clarify that the GR is fixed, and it is one question per day. Furthermore, the OP states that there will be at least 1 new approved question in the system every 24 hours. This means that it is possible to simplify the above algorithm in one of the two forms:
If the user can vote only AFTER he answered his daily question:
If there is at least one approved, unseen question in the system, let the user vote at will.
If the user can vote even BEFORE answering his daily question:
If there are at least two approved, unseen questions in the system, let the user vote at will.
This is such that if a user is voting all votable questions on the system and then answers his daily one at 23:59, there will still be a question available to be answered at 00:00, plus 24h time for the system to acquire a new question for the following day.
HTH!
Related
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I am planning to log all user actions like viewed page, tag etc.
What would be a good lean solution to data-mine this data to get recommendations?
Say like:
Figure all the interests from the viewed URL (assuming I know the
associated tags)
Find out people who have similar interests. E.g. John & Jane
viewed URLS related to cars etc
Edit:
It’s really my lack of knowledge in this domain that’s a limiting factor to get started.
Let me rephrase.
Lets say a site like stackoverflow or Quora. All my browsing history going through different questions are recorded and Quora does a data mining job of looking through it and populating my stream with related questions. I go through questions relating to parenting and the next time I login I see streams of questions about parenting. Ditto with Amazon shopping. I browse watches & mixers and two days later they send me a mail of related shopping items that I am interested.
My question is, how do they efficiently store these data and then data mine it to show the next relevant set of data.
Datamining is a method that needs really enormous amounts of space for storage and also enormous amounts of computing power.
I give you an example:
Imagine, you are the boss of a big chain of supermarkets like Wal-Mart, and you want to find out how to place your products in your market so that consumers spend lots of money when they enter your shops.
First of all, you need an idea. Your idea is to find products of different product-groups that are often bought together. If you have such a pair of products, you should place those products as far away as possible. If a customer wants to buy both, he/she has to walk through your whole shop and on this way you place other products that might fit well to one of that pair, but are not sold as often. Some of the customers will see this product and buy it, and the revenue of this additional product is the revenue of your datamining-process.
So you need lots of data. You have to store all data that you get from all buyings of all your customers in all your shops. When a person buys a bottle of milk, a sausage and some bread, then you need to store what goods have been sold, in what amount, and the price. Every buying needs its own ID if you want to get noticed that the milk and the sausage have been bought together.
So you have a huge amount of data of buyings. And you have a lot of different products. Let’s say, you are selling 10.000 different products in your shops. Every product can be paired with every other. This makes 10,000 * 10,000 / 2 = 50,000,000 (50 Million) pairs. And for each of this possible pairs you have to find out, if it is contained in a buying. But maybe you think that you have different customers at a Saturday afternoon than at a Wednesday late morning. So you have to store the time of buying too. Maybee you define 20 time slices along a week. This makes 50M * 20 = 1 billion records. And because people in Memphis might buy different things than people in Beverly Hills, you need the place too in your data. Lets say, you define 50 regions, so you get 50 billion records in your database.
And then you process all your data. If a customer did buy 20 products in one buying, you have 20 * 19 / 2 = 190 pairs. For each of this pair you increase the counter for the time and the place of this buying in your database. But by what should you increase the counter? Just by 1? Or by the amount of the bought products? But you have a pair of two products. Should you take the sum of both? Or the maximum? Better you use more than one counter to be able to count it in all ways you can think of.
And you have to do something else: Customers buy much more milk and bread then champagne and caviar. So if they choose arbitrary products, of course the pair milk-bread has a higher count than the pair champagne-caviar. So when you analyze your data, you must take care of some of those effects too.
Then, when you have done this all you do your datamining-query. You select the pair with the highest ratio of factual count against estimated count. You select it from a database-table with many billion records. This might need some hours to process. So think carefully if your query is really what you want to know before you submit your query!
You might find out that in rural environment people on a Saturday afternoon buy much more beer together with diapers than you did expect. So you just have to place beer at one end of the shop and diapers on the other end, and this makes lots of people walking through your whole shop where they see (and hopefully buy) many other things they wouldn't have seen (and bought) if beer and diapers was placed close together.
And remember: the costs of your datamining-process are covered only by the additional bargains of your customers!
conclusion:
You must store pairs, triples of even bigger tuples of items which will need a lot of space. Because you don't know what you will find at the end, you have to store every possible combination!
You must count those tuples
You must compare counted values with estimated values
Store each transaction as a vector of tags (i.e. visited pages containing these tags). Then do association analysis (i can recommend Weka) on this data to find associations using the "Associate" algorithms available. Effectiveness depends on a lot of different things of course.
One thing that a guy at my uni told me was that often you can simply create a vector of all the products that one person has bought and compare this with other peoples vectors and get decent recommendations. That is represent users as the products they buy or the pages they visit and do e.g. Jaccard similarity calculations. If the "people" are similar then look at products they bought that this person didn't. (Probably those that are the most common in the population of similar people)
Storage is a whole different ballgame, there are many good indices for vector data such as KD trees implemented in different RDBMs.
Take a course in datamining :) or just read one of the excellent textbooks available (I have read Introduction to data mining by Pang-Ning tan et al and its good.)
And regarding storing all the pairs of products etc, of course this is not done and more efficient algorithms based on support and confidence are used to prune the search space.
I should say recommendation is machine learning issue.
how to store the datas depends on which algorithm you chose.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
In the social network movie i saw Mark used Elo rating system
But was Elo rating system necessary ?
can anyone tell me what was the advantage using elo's rating system ?
Can the problem be solved in this way too ?
is there any problem in this algo [written below] ?
Table Structure
Name Name of the woman
Pic_Name [pk] Path to the picture
Impressions Number, the images was shown
Votes Number, people selected as hot
Now we show randomly 2 photos from the database and the hottest woman is selected by Maximum number of Votes
Before voting close/down please write your reason
But was that necessary?
No, there are several different ways of implement such system.
Can anyone tell me what was the advantage using elo's rating system ?
The main advantage and the central idea in Elo's system is that if someone with low rating wins over someone with high rating their ratings are updated by a larger number, than if the two had similar rating to start with. This means that the ratings will converge fairly quickly.
I don't really see how your approach is a good one. First of all it seems like it depends on how often a pic is randomly selected for potential upvoting. Even if you showed all pics equally many times, the property described above doesn't hold. I.e, if some one wins over a really hot girl, she would still get only a single upvote. This means that your approach wouldn't converge as quickly as Elo's system. In fact, the approach you propose doesn't converge to some steady rating-values at all.
Simply counting the number of votes and ranking women by that is not adequate and I can think of two reasons why:
What if a woman is average-looking but by luck her picture get displayed more often? Then she would get more votes and her ranking would rise inappropriately.
What if a woman is average-looking but by luck your website would always compare her to ugly women? The she would get more votes and her ranking would rise inappropriately.
I don't know much about the Elo rating system but it probably doesn't suffer from problems like this.
It's a movie about geeks. Elo is a geeky way to rate competitors on the basis of the results of pairwise contests between them. Its association with chess adds extra geekiness. It's precisely the kind of thing that geeks in movies should be doing.
It may have happened that exactly way in real life too, in which case Zuckerberg probably chose Elo because it's a well-known algorithm for doing this, which has been used in practice in several sports. Why go to the effort of inventing a worse algorithm?
We have question game with Yes or No type of answers. There are multiple teams participating and every team have different number of players. Each player answers to questions. Players can join the game after few of questions has been ended. How to count fairly the all score for the team so we can rank a team?
I would just use the number of correct answers.
First things first: If you have more than one statistic, you have more than one metric. I see an almost infinite number of ranking possibilities. Here's the ones that jump out at me:
Use the average correct answer percentage for the players on the team.
If you have tournaments, have a ranking for tournament win percentage. (You could also use a chess-style ranking to determine the ranking of tournaments.)
Track how many people get a question right wrong. A player's score for getting a question right is (1 - q) where q is the % of people that got that question right. If you get it wrong, you also lose q points. This actually means as other people answer questions, your score may go up or down (which it should, since the purpose is to make it relative to the other players.)
i'll edit more in as I think of them (if I think of them). I really like option 3 though!
Take an example of a question/answer site with a 'browse' slideshow that will show one question/answer page at a time. The user clicks the 'next' button and a new question/answer is presented to him.
I need to decide which pages should be returned each time the user clicks 'next'. Some things I don't want and reasons why:
Showing 'newest' questions in descending order:
Say 100 questions get entered, then no user is going to click thru to the 100th item and it'll never get any responses. It also means if no new questions were asked recently, every time the user visits the site, he'll see the same repeated stale data.
Showing 'most active' questions, judged by a lot of suggested answers/comments:
This won't return those questions that have low activity, which are exactly the ones that need more visibility
Showing 'low activity' questions, judged by not a lot of answers/comments:
Once a question starts getting activity, it'll stop being shown. This will stymie the activity on a question, when I'd really like to encourage discussion.
I feel that a mix of these would work well, but I'm unsure of how to judge which pages should be returned. I'll stress that I don't want the user to have to choose which category of items to view (like how SO has the unanswered/active/newest filters).
Are there any common practices for doing this, or any ideas for how it might be done?
Thanks!
Edit:
Here's what I'm leaning towards so far, with much thanks to Tim's comment:
So far I'm thinking of ranking pages by Activity Count / View Count, where activity is incremented each time a user performs an action on a page, like a vote, comment, answer, etc. View will get incremented for each page every time a person views the page.
I'll then rank all pages by their activity/view ratio and show pages with a high ratio more often. This way pages with low activity and high views will be shown the least, while ones with high activity and low views will be shown most frequently. Low activity/low views and high activity/high views will be somewhere in the middle I imagine, but I'll have to keep a close eye on this in the beta release. I also plan on storing which pages the user has viewed in the past 24 hours so they won't see any repeats in the slideshow in a given day.
Some ideas for preventing 'stale' data (if all the above doesn't seem to prevent it): Perhaps run a cron job which will periodically check for pages that haven't been viewed recently and boost their ratio to put them at the top.
As I see it, you are touching upon two interesting questions:
How to define that a post is interesting to a user: Here you could take a weighted combination of various factors that could contribute to interestingness of a post. Amount of activity, how fresh the entry is, if you have a way of knowing that the item matches users interest etc etc. You could pick the weights based on intuition and see how well the result matches your expectation. If you have the time and inclination, you could collect data on how well your users respond to the entries and try to learn the optimum weights for each factor using machine learning techniques.
How to give new posts a chance, otherwise known as exploration-exploitation tradeoff.
BAsically, if you just keep going to known interesting entries then you will maximize instantaneous user happiness, but you will never learn about new interesting stuff hence, overall your users are unhappy.
This is a very well studies problem, and depending upon how much you want to get into it, you can read up literature on things like k-armed bandit problems.
But a simple solution would be to not pick the entry with the highest score, but pick the entry based on a probability distribution such that high score entries have higher probability of showing up. This way most of the times you show interesting stuff, but every post has a chance to show up occasionally.
I'm working on a web application which will be used for classifying photos of automobiles. The users will be presented with photos of various vehicles, and will be asked to answer a series of questions about what they see. The results will be recorded to a database, averaged, and displayed.
I'm looking for algorithms to help me identify users which frequently don't vote with the group, indicating that they're probably either not paying attention to the photos, or that they're lying about what they see. I then want to exclude these users, and recalculate the results, such that I can say, with a known amount of confidence, that this particular photo shows a vehicle that is this and that.
This question goes out to all you computer science guys, where to find such algorithms or to give myself the theoretical background to design such algorithms. I'm assuming I'm going to have to learn some probability and statics, maybe some data mining. Some book recommendations would be great. Thanks!
P.S. These are multiple choice questions.
All of these are good suggestions. Thank you! I wish there was a way on stack overflow to select multiple correct answers so more of you could be acknowledged for your contributions!!
Read The Elements of Statistical Learning, it is a great compendium on data mining.
You can be interested especially in unsupervised algorithms, for example clustering. Assuming that most people do not lie, the biggest cluster is right and the rest is wrong. Mark people accordingly, then apply some bayesian statistics and you'll be done.
Of course, most data mining technologies are pretty experimentative, so don't count on that they will be always right... or even in most cases.
I believe what you described is solved using outlier/anomaly detection.
A number of techniques exist:
statistical-based methods
distance-based methods
model-based methods
I suggest you take a look at these slides from the excellent book Introduction to Data Mining
If you know what answers you are expecting why do you ask people to vote? By excluding some values you basically turn the vote in something that you like. Automobiles make different impression to different individuals. If 100 ppl loved a car then when someone comes and says that he/she doesn't like it, you exclude the vote?
But anyway, considering that you still want to do this, first of all you will need a large set o data from "trusted" voters. This will give you an idea of "good" answer and from this point you can choose the exclude threshold.
Without an initial set of data you cannot apply any algorithm because you will get false results. Consider just one vote of 100 from on a scale from 0 to 100. The second vote is "1" The you will exclude this vote because is too far away from the average.
I think a pretty simple algorithm could accomplish this for you. You could try and get fancier by calculating the standard deviations and such, but I wouldn't bother.
Here's a simple approach that should be sufficient:
For each of your users, calculate the number of questions they answered and the number of times they selected the most popular answer for the question. The users which have the lowest ratio of picking the popular answer versus total answers you can guess are providing bogus data.
You probably would not want to throw out the data from users where they've only answered a small number of questions because they likely have just disagreed on a few versus putting in bogus data.
What kind of questions are they (Yes/No, or 1 to 10?).
You may be able to get away with not discarding anything by using a mean instead of an average. With averages if there are extreme outliers in the response it could affect the average, but if you use median you may get a better answer. So for example if you had 5 answers, order them and pick the middle one.
I think what you are saying is that you are concerned that certain people are "outliers", and they are adding noise to your data, making the categorizations less reliable. So, if you have a Chevy Camaro, and most people say it is either a pony car, a muscle car, or a sports car, but you have some goofball who says it's a family sedan, you would want to minimize the impact of his vote.
One thing you could do is provide a Stack Overflow-like reputation score for users:
The more a user is "in agreement" with other users, the better his or her score would be. For a given user (User X), this could be determined by a simple calculation of what percentage of users who responded to a question chose the same category as User X, then averaging this value over all questions answered.
You may want to to multiply this value by the total number of question answered to encourage people to answer as many questions as possible. (Note: if you choose to do this, it would be equivalent to just summing the percentage agreement scores rather than averaging them.)
You could present the final reputation score to users, making sure to explain that they will be rewarded for how well their responses agree with those of other users. This will encourage people to answer more questions but also to take care in their answers.
Finally, you could calculate a certainty score for a given categorization by adding up the total reputation score of all people who chose a given category.
Some of these ideas may need some refinement, especially since I don't know your exact situation. Certainly, if people can see what other people chose before they vote, it would be way too easy to game the system.
If you were to collect votes like "on a scale from 1 to 10, how would you rate this car", you could probably use simple average and standard deviation: the smaller the standard deviation, the more unanimous the general consensus is among your voters, and you can flag users who are e.g. 3 standard devs from the average.
For multiple choice, you need to be more careful. Simply discarding all but the most-voted option will do nothing but disgruntle the voters. You need to establish a measure of how significant the winner is w.r.t. the other options, e.g. flag users who voted for options with less than 1/3 of the winning options count.
Note that I wrote "flag users", not discard votes. If you discard votes, you can't tell how confident you are about the result ("91% voted this to be a Ford Mustang"). If a user has more than a certain percentage of his votes flagged - well, that's up to you.
Your trickiest problem, however, will probably be to collect sufficient votes. Depending on how easy the multiple choice problem is, you probably need several times the number of options as votes, per photo. Otherwise the statistics are meaningless.