Eliminating deviants while computing score - algorithm

I have a relatively simple algorithmic problem where I recommend questions to users.
I have a set of questions with answers (likes and comments for each answer).
I want to score how engaging each question is.
Current implementation:
(total comments + likes for all answers for a question) / sqrt(number of answers)
Problems:
Sometimes, one answer that has a tonne of activity skews the score for the question, even if the other 20 answers generate very little interest
Some reduction should be applied for questions with very few answers.
Would appreciate any suggestions on how these 2 problems can be mitigated.

Usually, when we want to keep one sample from being too dominant, the standard way to do it is one of these:
Use log(N) instead of N, making the effect of each observation less powerful.(1)
Leave the "strange" observations out: take only the middle X% and use them. For example, keep only the answers whose like counts fall between 1/4 and 3/4 of the maximum for that question, and leave the skewing examples out.
For the second issue, one thing I can think of is a varying factor: instead of using sqrt(number_of_answers), you can try (number_of_answers)^(log(number_of_answers+1)/log(max_answers+1)), where max_answers is the maximal number of answers per question in your data set.
It will result in boosting up questions with few answers, which I think is what you are after.
(1): We usually take log(N+1) - so it will be defined for N==0 as well.
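For the varying divisor, a small sketch in Python (max_answers is simply the largest answer count in your data set; the zero-answer guard is my own assumption, not part of the formula above):

import math

def adjusted_divisor(number_of_answers, max_answers):
    """Replacement for sqrt(number_of_answers); the exponent shrinks for
    questions with few answers so they are penalized less."""
    if number_of_answers == 0:
        return 1.0  # assumption: avoid 0**0 for questions with no answers yet
    exponent = math.log(number_of_answers + 1) / math.log(max_answers + 1)
    return number_of_answers ** exponent

# With max_answers=100: 2 answers -> divisor ~1.2 (vs sqrt(2) ~1.41),
# so questions with few answers score relatively higher.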

Related

What is Youtube comment system sorting / ranking algorithm? [closed]

Youtube provides two sorting options: Newest first and Top comments. "Newest first" is pretty simple: the comments are just sorted by post date. But "Top comments" seems to be a lot more complex than just sorting by thumbs up.
After a bit of research, I found that the order of comments depends on these things:
Number of "thumb up"s and "thumb down"s
Post date
Number of replies to that comment
But I don't know how Youtube uses this information to decide the order, like what information is more important and what is less important.
Is there any article about this topic that I could refer to?
Thanks!
I have the answer to your question.
After searching the internet for the answer to this, I never found precisely what I was looking for. So, my colleagues and I decided to experiment using the system with the Youtube comments.
First of all, we sorted what we believed to be popular videos into one section, average videos into another, and less popular ones into the last. There were 200 videos in each section, and after days of examining them we started to notice a pattern. We found that you were right about the three things required, but we also dove a little deeper and found an additional variable.
The Youtube comment system depends on four things:
1) Time it was posted,
2) Like/dislike ratio of a comment,
3) Number of replies,
4) And, believe it or not, WHO posted it.
The average like/dislike ratio of every public comment you've ever posted feeds into it; our guess is that they assume users with low like/dislike ratios tend to post comments that many people dislike or simply disagree with.
There is an algorithm to it, and it is simpler than you might think. Basically there are these things we called "module points," and a comment earns a certain number of them based on these four factors. First, here's what you need to know about module point conversion for TWO of the factors:
For the like/dislike ratio on the comment, multiply that number by ten.
For the number of replies (NOT from the original poster) that the comment has, each reply is worth two module points.
These are the two basic factors that determine the number of module points the comment has.
For example, if a comment had 27 likes and 8 dislikes, the ratio would be 3.375. Multiplying by 10, you would then have 33.75 module points. Using the next factor, the number of replies, let's say this comment has 4 direct replies to it. Multiplying 2 by 4, we get 8. Add that 8 to the accumulated module points, giving you a total of 41.75 module points.
But we're not done here; this is where it gets tricky.
Using the average like/dislike ratio of all the comments a person has ever posted publicly, we found that the formula applied to the accumulated module points is this:
C = MP(R/3) + (MP/10)
where C = Comment Position Variable; MP = Module Points; R = Person's total like/dislike ratio
Trust me, we spent DAYS just on this part, which was probably the most frustrating. Even though the 3 and the 10 in this equation seem random and unnecessary, so far every comment we tested this equation on passed the test, but failed when those two constants were removed. After this equation is done, it gives you a number that we named the Position Variable.
However, we are not even done yet, we still haven't talked about time.
I was actually quite surprised that this part didn't take as long as I expected, but it sure was a pain doing this equation every single time for every comment we tested. At first, when testing it, we figured that the time was just there to break ties when 2 comments had equal Position Variables.
In fact, I almost called it a wrap on the experiment when this happened, but upon further inspection we found there was more to do. Some comments with the same Position Variable outranked each other, and the timing seemed random! After a few days of inspection, here is where the final result comes in:
There is yet ANOTHER equation that we must find before applying the 4th variable. Using another separate equation, here's what our algebraic deductions came down to:
X = (1/3)(S/10 + A) * |A - 3S|
where X = Timing Variable; S = How long ago the video was posted in minutes; A = How long ago the comment was posted in minutes
I wish I were making this up, but unfortunately this is how complicated the system is. There are mathematical reasons behind the other variables, but they are far too complex to explain here; it would probably take at least three paragraphs. We tested this equation on more than 150 comments, and all of them checked out.
Once you find X, which is what we called the Timing Variable, all you have to do from here is apply it to this equation:
N = X(C/4 + 1)
where X = Timing Variable; C = Position Variable
N is the answer to all your problems.
This is the final equation, the final answer. The simple conclusion: the higher N, the higher up the comment is.
Note: Special thanks to my colleagues: David Mattison, Josh Williams, Diego Mendieta, Steven Orsette, and Kyle Shropshire. I could have never found out this without them and the work they put into this.
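Putting the four factors together, here is a small Python transcription of the formulas above. The constants are exactly the ones claimed in this answer, not anything from an official YouTube source, and the handling of zero dislikes is my own assumption:

def comment_rank_value(likes, dislikes, replies, author_ratio,
                       video_age_min, comment_age_min):
    """Return N, the ranking value described above (higher N ranks higher).
    author_ratio is the poster's overall like/dislike ratio; ages are in minutes."""
    ratio = likes / dislikes if dislikes else float(likes)  # assumption for 0 dislikes
    mp = ratio * 10 + 2 * replies                           # module points
    c = mp * (author_ratio / 3) + mp / 10                   # Position Variable
    s, a = video_age_min, comment_age_min
    x = (1 / 3) * (s / 10 + a) * abs(a - 3 * s)             # Timing Variable
    return x * (c / 4 + 1)                                  # N

# Worked example from above: 27 likes, 8 dislikes, 4 replies -> 41.75 module points
print(comment_rank_value(27, 8, 4, author_ratio=3.0,
                         video_age_min=600, comment_age_min=30))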

Ranking from pairwise comparisons [duplicate]

Duplicate of: How to rank a million images with a crowdsourced sort
Imagine I have a very long list of images and I want to rank them in order of how 'good' people think they are.
I don't want to get users to assign scores to images outright (1 - 10 etc) and order by that, I'd like to try something new.
What I was thinking would be an interesting way of doing it is:
Show a user two random images, they pick the better one
Collect lots of 'comparisons'
Use all the comparisons to come up with some ordering
It turns out this is used regularly; for example (using features rather than images), this appears to be the way Uservoice's Smartvote works.
My question is whether there's a good, known way to take this long list of comparisons and build a relative ranking for all the images from them, without the level of complexity found in the research papers.
I've read a bunch of lectures and research papers but I was wondering if there was any sample code out there people might recommend?
Seems like you could use some kind of numerical rating system and then sort based on that. Borrow the algorithm from a win/loss sport, or chess, and treat each image comparison as a bout.
Did some looking, here's some sample code of what an algorithm like that looks like in Java
And here's a library you can borrow in python
If you search for ELO you'll find a version of it in just about any language. Once you have your numerical image ratings, you can sort them any way you like. There are probably other rating algorithms you could look into for win/loss competition; that was just the first that came up when I googled chess ranking.
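For reference, a minimal Elo-style sketch in Python (the K-factor of 32 and the 1500 starting rating are conventional defaults, not something taken from the linked code):

K = 32  # conventional K-factor

def expected(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_bout(ratings, winner, loser):
    """Treat one image comparison as a bout: winner scores 1, loser scores 0."""
    ra = ratings.get(winner, 1500.0)
    rb = ratings.get(loser, 1500.0)
    ea = expected(ra, rb)
    ratings[winner] = ra + K * (1 - ea)
    ratings[loser] = rb + K * (0 - (1 - ea))

ratings = {}
for winner, loser in [("img2", "img1"), ("img3", "img1"), ("img2", "img3")]:
    record_bout(ratings, winner, loser)
print(sorted(ratings, key=ratings.get, reverse=True))  # highest-rated first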
For every image, count the number of times it won a duel, and divide by the number of duels it took part in. This ratio is your ranking score.
Example:
Five duels: A-B, A-C, A-D, B-C, B-D, where (say) the winners are A, C, D, B and B respectively.
Yields
B: 67%, C: 50%, D: 50%, A: 33%
Unless you perform a huge number of comparisons, there will be many ties.
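A sketch of that tally in Python, with duels given as (winner, loser) pairs (an assumed input format):

from collections import Counter

def win_ratios(duels):
    """duels: iterable of (winner, loser) pairs; returns win ratio per image."""
    wins, played = Counter(), Counter()
    for winner, loser in duels:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
    return {img: wins[img] / played[img] for img in played}

# The example above (A beats B, C beats A, D beats A, B beats C, B beats D):
print(win_ratios([("A", "B"), ("C", "A"), ("D", "A"), ("B", "C"), ("B", "D")]))
# {'A': 0.33, 'B': 0.67, 'C': 0.5, 'D': 0.5} (rounded)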

Logarithmic growth [duplicate]

Possible Duplicate:
Plain English explanation of Big O
The Wikipedia article on logarithmic growth is a stub. Many of the answers I read on Stack Overflow describe how efficient a process or function is using a logarithmic expression written with a 0 (I assume [see below] it is a 0 [zero] and not an O [the letter, as in M, N, O, P, Q], but please correct my assumption if it is wrong) and an n or N.
Can someone explain these logarithmic notations, as they are commonly used for computation, more concretely? Maybe in terms of time in seconds (milliseconds are also welcome; I'm just trying to conceptualize it in real-life time differences), in terms of size, and/or in terms of weight?
I have seen the following frequently: (please feel free to include other ones as well)
O(1)
O(N)
My assumption is based on the fact that a 0 [zero] outside of a code block does not have a slash through it, while inside a code block a 0 does have a slash through it.
What this means is that the execution time (or other resource) is some function of the amount of data. Let's say it takes you 5 minutes to blow up 10 balloons. If the function is O(1), then blowing up 50 balloons also takes 5 minutes. If it is O(n), then blowing up 50 balloons will take 25 minutes.
O(log n) means that as things scale up, handling larger n becomes relatively easier (the cost follows logarithmic growth). Let's say you want to find an address in a directory of n items. Suppose it takes 6.64 seconds to search a directory of 100 entries [6.64 = log_2(100)]. Then it might only take 7.64 seconds to search 200 entries. That is logarithmic growth.
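To make the directory example concrete, here is a small Python illustration using binary search, whose worst-case step count grows like log2(n), so doubling the directory adds roughly one step:

import math

def binary_search_steps(sorted_items, target):
    """Count the halving steps binary search needs to locate target."""
    lo, hi, steps = 0, len(sorted_items) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        elif sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

for n in (100, 200, 400, 100000):
    print(n, binary_search_steps(list(range(n)), n - 1), round(math.log2(n), 2))
# 100 -> 7 steps, 200 -> 8 steps, 400 -> 9 steps, 100000 -> 17 steps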

Find most recent & closest posts, limit 20

I saw a question here recently and bookmarked it for further thought. This is the question. What I can't decide is whether this question is really interesting or nothing special.
The reason is that it looked to me like it had a really simple answer: sort by the lowest distance*time product. Or am I missing something obvious?
I can explain the reason why it looked simple to me:
Distance is always somewhat constant no matter when or where the query is run, meaning that if my home is at point A and there is a post at point B and another post at point C, no matter when I run the query I will always get the same values, say 5km & 7km.
The time offset since the post also looks somewhat constant, in the sense that it grows equally for all posts. Meaning that if post B is from 2004 and post C is from 2009, they are now 7 years and 2 years old respectively. So next year it will be 8 and 3 years, and so on.
Adding weight value(s) to 'tweak' the distance & time is not helpful (not needed), since (taking the values from the two posts above) 5*7*alpha will always be more than 2*7*alpha, hence no matter when we run the query, post C (2*7*alpha) will always be the 'closest most recent'.
Also, adding a weight constant to 'tweak' the results seems like it will no longer produce the closest and most recent, but will favor one or the other, in which case I may as well sort by most recent and then by closest, or vice versa. But then it is no longer 'closest and most recent' but either 'closest, then most recent' or 'most recent, then closest', and both of those are trivial, I believe. So this is why I think tweaking is not a good idea, no matter what units are chosen to represent the time offset and distance.
Addition doesn't work as well as multiplication, I think, but distance*time seems to be sufficient to always get the correct result.
So this is what I was thinking but then I thought, no that can't be that simple. So what am I missing here?
The best way to determine the desired sorting expression would be to let some human beings sort some items manually and deduce the expressions from their answers. It may well be that different persons would give different answers, so that one single expression can't accommodate everyone.
There are other useful polynomial expressions such as t*d + A*t + B*d, where t and d are time and distance. Maybe more precise results can be achieved if we introduce one more polynomial degree, so that the expression becomes t*d + A*t*t + B*d*d + C*t + D*d. Only from the answers of real humans can you devise such a formula.
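A small sketch of that scoring in Python. The post dict shape and the rough distance helper are assumptions, and the coefficients A and B are placeholders that, as the answer says, would have to be fitted to human judgments:

import math

def distance_km(a, b):
    """Rough distance between (lat, lon) pairs in km (equirectangular approximation)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6371.0 * math.hypot(x, y)

def post_score(post, now, here, A=0.0, B=0.0):
    """Lower is better: plain distance*time when A = B = 0, otherwise the
    polynomial t*d + A*t + B*d discussed above."""
    t = (now - post["posted_at"]).total_seconds() / 3600.0  # age in hours
    d = distance_km(here, (post["lat"], post["lon"]))
    return t * d + A * t + B * d

# Most recent & closest, limit 20:
# top20 = sorted(posts, key=lambda p: post_score(p, now, here))[:20]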

How would I find a book in a large library?

I found the following question while preparing for an interview:
You are in a very huge library that has no computer access, and you're looking for one particular book. You look up where the book is supposed to be in the card catalog, and go to shelf X to find it. However, the book is not there.
There is only one person who can answer questions, the librarian, but he only gives yes/no responses. Plus, his answers might not be correct.
What is your strategy for finding this book?
How would you answer this question? What methods of searching would you use?
Use Binary search type questions to narrow the location of the book.
Each question should narrow the search field by half.
"Is the book on this half of the library"? (Point to the right direction).
Would work as an initial question.
You can also use The Knight and the Knave as part of your method of questioning the person. Your first 5 questions (to establish a baseline) could be about things you 'know'. You could determine his error rate from there. After that, you can use Binary Search-esque questions to determine where the book is.
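A sketch of that combined approach in Python: first estimate the librarian's error rate on questions whose answers you already know, then binary-search the shelves, repeating each question and taking the majority answer. The shelf-index model, the question encoding, and the repeat count are assumptions for illustration:

def error_rate(ask, known_questions):
    """Baseline: fraction of known-answer yes/no questions the librarian gets wrong."""
    wrong = sum(bool(ask(q)) != truth for q, truth in known_questions)
    return wrong / len(known_questions)

def majority_ask(ask, question, repeats=9):
    """Ask the same yes/no question several times and take the majority,
    to smooth over occasional wrong answers (repeats should be odd)."""
    yes = sum(bool(ask(question)) for _ in range(repeats))
    return yes > repeats // 2

def find_shelf(ask, n_shelves, repeats=9):
    """Binary search over shelf indices 0..n_shelves-1 using noisy yes/no answers."""
    lo, hi = 0, n_shelves - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # "Is the book somewhere in shelves lo..mid?" (the left half)
        if majority_ask(ask, ("in_shelves", lo, mid), repeats):
            hi = mid
        else:
            lo = mid + 1
    return lo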
Ask the interviewer for more information about the librarian and go from there. In particular, find out if he's susceptible to bribery (I mean the librarian, but come to think of it this might go for the interviewer as well).
Double-check for dumb mistakes (wrong card, wrong shelf, "661-88" is really "88-199", and so on).
Search the drawer of borrowed-book cards. If it's been borrowed, note the due date and come back later, or note the borrower's home address and go to plan B.
Look in the vicinity, a few books in either direction and the shelves above and below, in case it was incorrectly reshelved.
Check the tables, floors, photocopiers and return carts.
Look for a gap on the shelf. If there is a gap in the right spot then at least you know you're looking in the right place. If there's no gap then look for a book on that shelf that doesn't belong-- somebody may have swapped them by mistake. If there's no such misplaced book then maybe the book was never on this shelf, see below.
Look for dust on the shelf. It might indicate whether a book has been removed within the past month. Likewise check the index card for signs of age. The flowchart gets a little complicated, but the book may have been lost years ago.
Check the index system: if the book doesn't have the right number for its subject/title/author/whatever, then there is a typo on the index card and you must calculate the correct number yourself to find out where the book really is.
Just go out and buy the damned book, your time is more valuable than this.
Step A: Calibrate your Librarian.
Pick a random book in the library, walk to a random spot and then ask the Librarian if the book (whose location you know) is to your left. Keep testing the Librarian until you have a good estimate of the probability, p, that Librarian answers correctly. Note that if p < 0.5 then you are better off following the opposite of whatever Librarian tells you. If p=0.5 then give up on Librarian -- her responses are no better than a flip of a coin.
If you find that p depends on the question asked (for example, if the Librarian always answers certain questions correctly, but other questions always falsely), then go to Step B1.
Step B1:
If p==0.5 or p depends on the question asked, start thinking outside the box, like Beta suggests.
Step B2:
If p < 0.5, reverse the answer the Librarian gives, and proceed to Step B3.
Step B3:
If p > 0.5: Choose N. If p is close to 1, then N can be a low number like 10. If p is very close to 0.5, then choose N large, like 1000. The right value of N depends on p and how confident you wish to be.
Ask the Librarian the same question N times ("Is the book I'm looking for to my left").
Assume for the moment that whatever response is given more frequently is the "correct answer". Calculate the average response, assigning 1 for the "correct answer" and 0 for the wrong answer. Call this the "observed average".
The responses are like draws from a box with 2 tickets (the right answer and the wrong answer.) The standard deviation of a sample of N draws will be sqrt(pq), where q = 1-p.
The standard error of the average is sqrt(pq/N).
Take the null hypothesis to be that p=0.5 -- that the Librarian is simply giving random responses. The "expected average" (assuming the null hypothesis) is 1/2.
The z-statistic is
(observed average - expected average) / (standard error of the average) = (observed average - 0.5) * sqrt(N) / sqrt(p*q)
The z-statistic follows a normal distribution. If the z-statistic is greater than 1.65, then you can be roughly 95% confident that the Librarian's average response is statistically significant and not just chance. If after N questions z is less than 1.65, repeat Step B3 until you get a statistically significant response. Note that the larger you choose N, the larger the z-statistic will be, and the easier it will be to obtain statistically significant results.
Step C:
Once you get a statistically significant response, you act upon it (using George Stocker's binary search idea) and hope you have not been statistically unlucky. :)
PS. Although the library might be 3-dimensional, you could play the binary search game along the x-axis, then the y-axis, then the z-axis. So the 3-dimensional problem can be reduced to solving 3 one-dimensional problems.
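A simulation of Step B3 in Python, assuming Step A has already produced an estimate of p (the librarian here is simulated, which is obviously an assumption made only so the sketch runs on its own):

import math, random

def librarian(truth, p):
    """Simulated librarian: answers correctly with probability p."""
    return truth if random.random() < p else not truth

def test_majority(truth, p, n, z_threshold=1.65):
    """Ask the same yes/no question n times and compute the z-statistic
    against the null hypothesis that the answers are coin flips."""
    answers = [librarian(truth, p) for _ in range(n)]
    majority = sum(answers) > n / 2
    observed_avg = sum(a == majority for a in answers) / n  # agreement with majority
    z = (observed_avg - 0.5) * math.sqrt(n) / math.sqrt(p * (1 - p))
    return majority, z, z > z_threshold

print(test_majority(truth=True, p=0.8, n=50))
# e.g. (True, 5.3, True): the majority answer is statistically significant, so act on it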
Here's a starting point: assume the library uses the Dewey decimal system (but any classification system could be substituted).
Question 1: is the book in the 100s?
Question 2: is the book in the 200s?
..
is the book between 50 and 150?
is the book between 150 and 250?
Depends on who you are interviewing for:
Government (non-law enforcement/military) - hire infinite number of staff to check every location in library. Then hire an infinite number of junior managers to manage those staff, add an infinite number of middle managers etc.
Large corporation - same but use unpaid interns.
Government (law enforcement/military) - take the librarian, apply taser or waterboarding until the location of the book is revealed.
Small company (web 2.0 startup) - blog about location of book until somebody tells you.
Small company (real business) - try another library / bookstore.
Is it cheating to ask if the librarian takes commands? If he does, simply tell him to find the book and bring it back to you.
How would you answer this question?
"Thank you for your time." And I'd get up and walk out of the interview room. I'm not interested in working with people who think that asking silly riddles in an interrview is more useful than asking me to write some code or demonstrate how I would plan a project or lead a team.
