What is Youtube comment system sorting / ranking algorithm? [closed] - algorithm

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 4 years ago.
Improve this question
Youtube provides two sorting options: Newest first and Top comments. The "Newest first" is pretty simple that we just sort the comments by their post date. But the "Top comments" seems to be a lot more complex than just sorting by "thumb up"s.
After a short research, I found out that the order of comments depends on those things:
Number of "thumb up"s and "thumb down"s
Post date
Number of replies to that comment
But I don't know how Youtube uses this information to decide the order, like what information is more important and what is less important.
Is there any article about this topic that I could refer to?
Thanks!

I have the answer to your question.
After searching the internet for the answer to this, I never found precisely what I was looking for. So, my colleagues and I decided to experiment using the system with the Youtube comments.
First of all, we sorted what we believed to be popular videos into one section, average videos into another, and less popular into the last. There were 200 videos in each section, and after days of examining we started to notice a pattern. We found that you were right about the three things required, but we also dove a little deeper and found an additional variable.
The Youtube comment system depends on four things:
1) Time it was posted,
2) Like/dislike ratio of a comment,
3) Number of replies,
4) And, believe it or not, WHO posted it.
The average like/dislike ratio of every public comment you've ever posted builds into it, as (what we predicted) they believe that those with low like/dislike ratios would post comments that many people do not like or simply disagree with.
There is an algorithm to it, and it is quite simpler than you might think. Basically there are these things that we called "module points," and you get a certain one based on these four factors. First, here's the things you need to know about module point conversion with TWO of the factors:
For the like/dislike ratio on the comment, multiply that number by ten.
For the amount of replies (NOT from the original poster) that the comment has, there are two module points.
These are the two basic factors that tell the amount of module points the comment has.
For example, if a comment had 27 likes and 8 dislikes, then the ratio would be 3.375. Multiplying by 10, you would then have 33.75 module points. Using the next factor, amount of replies, let's say this comment has 4 direct replies to it. Multiplying 2 by 4, we get 8. This is the part where you add 8 onto the accumulative module points, giving you a total of 41.75 module points.
But we're not done here; this is where it gets tricky.
Using the average like/dislike ratio of a person's total comments that they've ever posted publicly, we found that the formula added onto the accumulative module points is this:
C = MP(R/3) + (MP/10)
where C = Comment Position Variable; MP = Module Points; R = Person's total like/dislike ratio
Trust me, we spend DAYS just on this part, which was probably the most frustrating. Even though the 3 and the 10 within this equation seem random and unnecessary, so far all of the comments we tested this equation on passed the test, but did not pass the test when those two variables were removed. After this equation is done, it gives you a number that we named to be the Position Variable.
However, we are not even done yet, we still haven't talked about time.
I was actually quite surprised that this part didn't take as long as I expected, but it sure was a pain doing this equation every single time for every comment we tested. At first, when testing it, we figured that the time was just there to break the barrier if 2 comments had equal Position Variables.
In fact, I almost called it a wrap on the experiment when this happened, but upon further inspection, we found out there was more to do. We found that some of the comments outranked each other that had the same Position Variable, but the timing seemed to be random! After a few days of inspection, here is where the final result comes in:
There is yet ANOTHER equation that we must find before applying the 4th variable. Using another separate equation, here's what our algebraic deductions came down to:
X = 1/3(S/10 + A) x [absolute value of](A - 3S)
where X = Timing Variable; S = How long ago the video was posted in minutes; A = How long ago the comment was posted in minutes
I wish I was making this up, but unfortunately this is how complicated the system is. There are mathematical reasons behind the other variables, but they are far too complex to explain, it will probably take up atleast three paragraphs worth of explaining. We tested this equation on more than 150 comments, all of them checked out to be true.
Once you find X, which is what we called the Timing Variable, all you have to do from here is apply it to this equation:
N = X(C/4 + 1)
where X = Timing Variable; C = Positioning Variable
N is the answer to all your problems.
This is the final equation, the final answer. The simple conclusion: the higher N, the higher up the comment is.
Note: Special thanks to my colleagues: David Mattison, Josh Williams, Diego Mendieta, Steven Orsette, and Kyle Shropshire. I could have never found out this without them and the work they put into this.

Related

Information Retrieval - Adjancey Matrix Graph Sketch, Teleportation Probability, Calculate PageRank

I am doing a few thing on Information Retrieval and have an exam coming up and I am absolutely clueless. First of, could anyone recommend me the shortest and best description possible for what PageRank actually is in Information Retrieval? Maybe even a good short video or your own description. I know Google use to, or did use it.
I know there are a lot of questions here but I could use as MUCH help as possible in a short length of time.
So my first question (taken from past papers, and making my own examples):
I am wanting to take a table such as:
A B C
A 0 1 0
B 1 0 1
C 0 0 0
And create a graph. I believe this is correct but unsure (I could use a "yes that is correct" or a "no":
And if I was given a graph such as:
The table would be:
A B C
A 0 1 0
B 0 0 1
C 0 0 0
Is that correct? If not, could I please get help and get it described somewhere? The lecture I am reading is not great at explaining and my lecturer isn't great at helping either.
Next I will probably be asked to use Teleportation Probability on the first table. This I desperately need help in. If the probability(the special a symbol)=1/2, does this mean multiply everything, including the 0's in the table such as 0x1/2? also 1x1/2? This is for the matrix of transition probabilities.
Next would be, how can I calculate PageRank from the above matrix. Using matrix multiplication. In words or in Pseudocode.
Another question I want to know is, will a user's page rank on twitter increase if they follow another user? I was assuming this would be a no because they are not following the user back?
Does a user's pagerank depend on how frequently you find said user if you start at a random user and click on another random persona and such till you find them? I assume this one is definitely not true. Because they might not be following said user.
I know this is a lot to ask. Does anyone have tutorials I can follow for either that are not complicated and I can look at and get it mastered today?
Thanks I really appreciate all your help. I know not one person can answer them all but can help provide assistance for some.
here's my stab at answering your questions:
good learning resource:
http://en.wikipedia.org/wiki/PageRank#Simplified_algorithm (no doubt you've see it already, but it's a pretty good one). Start there, understand the algorithm first, then do the implementation.
this might be a good simple method to implement?
http://pr.efactory.de/e-pagerank-algorithm.shtml
or this:
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
I'm guessing you can program in Python (common school language), in that case you might be interested in a package for handling graphs which has pagerank calculations: http://networkx.lanl.gov/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html. If you have to write your own pagerank algorithm (very doable), you could use that to check the results.
For the matrix -> graph conversion question: your professor needs to specify how directionality is encoded in the matrix. Does a 1 at B,C specify a link from B to C or from C to B? My guess would be B to C. If that's true, your first graph is wrong there, but the second graph is ok. Directionality is very important in PageRank.
I believe the Teleportation probability is the probability that a random walker executing a new step will jump to a random node in the graph. It's in the wikipedia page under "damping factor". I don't know how it ties into multiplying numbers in your matrix.
For the Twitter question - yes, I think you have it right. Linking to (or presumably following) a second person does nothing directly to the the first person's pagerank, but it likely increases the second person's pagerank. In practice, there could be secondary effects, like the second person noticing that the first person is interesting and following them back.
second to last question - yes, one formulation of the pagerank algorithm is as a random walk along links with the frequency of encountering a node (page) going into the pagerank.
good luck!

Eliminating deviants while computing score

I have a relatively simple algorithmic problem where I recommend questions for users
I have a set of questions with answers (like, comments for each
answer)
I want to score how engaging each question is.
Current implementation:
(total comments + likes for all answers for a question) / sqrt (number of answers)
Problems:
Sometimes, one answer that has a tonne of activity skews the score for the question, even if the other 20 answers generate very little interest
Some reduction should be applied for questions with very few answers.
Would appreciate any suggestions on these 2 problems can be negated.
Usually when we want to avoid letting one sample from being too powerful, the standard way to do it is by one of these:
use log(N) instead of N, making the effect of each observation less powerful1
leave the "strange" observations out: Take only the middle X%, and use them, for example: take only observations that has 1/4 - 3/4 likes, from the max of this question, and leave the skewing examples out.
For the second issue - one thing I can think of is giving a varainting factor: instead using sqrt(number of answers) - you can try (number_of_answers)^(log(number_of_answers+1)/log(max_answers+1)) where max_answers is the maximal number of answer per question in your data set.
It will result in boosting up questions with few answers, which I think is what you are after.
(1): We usually take log(N+1) - so it will be defined for N==0 as well.

Find most recent & closest posts, limit 20

I saw a question here recently and bookmark it for further thought. This is the question. What I can't determine myself is if this question is really interesting or nothing special?
Why this is, its because it looked to me that it had a real simple answer sort by lowest distance*time product, or am I missing something obvious?
I can explain the reason why it looked simple to me:
Distance is always somewhat constant no matter when or where the query is ran, meaning that if: My home is at point A and there is a post at point B and another post at point C, no matter when I ran the query I will always get the constant values say 5km & 7km.
The time offset since the post looks like it's also somewhat constant in a sense that it grows equally for all posts. Meaning that if post B is from 2004 and post C is from 2009, now they will be 7 years and 2 years ago respectively. So next year it will be 8 and 3 years ago and so on.
Adding a weight value(s) to 'tweak' the distance & time is not any helpful (not needed) since (taking the values from the two post above) 5*7*alpha will always be more then 2*7*aplha hence no matter when we ran the query post C (2*7*aplha) will always be the 'closest most recent'
Also adding a weight constant to 'tweak' the results seems like it's no longer going to product the most closest and recent but will favor either or in which case I may as well sort by most recent and then by most closest or vise versa. But this is no longer the closest more recent but either the closest then more recent or more recent then closest so both those questions are trivial I believe. So this is why I think tweaking is not a good idea no matter what units are chosen to represent the time offset and distance.
Addition doesn't work as well as multiplication I think but distance*time seems to be sufficient to always get the correct result.
So this is what I was thinking but then I thought, no that can't be that simple. So what am I missing here?
The best way to determine the desired sorting expression would be to let some human beings sort some items manually and deduce the expressions from their answers. It may well be that different persons would give different answers, so that one single expression can't accommodate everyone.
There are other useful polynomial expressions such as t*d + A*t + B*d, where t and d are time and distance. Maybe more precise results can be achieved if we introduce one more polynomial degree, so that expression becomes t*d + A*t*t + B*d*d + C*t + D*d. Only from answers of real humans can you devise this formula.

Need help psuedocoding a dice game and dont know where to even start [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am a beginner IT student and doing a project for my programming logic and design class. I need to create a psuedocode for a dice game that allows you 2 rolls with 5 dice. On the first roll you get to pick 1 die to keep. The computer then rolls the other 4 dice and calculates you're score based on what you rolled. There are 3 rolls per game and the total score is displayed. Rolling nothing takes points away. The scoring is: 2 of a kind=50 points, 3 of a kind=75 points, 4 of a kind=100 points and nothing subtracts 50 points.
The whole problem I have is I dont even know where to start. I think I need this to repeat 3 times, but what variables do set? Please someone help me, I cant really ask my instructor because he is outside smoking the whole class and everything I have learned about this class mostly came from the internet and reading the book. I dont want to fail this class...someone please help me through this???
First of all don't panic. What you are about to do is break the task down into small steps.
Pseudo-code is not really code - you can't use it directly as a language, but instead it is just plain english to describe what it is you are doing and the flow of events.
So what are the initial steps to get you started?
Ask yourself what are the facts, what do you know exist in advance. These are the "declarations" that you make.
You have five dice. Each is a seperate object so each gets it's own variable declaration
dice_1
dice_2
dice_3
dice_4
dice_5
Next decide if each die has an initial value
dice_1 initial value = 0
etc...
Next you know that you have to throw the dice a number of times. Throwing is a variable with an initial value
turns initial value = 2
turns_counter initial value = 2
You should be getting the idea now. Are there other things you should declare in advance? I think so!
Next you have to decide what it is you are doing step by step. Is it just a sequence of events or is it repeating? If it's repeating how do you get it to stop?
While turns_counter is less than 2
Repeat the following:
turns_counter = turns_counter + 1
if turns_counter = 2
Throw. Collect_result. Sum_result.
else
Throw. Collect_result. Sum_result. Remove_a_dice.
endif.
perhaps you have to tell the reusable code which objects they are going to be working with? These are parameters that you pass to the reusable code Throw(dice_1) perhaps also you need to update some variables that you created? do it in the reusable code with or without passing them as parameters.
This is by no means complete or perfect, but you should get the idea about what's going on and how to break it down. It could take quite a while to do.
Most languages provide a pseudo-random number generator function that returns a random number within a certain range. I would start by figuring out which language you'll use and which function it provides.
Once you have that, you will need to call it for each roll of each dice. If you are rolling 5 dice, you would call it 5 times. And you would call it 5 more times for a second roll.
That's a start anyway.
You have already almost answered the question by simply writing it down here. There is no strict definition of what pseudocode is. Why don't you start by re-writing what you've described here as a sequence of steps. Then, for each step simply refine that step further until you think you've made it as fine-grain as you like.
You could start with something like this:
Roll 5 dice.
Pick 1 die to keep.
Rolls the other 4 dice
Calculate the score.
// etc...
Quite weird to think that it's easier to ask SO than your instructor! :)
The easiest way to get started on this is to not rigorously bind yourself to the constraint of a specific language, or even to pseudocode. Simply, in natural English, write out how you would do this. Imagine that YOU are the computer, and somebody wants to play the game with you. Just imagine, in very specific detail, what you would do at each potential step, i.e.
Give the user 5 dice
Ask the user to roll them
From that roll, allow the user to pick one die to keep
...etc. Once you have done this, and you are sure it is correct, start transforming it into pseudo code by thinking about what a computer would need to do to solve this problem. For instance, you'll need a variable keeping track of how many points the user as, as well as how many total rolls have occurred. If you were very specific in your English description of the problem, this should mean you basically only need to plug pseudo code into a few sentences you already have - in other words, you're just substituting one type of pseudo code for another.
I'd like to help, but straight-up providing the pseudo code wouldn't be very helpful to you. One of the hardest steps in beginning programming is learning to break a problem down into its constituent elements. That type of granular thinking is unintuitive at first, but gets easier the more time you spend on it.
Well, pseudo-code, in my experience, is best drawn up when you pretend you're writing up the work for someone else to do:
THINGS WE NEED
Dice
Players
Score
THINGS WE TRACK
Dice rolls
Player score
THINGS WE KNOW
(These are also called constants)
Nothing (-50)
2 of a kind (+50)
3 of a kind (+75)
4 of a kind (+100)
All of these are vital tools to getting started. And...well, asking questions on stackoverflow.
Next, define your "actions" (things we do), which utilizes the above known things that we will need.
I would start the same place I always do: creating our things.
def player():
"""Create a new player"""
def dice():
"""Creates 4 new, 6 sided dice"""
def welcome():
"""Welcome player by name, give option to quit"""
def game():
"""Initialize number of turns (start at 0)"""
def humanturn():
"""Roll dice, display, ask which one they'll keep"""
def compturn():
"""Roll four dice"""
def check():
"""Check for any matches in the dice"""
def score():
"""Tally up the score for any matches"""
def endturn():
"""Update turn(s), update total score"""
def gameover():
"""Display name, total score, ask for retry"""
def quit():
"""Quit the game"""
Those are your components, all fleshed out in a very procedural manner. There are many other ways to do this that are much better, but for now you're just writing the skeleton of an idea. You may be tempted to combine many of these methods together when you're ready to start coding, but it's a good idea to separate everything until you're confident you won't get lost chasing down a bug.
Good luck!

How would I find a book in a large library?

I found the following question while preparing for an interview:
You are in a very huge library that
has no computer access, and you're
looking for one particular book.
You look up where the book suppose to
be from the card catalog, and went to
shelf X to find it.
However the book is not there.
There is only one person that can
answer questions, which is the
libarian, but he only answers yes/no
responses. Plus, his answers might not
be correct.
What is your strategy for finding this
book?
How would you answer this question? What methods of searching would you use?
Use Binary search type questions to narrow the location of the book.
Each question should narrow the search field by half.
"Is the book on this half of the library"? (Point to the right direction).
Would work as an initial question.
You can also use The Knight and the Knave as part of your method of questioning the person. Your first 5 questions (to establish a baseline) could be about things you 'know'. You could determine his error rate from there. After that, you can use Binary Search-esque questions to determine where the book is.
Ask the interviewer for more information about the librarian and go from there. In particular, find out if he's susceptible to bribery (I mean the librarian, but come to think of it this might go for the interviewer as well).
Double-check for dumb mistakes (wrong card, wrong shelf, "661-88" is reall "88-199" and so on).
Search the drawer of borrowed-book cards. If it's been borrowed, note the due date and come back later, or note the borrower's home address and go to plan B.
Look in the vicinity, a few books in either direction and the shelves above and below, in case it was incorrectly reshelved.
Check the tables, floors, photocopiers and return carts.
Look for a gap on the shelf. If there is a gap in the right spot then at least you know you're looking in the right place. If there's no gap then look for a book on that shelf that doesn't belong-- somebody may have swapped them by mistake. If there's no such misplaced book then maybe the book was never on this shelf, see below.
Look for dust on the shelf. It might indicate whether a book has been removed within the past month. Likewise check the index card for signs of age. The flowchart gets a little complicated, but the book may have been lost years ago.
Check the index system: if the book doesn't have the right number for its subject/title/author/whatever, then there is a typo on the index card and you must calculate the correct number yourself to find out where the book really is.
Just go out and buy the damned book, your time is more valuable than this.
Step A: Calibrate your Librarian.
Pick a random book in the library, walk to a random spot and then ask the Librarian if the book (whose location you know) is to your left. Keep testing the Librarian until you have a good estimate of the probability, p, that Librarian answers correctly. Note that if p < 0.5 then you are better off following the opposite of whatever Librarian tells you. If p=0.5 then give up on Librarian -- her responses are no better than a flip of a coin.
If you find that p depends on the question asked (for example, if the Librarian always answers certain questions correctly, but other questions always falsely), then go to Step B1.
Step B1:
If p==0.5 or p depends on the question asked, start thinking outside the box, like Beta suggests.
Step B2:
If p < 0.5, reverse the answer the Librarian gives, and proceed to Step B3.
Step B3:
If p > 0.5: Choose N. If p is close to 1, then N can be a low number like 10. If p is very close to 0.5, then choose N large, like 1000. The right value of N depends on p and how confident you wish to be.
Ask the Librarian the same question N times ("Is the book I'm looking for to my left").
Assume for the moment that whatever response is given more frequently is the "correct answer". Calculate the average response, assigning 1 for the "correct answer" and 0 for the wrong answer. Call this the "observed average".
The responses are like draws from a box with 2 tickets (the right answer and the wrong answer.) The standard deviation of a sample of N draws will be sqrt(pq), where q = 1-p.
The standard error of the average is sqrt(pq/N).
Take the null hypothesis to be that p=0.5 -- that the Librarian is simply giving random responses. The "expected average" (assuming the null hypthesis) is 1/2.
The z-statistic is the
(observed average - expected average)/(standard error of the average) =
(observed average - 0.5)*sqrt(N)/(sqrt(p*q))
The z-statistic follows a normal distribution. If the z-statistic is > 1.65 then you
have about a 95% chance the average response of the Librarian is statistically
significant. If after N questions z is less than 1.65, repeat Step B3 until you get statistically significant response. Note that the larger you choose N, the larger the z-statistic will be, and the easier it will be to obtain statistically significant results.
Step C:
Once you get a statistically significant response, you act upon it (using George Stocker's binary search idea) and hope you have not been statistically unlucky. :)
PS. Although the library might be 3-dimensional, you could play the Binary Search game along the x-axis, then the y-axis, then the z-axis. So the 3-dimensional problem can be reduced to solving 3 (1-dimensional problems).
here's a starting point: Assume the library uses the Dewey decimal system (but any classification system could be substituted).
Question 1: is the book in the 100s?
Question 2: is the book in the 200s?
..
is the book between 50 and 150?
is the book between 150 and 250?
Depends on who you are interviewing for:
Government (non-law enforcement/military) - hire infinite number of staff to check every location in library. Then hire an infinite number of junior managers to manage those staff, add an infinite number of middle managers etc.
Large corporation - same but use unpaid interns.
Government (law enforcement/military) - take librarian, apply tazer or waterboarding until location of book is revealed.
Small company (web 2.0 startup) - blog about location of book until somebody tells you.
Small company (real business) - try another library / bookstore.
Is it cheating to ask if the librarian takes commands? If he does, simply tell him to find the book and bring it back to you.
How would you answer this question?
"Thank you for your time." And I'd get up and walk out of the interview room. I'm not interested in working with people who think that asking silly riddles in an interrview is more useful than asking me to write some code or demonstrate how I would plan a project or lead a team.

Resources