Information Retrieval - Adjacency Matrix Graph Sketch, Teleportation Probability, Calculate PageRank - matrix

I am doing a few things on Information Retrieval, have an exam coming up, and I am absolutely clueless. First off, could anyone recommend the shortest and best possible description of what PageRank actually is in Information Retrieval? Maybe even a good short video or your own description. I know Google used it, or used to.
I know there are a lot of questions here but I could use as MUCH help as possible in a short length of time.
So my first question (taken from past papers, and making my own examples):
I am wanting to take a table such as:
A B C
A 0 1 0
B 1 0 1
C 0 0 0
And create a graph. I believe this is correct but I'm unsure (I could use a "yes that is correct" or a "no"):
And if I was given a graph such as:
The table would be:
A B C
A 0 1 0
B 0 0 1
C 0 0 0
Is that correct? If not, could I please get help, or a pointer to where this is described? The lecture notes I am reading are not great at explaining this, and my lecturer isn't great at helping either.
Next I will probably be asked to apply the teleportation probability to the first table. This I desperately need help with. If the probability (the special α symbol) = 1/2, does this mean I multiply everything in the table by 1/2, including the 0s (0 × 1/2, 1 × 1/2)? This is for the matrix of transition probabilities.
Next would be: how can I calculate PageRank from the above matrix, using matrix multiplication? In words or in pseudocode.
Another question I want to ask: will a user's PageRank on Twitter increase if they follow another user? I was assuming the answer is no, because the other user is not following them back.
Does a user's PageRank depend on how frequently you find said user if you start at a random user and keep clicking through to other random users until you reach them? I assume this one is definitely not true, because those users might not be following said user.
I know this is a lot to ask. Does anyone have uncomplicated tutorials I can follow to get this mastered today?
Thanks, I really appreciate all your help. I know no one person can answer everything, but any assistance with some of it helps.

here's my stab at answering your questions:
good learning resource:
http://en.wikipedia.org/wiki/PageRank#Simplified_algorithm (no doubt you've seen it already, but it's a pretty good one). Start there, understand the algorithm first, then do the implementation.
this might be a good simple method to implement?
http://pr.efactory.de/e-pagerank-algorithm.shtml
or this:
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
I'm guessing you can program in Python (common school language), in that case you might be interested in a package for handling graphs which has pagerank calculations: http://networkx.lanl.gov/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html. If you have to write your own pagerank algorithm (very doable), you could use that to check the results.
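For example, a minimal sanity check with networkx for your first matrix could look like this:

    import networkx as nx

    # The first matrix from the question, read as "row links to column"
    # (see the directionality caveat below): A->B, B->A, B->C.
    G = nx.DiGraph([("A", "B"), ("B", "A"), ("B", "C")])

    # Note: networkx's alpha is the damping factor (probability of FOLLOWING
    # a link), so a teleportation probability of 1/2 also means alpha = 0.5.
    print(nx.pagerank(G, alpha=0.5))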
For the matrix -> graph conversion question: your professor needs to specify how directionality is encoded in the matrix. Does a 1 at B,C specify a link from B to C or from C to B? My guess would be B to C. If that's true, your first graph is wrong there, but the second graph is ok. Directionality is very important in PageRank.
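Under the row-to-column reading, decoding a matrix into directed edges is just a couple of lines (a sketch; flip the indices if your course uses the opposite convention):

    labels = ["A", "B", "C"]
    M = [[0, 1, 0],   # the first matrix from the question
         [1, 0, 1],
         [0, 0, 0]]

    # M[i][j] == 1 is read as "labels[i] links to labels[j]".
    edges = [(labels[i], labels[j])
             for i in range(len(M))
             for j in range(len(M))
             if M[i][j] == 1]
    print(edges)   # [('A', 'B'), ('B', 'A'), ('B', 'C')]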
I believe the Teleportation probability is the probability that a random walker executing a new step will jump to a random node in the graph. It's in the wikipedia page under "damping factor". I don't know how it ties into multiplying numbers in your matrix.
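That said, the recipe in Manning et al.'s Introduction to Information Retrieval (which I'd guess is what your course follows; double-check against your notes) is not simply to multiply the raw 0/1 table by α. You normalize each row into probabilities, scale by (1 − α), and add α/N to every entry; a row with no out-links just becomes 1/N everywhere. A minimal numpy sketch, including the power-iteration answer to your "calculate PageRank by matrix multiplication" question:

    import numpy as np

    def transition_matrix(A, alpha):
        """Turn a 0/1 adjacency matrix into PageRank transition probabilities,
        with teleportation probability alpha (the IR-book construction)."""
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        P = np.empty_like(A)
        for i in range(n):
            out = A[i].sum()
            if out == 0:
                P[i] = 1.0 / n                      # dangling row: jump anywhere
            else:
                P[i] = (1 - alpha) * A[i] / out + alpha / n
        return P

    def pagerank(P, iterations=100):
        """Power iteration: repeatedly multiply the distribution by P."""
        x = np.full(P.shape[0], 1.0 / P.shape[0])   # start uniform
        for _ in range(iterations):
            x = x @ P
        return x

    A = [[0, 1, 0],                                 # first matrix from the question
         [1, 0, 1],
         [0, 0, 0]]
    P = transition_matrix(A, alpha=0.5)
    print(P)            # e.g. row A becomes [1/6, 2/3, 1/6]
    print(pagerank(P))  # the PageRank vector (sums to 1)

In other words: start from the uniform vector and keep multiplying by the transition matrix until it stops changing.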
For the Twitter question - yes, I think you have it right. Linking to (or presumably following) a second person does nothing directly to the first person's pagerank, but it likely increases the second person's pagerank. In practice, there could be secondary effects, like the second person noticing that the first person is interesting and following them back.
Second-to-last question - yes, one formulation of the pagerank algorithm is as a random walk along links, with the frequency of encountering a node (page) going into the pagerank.
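You can see that formulation directly with a toy simulation; the visit frequencies come out close to the power-iteration result above:

    import random

    links = {"A": ["B"], "B": ["A", "C"], "C": []}   # same toy graph as above
    nodes = list(links)
    counts = dict.fromkeys(nodes, 0)

    node, steps = "A", 100_000
    for _ in range(steps):
        # teleport with probability 1/2 (always teleport from a dead end)
        if not links[node] or random.random() < 0.5:
            node = random.choice(nodes)
        else:
            node = random.choice(links[node])
        counts[node] += 1

    print({n: round(counts[n] / steps, 3) for n in nodes})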
good luck!

Related

Which algorithm for assigning shift patterns

I'm looking for a good algorithm or technique to find the best solution to the following problem. First I'll introduce the context, and then the problem.
I work for a company with more than 2000 employees; all of them work on shift patterns, meaning each employee has a pattern which specifies the sequence of workdays and free days. We have these patterns:
5-2-5-2 (5 days work, 2 free, 5 days work, 2 free) and so on.
5-2-4-3
5-4-5-3
5-3-5-3
At the moment we have all these patterns and different starting offsets; that is to say, a pattern can start on a certain date at a specific position inside the pattern. For example, the pattern 5-4-5-3 has 17 possible starting positions, since 5+4+5+3 = 17.
https://en.wikipedia.org/wiki/Shift_plan
Now the problem,
Every 6 months each employee can change their pattern and start at any sequence number of the pattern.
But we must analyze all the requests and accept or reject them in order to obtain the best combination for the company's operation, because we need every day to have the same workforce. We understand that this is impossible, but the algorithm should help us find a good solution, not a perfect one.
I was reading about the "Nurse scheduling problem" with Google OR-Tools, but I don't understand how to set up the pattern sequences to create a solution for this problem. I read some opinions about GAs (genetic algorithms), and all of them said that this kind of approach is not good for this kind of problem.
Does anyone have a similar problem? Can someone give me a more complete example with Google OR-Tools than the one on GitHub?
It's not necessary to find a strictly optimal solution; the roster is currently done manually, and I'm pretty sure the result is considerably sub-optimal most of the time.
Does anyone have a similar problem?
Sounds a lot like OptaWeb Employee Rostering, which is a vertical on top of OptaPlanner, the constraint solver. Take a look at the source code. It's all open source.
I think this can be modeled as a MIP model.
Thinking out loud:
Introduce a binary decision variable:
δ(i,p) = 1 if pattern i is selected for person p,
         0 otherwise
This includes the current pattern (say i=0). This would allow the cases:
an employee does not submit a new pattern (then we only have i=0 for this employee)
an employee submits one or more preferable patterns
We have the constraints:
sum(i, δ(i,p)) = 1 ∀p
sum((i,p), pattern(i,p,t)*δ(i,p)) ≈ requiredlevel(t) ∀t
δ(i,p) ∈ {0,1}
Here pattern(i,p,t) describes pattern i for person p: it is 1 if period t is covered when pattern (i,p) is used, and 0 otherwise. I use ≈ to indicate "approximately"; this is easily modeled using slacks and possibly penalty terms in the objective.
Now we maximize
maximize sum((i,p), weight(i,p) * δ(i,p))
where weight(i,p) indicates the preference for a pattern (e.g. weight(0,p) = 0, i.e. no bonus points for an employee who did not select a new, preferred pattern).
Something like this should not be too difficult to set up. Of course many refinements are possible. These types of models tend to solve quite quickly.
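For concreteness, here is a toy sketch of this model with Google OR-Tools' linear solver (pywraplp). All data here (patterns, required levels, weights, the penalty constant) is made up, so treat it as a template rather than a drop-in solution:

    from ortools.linear_solver import pywraplp

    P = 3           # employees (toy size)
    I = 2           # candidate patterns per employee (i = 0 is the current one)
    T = 7           # planning days
    required = [2, 2, 2, 2, 2, 1, 1]          # required staffing per day (example)
    # pattern[i][p][t] = 1 if employee p works day t under pattern i (example data)
    pattern = [[[1, 1, 1, 1, 1, 0, 0] for _ in range(P)] for _ in range(I)]
    weight = [[0 if i == 0 else 1 for _ in range(P)] for i in range(I)]
    PENALTY = 10    # cost per unit of over/under-staffing

    solver = pywraplp.Solver.CreateSolver("SCIP")   # CBC also works
    delta = [[solver.BoolVar(f"d_{i}_{p}") for p in range(P)] for i in range(I)]
    over = [solver.NumVar(0, solver.infinity(), f"over_{t}") for t in range(T)]
    under = [solver.NumVar(0, solver.infinity(), f"under_{t}") for t in range(T)]

    for p in range(P):                        # exactly one pattern per employee
        solver.Add(solver.Sum([delta[i][p] for i in range(I)]) == 1)
    for t in range(T):                        # staffing ≈ required, via slacks
        staffed = solver.Sum([pattern[i][p][t] * delta[i][p]
                              for i in range(I) for p in range(P)])
        solver.Add(staffed - required[t] == over[t] - under[t])

    solver.Maximize(
        solver.Sum([weight[i][p] * delta[i][p]
                    for i in range(I) for p in range(P)])
        - PENALTY * solver.Sum([over[t] + under[t] for t in range(T)]))

    if solver.Solve() == pywraplp.Solver.OPTIMAL:
        for p in range(P):
            chosen = [i for i in range(I) if delta[i][p].solution_value() > 0.5]
            print(f"employee {p}: pattern {chosen[0]}")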
What is the workflow?
If you have a fixed roster and one person proposes a new pattern, just remove this person's contribution, test all (17) starting points of the new pattern, and score them.
If you can change patterns or starting points for more than one employee, create an integer variable per starting point. From this starting point, it is easy to compute the person's contribution for each shifted day of the pattern. Then you can optimize quality of service with respect to the starting points of each pattern, summing potential contributions per day of the week for each employee.
Is that clear?

What is Youtube comment system sorting / ranking algorithm? [closed]

Youtube provides two sorting options: "Newest first" and "Top comments". "Newest first" is simple: the comments are just sorted by their post date. But "Top comments" seems to be a lot more complex than just sorting by the number of thumbs up.
After a little research, I found that the order of comments depends on these things:
Number of thumbs up and thumbs down
Post date
Number of replies to that comment
But I don't know how Youtube uses this information to decide the order, i.e. which factors are more important and which are less important.
Is there any article about this topic that I could refer to?
Thanks!
I have the answer to your question.
After searching the internet for an answer to this, I never found precisely what I was looking for. So my colleagues and I decided to experiment with the Youtube comment system.
First of all, we sorted what we believed to be popular videos into one section, average videos into another, and less popular into the last. There were 200 videos in each section, and after days of examination we started to notice a pattern. We found that you were right about the three factors, but we also dug a little deeper and found an additional variable.
The Youtube comment system depends on four things:
1) Time it was posted,
2) Like/dislike ratio of a comment,
3) Number of replies,
4) And, believe it or not, WHO posted it.
The average like/dislike ratio of every public comment you've ever posted feeds into it, because (we predicted) they believe that users with a low like/dislike ratio tend to post comments that many people dislike or simply disagree with.
There is an algorithm to it, and it is simpler than you might think. Basically there are these things we called "module points," and a comment earns a certain number of them based on those four factors. First, here's what you need to know about module point conversion for TWO of the factors:
For the like/dislike ratio on the comment, multiply that number by ten.
For the number of replies (NOT from the original poster) that the comment has, each reply is worth two module points.
These are the two basic factors that tell the amount of module points the comment has.
For example, if a comment had 27 likes and 8 dislikes, the ratio would be 3.375. Multiplying by 10, you would then have 33.75 module points. Using the next factor, number of replies, let's say this comment has 4 direct replies. Multiplying 2 by 4, we get 8. Adding 8 onto the accumulated module points gives a total of 41.75 module points.
But we're not done here; this is where it gets tricky.
Using the average like/dislike ratio across all the comments a person has ever posted publicly, we found that the formula applied to the accumulated module points is this:
C = MP(R/3) + (MP/10)
where C = Comment Position Variable; MP = Module Points; R = Person's total like/dislike ratio
Trust me, we spent DAYS just on this part, which was probably the most frustrating. Even though the 3 and the 10 in this equation seem random and unnecessary, so far every comment we tested this equation on passed the test, and failed it when those two constants were removed. Once this equation is done, it gives you a number that we named the Position Variable.
However, we are not even done yet, we still haven't talked about time.
I was actually quite surprised that this part didn't take as long as I expected, but it sure was a pain doing this equation every single time for every comment we tested. At first, when testing it, we figured that the time was just there to break ties when 2 comments had equal Position Variables.
In fact, I almost called it a wrap on the experiment at that point, but upon further inspection we found there was more to do: some comments with the same Position Variable outranked each other, and the timing seemed to be random! After a few more days of inspection, here is where the final result comes in:
There is yet ANOTHER equation we must work out before applying the 4th variable. Here's what our algebraic deductions came down to:
X = (1/3)(S/10 + A) * |A - 3S|
where X = Timing Variable; S = How long ago the video was posted in minutes; A = How long ago the comment was posted in minutes
I wish I was making this up, but unfortunately this is how complicated the system is. There are mathematical reasons behind the other constants, but they are far too complex to explain; it would probably take at least three paragraphs. We tested this equation on more than 150 comments, and all of them checked out.
Once you find X, which is what we called the Timing Variable, all you have to do from here is apply it to this equation:
N = X(C/4 + 1)
where X = Timing Variable; C = Position Variable
N is the answer to all your problems.
This is the final equation, the final answer. The simple conclusion: the higher N, the higher up the comment is.
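If you want to play with it, here is the whole chain written out as code, a direct transcription of the equations above (the guard against zero dislikes is my own addition; we never tested that case):

    def comment_score(likes, dislikes, replies, user_ratio,
                      video_age_min, comment_age_min):
        """The full 'module point' chain exactly as described above."""
        ratio = likes / max(dislikes, 1)          # zero-dislike guard: my addition
        mp = 10 * ratio + 2 * replies             # module points
        c = mp * (user_ratio / 3) + mp / 10       # C: the Position Variable
        s, a = video_age_min, comment_age_min
        x = (1 / 3) * (s / 10 + a) * abs(a - 3 * s)   # X: the Timing Variable
        return x * (c / 4 + 1)                    # N: higher ranks higher

    # The worked example from above: 27 likes, 8 dislikes, 4 replies
    # (user_ratio and the two ages in minutes are made-up inputs).
    print(comment_score(27, 8, 4, user_ratio=2.0,
                        video_age_min=600, comment_age_min=60))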
Note: Special thanks to my colleagues: David Mattison, Josh Williams, Diego Mendieta, Steven Orsette, and Kyle Shropshire. I could have never found out this without them and the work they put into this.

Find most recent & closest posts, limit 20

I saw a question here recently and bookmarked it for further thought: this question. What I can't determine myself is whether this question is really interesting or nothing special.
The reason is that it looked to me like it had a really simple answer: sort by the lowest distance*time product. Or am I missing something obvious?
I can explain the reason why it looked simple to me:
Distance is always somewhat constant no matter when or where the query is run. If my home is at point A and there is a post at point B and another post at point C, then no matter when I run the query I will always get the same constant values, say 5 km and 7 km.
The time offset since the post also looks somewhat constant, in the sense that it grows equally for all posts. If post B is from 2004 and post C is from 2009, they are now 7 years and 2 years old respectively. Next year they will be 8 and 3 years old, and so on.
Adding weight value(s) to 'tweak' the distance and time is not helpful (or needed), since (taking the values from the two posts above) 5*7*alpha will always be more than 2*7*alpha. So no matter when we run the query, post C (2*7*alpha) will always be the 'closest most recent'.
Also, adding a weight constant to 'tweak' the results seems like it will no longer produce the closest and most recent, but will favor one or the other, in which case I may as well sort by most recent and then closest, or vice versa. But that is no longer the closest most recent; it is either the closest-then-most-recent or the most-recent-then-closest, and both of those questions are trivial, I believe. So this is why I think tweaking is not a good idea, no matter what units are chosen to represent the time offset and distance.
Addition doesn't work as well as multiplication, I think, but distance*time seems to be sufficient to always get the correct result.
So this is what I was thinking, but then I thought: no, it can't be that simple. So what am I missing here?
The best way to determine the desired sorting expression would be to let some human beings sort some items manually and deduce the expression from their answers. It may well be that different people would give different answers, so that no single expression can accommodate everyone.
There are other useful polynomial expressions, such as t*d + A*t + B*d, where t and d are time and distance. Maybe more precise results can be achieved if we introduce one more polynomial degree, so that the expression becomes t*d + A*t*t + B*d*d + C*t + D*d. Only from the answers of real humans can you fit such a formula.
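For illustration, a sketch of ranking with the first polynomial; the weights A and B are exactly what you would fit from the human-ranked examples (with A = B = 0 it degenerates to the plain distance*time product from the question):

    # Rank posts by t*d + A*t + B*d; lower score = better.
    def score(t_years, d_km, A=0.0, B=0.0):
        return t_years * d_km + A * t_years + B * d_km

    posts = [("B", 7, 5.0),    # post B: 7 years old, 5 km away
             ("C", 2, 7.0)]    # post C: 2 years old, 7 km away
    top20 = sorted(posts, key=lambda p: score(p[1], p[2]))[:20]
    print([name for name, _, _ in top20])   # ['C', 'B']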

What class of algorithms can be used to solve this?

EDIT: Just to make sure nobody is breaking their head on the problem... I am not looking for the optimal algorithm. Some heuristic that makes sense is fine.
I made a previous attempt at formulating this and realized I did not do a great job at it so I removed that question. I have taken another shot at formulating my problem. Please feel free to provide any constructive criticism that can help me improve this.
Input:
N people
k announcements that I can make
The distance at which my voice can be heard (say 5 meters), i.e. I may decide to announce or not depending on the number of people within these 5 meters
Goal:
Maximize the total number of people who have heard my k announcements and (optionally) minimize the time in which I can finish announcing all k announcements
Constraints:
Once a person hears my announcement, he is removed from the total; i.e., if he heard my first announcement, I do not count him even if he hears my second announcement
The same person, as well as the same set of people, can show up within my proximity again
Example:
Let us consider 10 people numbered from 1 to 10 and the following pattern of arrival:
Time slot 1: 1 (payoff = 1)
Time slot 2: 2 3 4 5 (payoff = 4)
Time slot 3: 5 6 7 8 (payoff = 4 if no announcement was made previously in time slot 2, 3 if an announcement was made in time slot 2)
Time slot 4: 9 10 (payoff = 2)
and I am given 2 announcements to make. Now if I were an oracle, I would choose time slots 2 and 3, because then 7 people would have heard an announcement (since 5 already heard my announcement in time slot 2, I do not count him again). I am looking for an online algorithm that will help me decide whether or not to make an announcement, and if so based on what factors. Does anyone have any ideas on what algorithms can be used to solve this, or a simpler version of this problem?
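To make the oracle case concrete, here is a little brute-force sketch (assuming the whole arrival sequence is known in advance, which is exactly what an online algorithm does not get); it confirms that slots 2 and 3 cover 7 people:

    from itertools import combinations

    # Pick the k slots that cover the most distinct people.
    slots = [{1}, {2, 3, 4, 5}, {5, 6, 7, 8}, {9, 10}]   # the example above
    k = 2

    best = max(combinations(range(len(slots)), k),
               key=lambda c: len(set().union(*(slots[i] for i in c))))
    covered = set().union(*(slots[i] for i in best))
    print(best, len(covered))   # (1, 2) -> time slots 2 and 3, 7 people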
There should be an approach relying upon a max-flow algorithm. In essence, you're trying to push the maximum number of messages from start to end. Though it would be multidimensional, you could have a super-source which connects to each value of t, then have each value of t connect to the people you can reach at that time, and then have a super-sink. This way, you simply have to compute a max-flow (with the added constraint of no more than k shouts, which should be solvable with a bit of dynamic programming). It's a terrifically dirty way to solve it, but it should get the job done deterministically and without the use of heuristics.
I don't know that there is really a way to solve this, or an algorithm to do it, the way you have formulated it.
It seems like basically you are trying to reach the maximum number of people with exactly 2 announcements. But without knowing any information about the groups of people in advance, you can't really make any kind of intelligent decision about whether or not to use your first announcement. Your second one at least has the benefit of knowing when not to be used (i.e. if the group has no new members then you know it's not worth wasting the announcement). But it still has basically the same problem.
The only real way to solve this is to use knowledge about the type of data or the desired outcome to make guesses. If you know that groups average 100 people with a standard deviation of 10, then you could just refuse to announce if less than 90 people are present. Or, if you know you need to reach at least 100 people with two announcements, you could choose never to announce to less than 50 at once. Obviously those approaches risk never announcing at all if the actual data does not meet what you would expect. But that's always going to be a risk, since you could get 1 person in the first group and then 0 in all of the rest, no matter what you do.
Or, you could try more clearly defining the problem, I have a hard time figuring out how to relate this to computers.
Let's start by trying to solve the simplest possible variant of the problem: assume N people and K timeslots, but only one possible announcement. Let's also assume that each person will only ever stay for one timeslot and that each person who hasn't yet shown up has an equally probable chance of showing up at any future timeslot.
Given these simplifications, at each timeslot you look at the payoff of announcing at the current timeslot and compare it to the chance of a future timeslot having a higher payoff. E.g., let's assume 4 people and 3 timeslots:
Timeslot 1: Person 1 shows up, so you know you could get a payoff of 1 by announcing; but then you have 3 people to show up in 2 remaining timeslots, so at least one of those timeslots is guaranteed to have 2 people. So don't announce.
So at each timeslot, you can calculate the chance that a later timeslot will have a higher payoff than the current one by treating the remaining (N) people and (K) timeslots as N independent random numbers, each from 1..K, and calculating the chance of at least one value k being hit at least current-payoff times (similar to the birthday problem, but for more than one collision). Then you need to decide how much to discount based on expected variances (a bird in the hand, etc.).
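That probability is easy to estimate by simulation. A quick Monte Carlo sketch (the exact combinatorial calculation would replace this in a real solution):

    import random

    def chance_of_better_slot(n_people, k_slots, payoff, trials=10_000):
        """Estimate P(some remaining slot gets >= payoff people), treating the
        n remaining people as uniform random choices among the k slots."""
        hits = 0
        for _ in range(trials):
            counts = [0] * k_slots
            for _ in range(n_people):
                counts[random.randrange(k_slots)] += 1
            hits += max(counts) >= payoff
        return hits / trials

    # Timeslot 1 of the example: payoff 1 now, 3 people and 2 slots left.
    # A later slot is guaranteed 2 people (pigeonhole), so don't announce.
    print(chance_of_better_slot(3, 2, 2))   # 1.0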
Generalization of this solution to the original problem is left as an exercise for the reader.

How would I find a book in a large library?

I found the following question while preparing for an interview:
You are in a very huge library that has no computer access, and you're looking for one particular book. You look up where the book is supposed to be in the card catalog, and go to shelf X to find it. However, the book is not there.
There is only one person who can answer questions, the librarian, but he only gives yes/no responses. Plus, his answers might not be correct.
What is your strategy for finding this book?
How would you answer this question? What methods of searching would you use?
Use binary-search-type questions to narrow down the location of the book. Each question should halve the search field. "Is the book in this half of the library?" (pointing in the right direction) would work as an initial question.
You can also use the Knights and Knaves puzzle as part of your method of questioning the librarian. Your first 5 questions (to establish a baseline) could be about things you already 'know'; you could determine his error rate from there. After that, you can use binary-search-esque questions to determine where the book is.
Ask the interviewer for more information about the librarian and go from there. In particular, find out if he's susceptible to bribery (I mean the librarian, but come to think of it this might go for the interviewer as well).
Double-check for dumb mistakes (wrong card, wrong shelf, "661-88" is really "88-199", and so on).
Search the drawer of borrowed-book cards. If it's been borrowed, note the due date and come back later, or note the borrower's home address and go to plan B.
Look in the vicinity, a few books in either direction and the shelves above and below, in case it was incorrectly reshelved.
Check the tables, floors, photocopiers and return carts.
Look for a gap on the shelf. If there is a gap in the right spot then at least you know you're looking in the right place. If there's no gap then look for a book on that shelf that doesn't belong-- somebody may have swapped them by mistake. If there's no such misplaced book then maybe the book was never on this shelf, see below.
Look for dust on the shelf. It might indicate whether a book has been removed within the past month. Likewise check the index card for signs of age. The flowchart gets a little complicated, but the book may have been lost years ago.
Check the index system: if the book doesn't have the right number for its subject/title/author/whatever, then there is a typo on the index card and you must calculate the correct number yourself to find out where the book really is.
Just go out and buy the damned book, your time is more valuable than this.
Step A: Calibrate your Librarian.
Pick a random book in the library, walk to a random spot, and then ask the Librarian if the book (whose location you know) is to your left. Keep testing the Librarian until you have a good estimate of the probability, p, that the Librarian answers correctly. Note that if p < 0.5 then you are better off following the opposite of whatever the Librarian tells you. If p = 0.5, give up on the Librarian -- her responses are no better than a coin flip.
If you find that p depends on the question asked (for example, if the Librarian always answers certain questions correctly, but other questions always falsely), then go to Step B1.
Step B1:
If p==0.5 or p depends on the question asked, start thinking outside the box, like Beta suggests.
Step B2:
If p < 0.5, reverse the answer the Librarian gives, and proceed to Step B3.
Step B3:
If p > 0.5: choose N. If p is close to 1, then N can be a low number like 10. If p is very close to 0.5, choose N large, like 1000. The right value of N depends on p and on how confident you wish to be.
Ask the Librarian the same question N times ("Is the book I'm looking for to my left?").
Assume for the moment that whichever response is given more frequently is the "correct answer". Calculate the average response, assigning 1 to the "correct answer" and 0 to the wrong answer. Call this the "observed average".
The responses are like draws from a box with 2 tickets (the right answer and the wrong answer). The standard deviation of each draw is sqrt(pq), where q = 1 - p.
The standard error of the average is sqrt(pq/N).
Take the null hypothesis to be that p = 0.5 -- that the Librarian is simply giving random responses. The "expected average" (assuming the null hypothesis) is 1/2.
The z-statistic is

(observed average - expected average) / (standard error of the average) = (observed average - 0.5) * sqrt(N) / sqrt(p*q)

The z-statistic follows a normal distribution. If the z-statistic is > 1.65, then you can be about 95% confident that the Librarian's majority response is not due to chance. If after N questions z is less than 1.65, repeat Step B3 until you get a statistically significant response. Note that the larger you choose N, the larger the z-statistic will be, and the easier it will be to obtain statistically significant results.
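Here is Step B3 as a small sketch (using the fact that sqrt(p*q) = 0.5 under the null hypothesis; the 1.65 cutoff is the one-sided 95% threshold mentioned above):

    import math

    def majority_with_significance(answers, z_crit=1.65):
        """Given N yes/no answers (True/False) to the SAME question, return
        the majority answer if it beats coin-flipping; else None (keep asking)."""
        n = len(answers)
        yes = sum(answers)
        observed_avg = max(yes, n - yes) / n          # frequency of the majority
        z = (observed_avg - 0.5) * math.sqrt(n) / 0.5
        if z > z_crit:
            return yes > n - yes                      # the majority answer
        return None                                   # not significant yet

    # 14 "yes" out of 20 askings: z ~ 1.79 > 1.65, so trust the "yes".
    print(majority_with_significance([True] * 14 + [False] * 6))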
Step C:
Once you get a statistically significant response, you act upon it (using George Stocker's binary search idea) and hope you have not been statistically unlucky. :)
PS. Although the library might be 3-dimensional, you can play the binary search game along the x-axis, then the y-axis, then the z-axis, so the 3-dimensional problem reduces to three 1-dimensional problems.
Here's a starting point: assume the library uses the Dewey decimal system (but any classification system could be substituted).
Question 1: is the book in the 100s?
Question 2: is the book in the 200s?
...
is the book between 50 and 150?
is the book between 150 and 250?
Depends on who you are interviewing for:
Government (non-law-enforcement/military) - hire an infinite number of staff to check every location in the library. Then hire an infinite number of junior managers to manage those staff, add an infinite number of middle managers, etc.
Large corporation - same, but use unpaid interns.
Government (law enforcement/military) - take the librarian and apply taser or waterboarding until the location of the book is revealed.
Small company (web 2.0 startup) - blog about the location of the book until somebody tells you.
Small company (real business) - try another library / bookstore.
Is it cheating to ask if the librarian takes commands? If he does, simply tell him to find the book and bring it back to you.
How would you answer this question?
"Thank you for your time." And I'd get up and walk out of the interview room. I'm not interested in working with people who think that asking silly riddles in an interrview is more useful than asking me to write some code or demonstrate how I would plan a project or lead a team.
