How would I find a book in a large library? - algorithm

I found the following question while preparing for an interview:
You are in a very huge library that
has no computer access, and you're
looking for one particular book.
You look up where the book is supposed to
be in the card catalog, and go to
shelf X to find it.
However the book is not there.
There is only one person that can
answer questions, which is the
librarian, but he only answers yes/no
responses. Plus, his answers might not
be correct.
What is your strategy for finding this
book?
How would you answer this question? What methods of searching would you use?

Use binary-search-style questions to narrow down the location of the book.
Each question should cut the search space in half.
"Is the book in this half of the library?" (pointing in the appropriate direction) would work as an initial question.
You can also use The Knight and the Knave as part of your method of questioning the person. Your first 5 questions (to establish a baseline) could be about things you 'know'. You could determine his error rate from there. After that, you can use Binary Search-esque questions to determine where the book is.
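A minimal sketch of the halving idea, assuming for the moment that the librarian answers truthfully and that the shelves can be numbered 0..N-1. The names `find_shelf` and `is_in_left_half` are made up for illustration:

```python
from typing import Callable, Tuple


def find_shelf(num_shelves: int,
               is_in_left_half: Callable[[int, int], bool]) -> Tuple[int, int]:
    """Locate a shelf in [0, num_shelves) with yes/no questions.

    is_in_left_half(lo, mid) answers "is the book on a shelf in [lo, mid)?".
    Returns (shelf_index, questions_asked).
    """
    lo, hi = 0, num_shelves
    questions = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        questions += 1
        if is_in_left_half(lo, mid):
            hi = mid   # book is in the left half
        else:
            lo = mid   # book is in the right half
    return lo, questions


# Hypothetical honest librarian who knows the book sits on shelf 777.
book_shelf = 777
shelf, asked = find_shelf(1024, lambda lo, mid: lo <= book_shelf < mid)
print(shelf, asked)  # 777 10
```

With 1024 shelves this takes 10 questions, i.e. log2 of the number of shelves. Dealing with wrong answers needs extra work on top of this.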

Ask the interviewer for more information about the librarian and go from there. In particular, find out if he's susceptible to bribery (I mean the librarian, but come to think of it this might go for the interviewer as well).
Double-check for dumb mistakes (wrong card, wrong shelf, "661-88" is really "88-199" and so on).
Search the drawer of borrowed-book cards. If it's been borrowed, note the due date and come back later, or note the borrower's home address and go to plan B.
Look in the vicinity, a few books in either direction and the shelves above and below, in case it was incorrectly reshelved.
Check the tables, floors, photocopiers and return carts.
Look for a gap on the shelf. If there is a gap in the right spot then at least you know you're looking in the right place. If there's no gap then look for a book on that shelf that doesn't belong-- somebody may have swapped them by mistake. If there's no such misplaced book then maybe the book was never on this shelf, see below.
Look for dust on the shelf. It might indicate whether a book has been removed within the past month. Likewise check the index card for signs of age. The flowchart gets a little complicated, but the book may have been lost years ago.
Check the index system: if the book doesn't have the right number for its subject/title/author/whatever, then there is a typo on the index card and you must calculate the correct number yourself to find out where the book really is.
Just go out and buy the damned book, your time is more valuable than this.

Step A: Calibrate your Librarian.
Pick a random book in the library, walk to a random spot and then ask the Librarian if the book (whose location you know) is to your left. Keep testing the Librarian until you have a good estimate of the probability, p, that the Librarian answers correctly. Note that if p < 0.5 then you are better off following the opposite of whatever the Librarian tells you. If p=0.5 then give up on the Librarian -- her responses are no better than a flip of a coin.
If you find that p depends on the question asked (for example, if the Librarian always answers certain questions correctly, but other questions always falsely), then go to Step B1.
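Step A could be simulated roughly like this. The librarian here is a made-up model that answers each question correctly with a fixed probability, independently of the question; `make_librarian` and `estimate_accuracy` are hypothetical names:

```python
import random
from typing import Callable, Sequence


def make_librarian(p_correct: float, rng: random.Random) -> Callable[[bool], bool]:
    """Simulated librarian (an assumption): right with probability p_correct."""
    def ask(truth: bool) -> bool:
        return truth if rng.random() < p_correct else not truth
    return ask


def estimate_accuracy(ask: Callable[[bool], bool], truths: Sequence[bool]) -> float:
    """Fraction of calibration questions (with known answers) answered correctly."""
    correct = sum(1 for truth in truths if ask(truth) == truth)
    return correct / len(truths)


rng = random.Random(0)
# 2000 calibration questions whose true answers we already know.
truths = [rng.random() < 0.5 for _ in range(2000)]
p_hat = estimate_accuracy(make_librarian(0.8, rng), truths)
follow_opposite = p_hat < 0.5  # if True, invert every answer from now on
print(round(p_hat, 2), follow_opposite)
```

In a real library you would need far fewer questions than this if p is clearly away from 0.5; the 2000 is only there to make the simulated estimate stable.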
Step B1:
If p==0.5 or p depends on the question asked, start thinking outside the box, like Beta suggests.
Step B2:
If p < 0.5, reverse the answer the Librarian gives, and proceed to Step B3.
Step B3:
If p > 0.5: Choose N. If p is close to 1, then N can be a low number like 10. If p is very close to 0.5, then choose N large, like 1000. The right value of N depends on p and how confident you wish to be.
Ask the Librarian the same question N times ("Is the book I'm looking for to my left").
Assume for the moment that whatever response is given more frequently is the "correct answer". Calculate the average response, assigning 1 for the "correct answer" and 0 for the wrong answer. Call this the "observed average".
The responses are like draws from a box with 2 tickets (the right answer and the wrong answer). The standard deviation of the box is sqrt(pq), where q = 1-p.
The standard error of the average of N draws is sqrt(pq/N).
Take the null hypothesis to be that p=0.5 -- that the Librarian is simply giving random responses. The "expected average" (assuming the null hypothesis) is 1/2.
The z-statistic is
(observed average - expected average)/(standard error of the average) =
(observed average - 0.5)*sqrt(N)/(sqrt(p*q))
where p and q take their values under the null hypothesis (p = q = 0.5, so sqrt(p*q) = 0.5).
For large N the z-statistic approximately follows a standard normal distribution under the null hypothesis. If the z-statistic is > 1.65, the result is statistically significant at roughly the 5% level (one-tailed). If after N questions z is less than 1.65, repeat Step B3 until you get a statistically significant response. Note that for p > 0.5, the larger you choose N, the larger the z-statistic tends to be, and the easier it will be to obtain statistically significant results.
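As a rough sketch, the z-statistic from Step B3 can be computed like this. It follows the procedure above (take the majority response as the "correct answer", then test against p = 0.5); the function names are made up. Note that picking the majority first means the observed average is never below 0.5, so the 1.65 cutoff is more lenient than it looks:

```python
import math


def z_statistic(yes_count: int, n: int) -> float:
    """z-score of the majority fraction against the null hypothesis p = 0.5."""
    majority = max(yes_count, n - yes_count)
    observed_average = majority / n
    standard_error = math.sqrt(0.5 * 0.5 / n)  # sqrt(pq/N) with p = q = 0.5
    return (observed_average - 0.5) / standard_error


def is_significant(yes_count: int, n: int, threshold: float = 1.65) -> bool:
    return z_statistic(yes_count, n) > threshold


# Example: 60 "yes" out of 100 repetitions gives z of about 2.0.
print(z_statistic(60, 100), is_significant(60, 100))
# 53 "yes" out of 100 gives z of about 0.6: keep asking.
print(z_statistic(53, 100), is_significant(53, 100))
```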
Step C:
Once you get a statistically significant response, you act upon it (using George Stocker's binary search idea) and hope you have not been statistically unlucky. :)
PS. Although the library might be 3-dimensional, you could play the Binary Search game along the x-axis, then the y-axis, then the z-axis. So the 3-dimensional problem can be reduced to solving 3 (1-dimensional problems).
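Putting Steps B3 and C together along one axis might look like the sketch below. It assumes a simulated librarian who is right with a fixed probability on every question, and it simply takes the majority of an odd number of repetitions instead of running the full z-test; all names are hypothetical:

```python
import random


def ask_repeatedly(truth: bool, p_correct: float, n: int, rng: random.Random) -> bool:
    """Ask the same yes/no question n times (n odd); return the majority answer."""
    yes_votes = sum(
        (truth if rng.random() < p_correct else not truth) for _ in range(n)
    )
    return 2 * yes_votes > n


def noisy_search(num_shelves: int, book_shelf: int, p_correct: float,
                 n: int, rng: random.Random) -> int:
    """Binary search where every halving question is settled by majority vote."""
    lo, hi = 0, num_shelves
    while hi - lo > 1:
        mid = (lo + hi) // 2
        truth = lo <= book_shelf < mid  # what an honest librarian would say
        if ask_repeatedly(truth, p_correct, n, rng):
            hi = mid
        else:
            lo = mid
    return lo


found = noisy_search(1024, 777, p_correct=0.9, n=51, rng=random.Random(1))
print(found)
```

For a 3-dimensional library you would run the same loop once per axis.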

here's a starting point: Assume the library uses the Dewey decimal system (but any classification system could be substituted).
Question 1: is the book in the 100s?
Question 2: is the book in the 200s?
..
is the book between 50 and 150?
is the book between 150 and 250?

Depends on who you are interviewing for:
Government (non-law enforcement/military) - hire infinite number of staff to check every location in library. Then hire an infinite number of junior managers to manage those staff, add an infinite number of middle managers etc.
Large corporation - same but use unpaid interns.
Government (law enforcement/military) - take librarian, apply taser or waterboarding until location of book is revealed.
Small company (web 2.0 startup) - blog about location of book until somebody tells you.
Small company (real business) - try another library / bookstore.

Is it cheating to ask if the librarian takes commands? If he does, simply tell him to find the book and bring it back to you.

How would you answer this question?
"Thank you for your time." And I'd get up and walk out of the interview room. I'm not interested in working with people who think that asking silly riddles in an interview is more useful than asking me to write some code or demonstrate how I would plan a project or lead a team.


Can two groups of N people find each other around a circle? [closed]

This is an algorithmic problem and I'm not sure it has a solution. I think it's a specific case of a more generic computer science problem that has no solution but I'd rather not disclose which one to avoid planting biases. It came up from a real life situation in which mobile phones were out of credit and thus, we didn't have long range communications.
Two groups of people, each with 2 people (but it might be true for N people) arranged to meet at the center of a park but at the time of meeting, the park is closed. Now, they'll have to meet somewhere else around the park. Is there an algorithm each and every single individual could follow to converge all in one point?
For example, if each group splits in two and goes around and when they find another person keep on going with that person, they would all converge on the other side of the park. But if the other group does the same, then, they wouldn't be able to take the found members of the other group with them. This is not a possible solution.
I'm not sure if I explained well enough. I can try to draw a diagram.
Deterministic Solution for N > 1, K > 1
For N groups of K people each.
Since the problem is based on people whose mobile phones are out of credit, let's assume that each person in each group has their own phone. If that's not acceptable, then substitute the phone with a credit card, social security, driver's license, or any other item with numerical identification that is guaranteed to be unique.
In each group, each person must remember the highest number among that group, and the person with the highest number (labeled leader) must travel clockwise around the perimeter while the rest of the group stays put.
After the leader of each group meets the next group, they compare their number with the group's previous leader number.
If the leader's number is higher than the group's previous leader's number, then the leader and the group all continue along the perimeter of the park. If the group's previous leader's number is higher, then they all stay put.
Eventually the leader with the highest number will continue around the entire perimeter exactly 1 rotation, collecting the entire group.
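Here is a small simulation of the steps above, as a sketch under some simplifying assumptions that are not in the answer: the groups sit at equally spaced stops around the perimeter, all leaders walk one stop per tick, every group has at least one follower who stays behind, and all numbers are unique. `gather` is a made-up name:

```python
from typing import Dict, List, Tuple


def gather(leader_numbers: List[int], followers_per_group: int) -> Tuple[int, int]:
    """Return (meeting_stop, people_gathered) for groups placed clockwise.

    leader_numbers[i] is the highest number in the group starting at stop i.
    """
    n = len(leader_numbers)
    # Stationary camps: stop -> [remembered leader number, head count].
    camps: Dict[int, List[int]] = {
        stop: [number, followers_per_group]
        for stop, number in enumerate(leader_numbers)
    }
    # Walking parties: (stop, leader number, head count).
    parties = [(stop, number, 1) for stop, number in enumerate(leader_numbers)]
    while parties:
        still_walking = []
        for stop, number, size in parties:
            stop = (stop + 1) % n              # one stop clockwise
            camp = camps.get(stop)
            if camp is None:                   # nobody here, keep walking
                still_walking.append((stop, number, size))
            elif number > camp[0]:             # leader wins: camp joins the walk
                del camps[stop]
                still_walking.append((stop, number, size + camp[1]))
            else:                              # camp wins (or it is home): stay
                camp[1] += size
        parties = still_walking
    assert len(camps) == 1
    meeting_stop, (_, headcount) = next(iter(camps.items()))
    return meeting_stop, headcount


# Four groups of three (one leader plus two followers each).
print(gather([5, 9, 2, 7], followers_per_group=2))  # (1, 12)
```

Everyone ends up at the starting point of the group with the highest number, after that leader has made exactly one lap.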
Deterministic solution for N > 1, K = 1 (with one reasonable assumption of knowledge ahead-of-time)
In this case, each group only contains one person. Let's assume that the number used is a phone number, because it is then reasonable to also assume that at least one pair of people will know each other's numbers and so one of them will stay put.
For N = 2, this becomes trivially reduced to one person staying put and the other person going around clockwise.
For other cases, the fact that at least two people will initially know each other's numbers will effectively increase the maximum K to at least 2 (because the person or people who stay put will continue to stay put if the person they know has a higher number than the leader who shows up to meet them), but we still have to introduce one more step to the algorithm to make sure it will terminate.
The extra step is that if a leader has continued around the perimeter for exactly one rotation without adding anyone to the group, then the leader must leave their group behind and start over for one more rotation around the perimeter. This means that a leader with no group will continue indefinitely until they find someone else, which is good.
With this extra step, it is easy to see why we have to assume that at least one pair of people need to know each other's phone numbers ahead of time, because then we can guarantee that the person who stays put will eventually accumulate the entire group.
Feel free to leave comments or suggestions to improve the algorithm I've laid out or challenge me if you think I missed an edge case. If not, then I hope you liked my answer.
Update
For fun, I decided to write a visual demo of my solutions to the problem using d3. Feel free to play around with the parameters and restart the simulation with any initial state. Here's the link:
https://jsfiddle.net/patrob10114/c3d478ty/show/
Key
black - leader
white - follower
when clicked
blue - selected person
green - known by selected person
red - unknown by selected person
Note that collaboration occurs at the start of every step, so if two groups just combined in the current step, most people won't know the people from the opposite group until after the next step is invoked.
They should move towards the northernmost point of the park.
I'd send both groups in a random direction. If they went a half circle without meeting the other group, rerandomize the directions. This will make them meet in a few rounds most of the time, however there is an infinitely small chance that they still never meet.
It is not possible with a deterministic algorithm if
• we have to meet at some point on the perimeter,
• we are unable to distinguish points on the perimeter (or the algorithm is not allowed to use such a distinction),
• we are unable to distinguish individuals in the groups (or the algorithm is not allowed to use such a distinction),
• the perimeter is circular (see below for a more general case),
• we all follow the same algorithm, and
• the initial points may be anywhere on the perimeter.
Proof: With a deterministic algorithm we can deduce the final positions from the initial positions, but the groups could start evenly spaced around the perimeter, in which case the problem has rotational symmetry and so the solution will be unchanged by a 1/n rotation, which however has no fixed point on the perimeter.
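The argument can be spelled out a bit more formally. This is only a sketch under the assumptions listed above, and the symbols $\theta_i$, $F$ and $\rho$ are notation introduced here, not part of the original argument:

```latex
Let the $n \ge 2$ groups start at angles
$\theta_i = \theta_0 + \tfrac{2\pi i}{n}$, $i = 0, \dots, n-1$,
and let $\rho(\theta) = \theta + \tfrac{2\pi}{n} \pmod{2\pi}$ be the $1/n$ rotation.

A deterministic algorithm determines the meeting point from the start points:
$m = F(\{\theta_0, \dots, \theta_{n-1}\})$.

Since points on the perimeter are indistinguishable and everyone runs the same
algorithm, rotating the input rotates the output:
\[
  F(\{\rho(\theta_i)\}) = \rho\bigl(F(\{\theta_i\})\bigr).
\]

Since individuals are indistinguishable, $F$ depends only on the set of start
points, and for evenly spaced starts $\{\rho(\theta_i)\} = \{\theta_i\}$. Hence
\[
  m = F(\{\theta_i\}) = F(\{\rho(\theta_i)\}) = \rho(m)
    = m + \tfrac{2\pi}{n} \pmod{2\pi},
\]
which is impossible for $n \ge 2$, so no such $m$ exists.
```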
Status of assumptions
Dropping various assumptions leads, as others have observed, to various solutions:
Non-deterministic: As others have observed, various non-deterministic algorithms do provide a solution whose probability of termination tends to certainty as time tends to infinity; I suspect almost any random walk would do. (Many answers)
Points indistinguishable: Agree on a fixed point at which to meet if needed: flyx’s answer.
Individuals indistinguishable: If there is a perfect hash algorithm, choose those with the lowest hash to collect others: Patrick Roberts’s solution.
Same algorithm: Choose one in advance to collect the others (adapting Patrick Roberts’s solution).
Other assumptions can be weakened:
Non-circular perimeter: The condition that the perimeter be circular is rather artificial, but if the perimeter is topologically equivalent to a circle, this equivalence can be used to convert any solution to a solution to the circle problem.
Unrestricted initial points: Even if the initial points cannot be evenly spaced, as long as some points are distinct, a topological equivalence (as for a non-circular perimeter) reduces a solution to a solution to the circular case, showing that no solution can exist.
I think this question really belongs on Computer Science Stack Exchange.
This question depends heavily on what operations we have and on what you consider the environment to look like. I asked you these questions and got no reply, so here is my interpretation:
The park is a 2D space, the 2 groups are located randomly, and each group has the same sense of right/left (both are facing the park). Both have the same operations and are programmed to do absolutely the same things (nothing like "I go right and you go left", because that makes the problem trivial). So the operations are: go right/left/stop for x units of time. They can also detect that they have passed through their original position (the one in which they started). And they can be programmed in a loop.
If you are able to use randomness, everything is simple, and you can come up with many solutions. For example: with probability 0.5, each of them decides either to do 3 steps right and wait, or to do one step right and wait. If they repeat this operation in a loop and select different options, then clearly they will meet (one is faster than the other, so he will catch up with the slower one). If they both select the same operation, then they will make a full circle and both reach their starting positions. In this case, roll the dice one more time. After N circles the probability that they will have met is 1 - 0.5^N (which approaches 1 very fast).
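A quick sanity check of that probability, as a sketch: each round both groups independently pick "fast" or "slow", and they meet in the first round where the picks differ. The function names are made up:

```python
import random


def meet_probability(rounds: int) -> float:
    """Chance of having met within `rounds` rounds: 1 - 0.5^rounds."""
    return 1 - 0.5 ** rounds


def simulate(rounds: int, trials: int, rng: random.Random) -> float:
    """Monte Carlo estimate of the same probability."""
    met = 0
    for _ in range(trials):
        for _ in range(rounds):
            mine = rng.choice(("fast", "slow"))
            theirs = rng.choice(("fast", "slow"))
            if mine != theirs:   # different speeds: the faster one catches up
                met += 1
                break
    return met / trials


print(meet_probability(3))                      # 0.875
print(simulate(3, 20000, random.Random(42)))    # close to 0.875
```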
Surprisingly, there is a way to do it! But first we have to define our terms and assumptions.
We have N=2 "teams" of K=2 "agents" apiece. Each "agent" is running the same program. They can't tell north from south, but they can tell clockwise from counterclockwise. Agents in the same place can talk to each other; agents in different places can't.
Your suggested partial answer was: "If each group splits in two and goes around and when they find another person keep on going with that person, they would all converge on the other side of the park..." This implies that our agents have some (magic, axiomatic) face-to-face decision protocol, such that if Alice and Bob are on the same team and wake up at the same point on the circle, they can (magically, axiomatically) decide amongst themselves that Alice will head clockwise and Bob will head counterclockwise (as opposed to Alice and Bob always heading in exactly the same direction because by definition they react exactly the same way to the situation they're identically in).
One way to implement this magic decision protocol is to give each agent a personal random number generator. Whenever 2 or more agents are gathered at a certain point, they all roll a million-sided die, and whichever one rolls highest is acknowledged as the leader. So in your partial solution, Alice and Bob could each roll: whoever rolls higher (the "leader") goes clockwise and sends the other agent (the "follower") counterclockwise.
Okay, having solved the "how do our agents make decisions" issue, let's solve the actual puzzle!
Suppose our teams are (Alice and Bob) and (Carl and Dave). Alice and Carl are the initially elected leaders.
Step 1: Each team rolls a million-sided die to generate a random number. The semantics of this number are "The team with the higher number is the Master Team," but of course neither team knows right now who's got the higher number. But Alice and Bob both know that their number is let's say 424202, and Carl and Dave both know that their number is 373287.
Step 2: Each team sends its leader around the circle clockwise, while the follower stays stationary. Each leader stops moving when he gets to where the other team's follower is waiting. So now at one point on the circle we have Alice and Dave, and at the other point we have Carl and Bob.
Step 3: Alice and Dave compare numbers and realize that Alice's team is the Master Team. Likewise, Bob and Carl compare numbers and realize that Bob's team is the Master Team.
Step 4: Alice being the leader of the Master Team, she takes Dave with her clockwise around the circle. Bob and Carl (being a follower and a leader of a non-master team respectively) just stay put. When Alice and Dave reach Bob and Carl, the problem is solved!
Notice that Step 1 requires that both teams roll a million-sided die in isolation; if during Step 3 everyone realizes that there was a tie, they'll just have to backtrack and try again. Therefore this solution is still probabilistic... but you can make the chance of a tie (and hence of a retry) arbitrarily small by just replacing everyone's million-sided dice with trillion-sided, quintillion-sided, bazillion-sided... dice.
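The four steps can be traced with a toy simulation. This is a sketch only: the two starting points are labelled "P" and "Q" (labels invented here), and walking is replaced by simply updating who stands where:

```python
import random
from typing import Dict, Tuple


def rendezvous(rng: random.Random, sides: int = 1_000_000) -> Tuple[Dict[str, str], int]:
    """Return (final location of each agent, number of die-rolling rounds)."""
    rounds = 0
    while True:
        # Step 1: each team rolls in isolation; a tie forces a retry.
        rounds += 1
        roll_ab = rng.randint(1, sides)  # Alice and Bob, starting at P
        roll_cd = rng.randint(1, sides)  # Carl and Dave, starting at Q
        if roll_ab != roll_cd:
            break
    # Step 2: leaders walk clockwise to the other team's follower.
    location = {"Alice": "Q", "Bob": "P", "Carl": "P", "Dave": "Q"}
    # Step 3: each pair compares numbers and learns which team is the Master Team.
    if roll_ab > roll_cd:
        master_leader, master_home = "Alice", "P"
    else:
        master_leader, master_home = "Carl", "Q"
    # Step 4: the master leader takes whoever is with them back to the point
    # where the other two are waiting.
    here = location[master_leader]
    for agent, where in location.items():
        if where == here:
            location[agent] = master_home
    return location, rounds


final, rounds = rendezvous(random.Random(3))
print(final, rounds)
```

Running it with a tiny die, e.g. `sides=2`, shows the retry loop kicking in when both teams roll the same number.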
The general strategy here is to impose a pecking order on all N×K agents, and then bounce them around the circle until everyone is aware of the pecking order; then the top pecker can just sweep around the circle and pick everyone up.
Imposing a pecking order can be done by using the agents' personal random number generators.
The protocol for K>2 agents per team is identical to the K=2 case: you just glom all the followers together in Step 1. Alice (the leader) goes clockwise while Bobneric (the followers) stay still; and so on.
The protocol for K=1 agents per team is... well, it's impossible, because no matter what you do, you can't deterministically ensure that anyone will ever encounter another agent. You need a way for the agents to ensure, without communicating at all, that they won't all just circle clockwise around the park forever.
One thing that would help with (but not technically solve) the K=1 case would be to consider the relative speeds of the agents. You might be familiar with Floyd's "Tortoise and Hare" algorithm for finding a loop in a linked list. Well, if the agents are allowed to move at non-identical speeds, then you could certainly do a "continuous, multi-hare" version of that algorithm:
Step 1: Each agent rolls a million-sided die to generate a random number S, and starts running clockwise around the park at speed S.
Step 2: Whenever one agent catches up to another, both agents glom together and start running clockwise at a new random speed.
Step 3: Eventually, assuming that nobody picked exactly the same random speeds, everyone will have met up.
This protocol requires that Alice and Carl not roll identical numbers on their million-sided dice even when they are across the park from each other. IMHO, this is a very different assumption from the other protocol's assuming that Alice and Bob could roll different numbers on their million-sided dice when they were in the same place. With K=1, we're never guaranteed that two agents will ever be in the same place.
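For what it's worth, here is an event-driven sketch of the "multi-hare" idea. Everything in it is an assumption for illustration: the perimeter has length 1, speeds are in laps per unit of time, and a merged group keeps the catcher's position and rolls a new speed. If everybody happens to roll the same speed it just reports infinity:

```python
import math
import random
from typing import List


def time_to_catch(pos_a: float, speed_a: float, pos_b: float, speed_b: float,
                  circumference: float = 1.0) -> float:
    """Time until runner a reaches runner b, both running clockwise."""
    gap = (pos_b - pos_a) % circumference
    return gap / (speed_a - speed_b) if speed_a > speed_b else math.inf


def gather_time(positions: List[float], rng: random.Random,
                circumference: float = 1.0, sides: int = 1_000_000) -> float:
    """Total time until all agents have glommed into one group."""
    groups = [[p, rng.randint(1, sides)] for p in positions]  # [position, speed]
    elapsed = 0.0
    while len(groups) > 1:
        # Find the earliest catch-up over all ordered pairs (chaser, chased).
        best_t, chaser, chased = math.inf, -1, -1
        for i, (pa, sa) in enumerate(groups):
            for j, (pb, sb) in enumerate(groups):
                if i != j:
                    t = time_to_catch(pa, sa, pb, sb, circumference)
                    if t < best_t:
                        best_t, chaser, chased = t, i, j
        if math.isinf(best_t):
            return math.inf  # everyone runs at the same speed forever
        elapsed += best_t
        for g in groups:
            g[0] = (g[0] + g[1] * best_t) % circumference
        groups[chaser][1] = rng.randint(1, sides)  # merged group re-rolls speed
        del groups[chased]
    return elapsed


print(time_to_catch(0.0, 3, 0.5, 1))  # 0.25
print(gather_time([0.0, 0.25, 0.5, 0.75], random.Random(7)))
```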
Anyway, I hope this helps. The solution for N>2 teams is left as an exercise for the reader, but my intuition is that it'll be easy to reduce the N>2 case to the N=2 case.
Each group sends out a scout while the remaining group members stay put. Each group remembers the name of its scout. The scouts circle around clockwise, and whenever a scout meets a group, they compare the names of their scouts:
If the scout's name is earlier alphabetically: the group follows him.
If the scout's name is later: he joins the group and gives up his initial group identity.
By the time the lowest-named scout makes it back to his starting location, everyone who hasn't stopped at his initial location should be following him.
There are some solutions here that to me are unsatisfactory since they require the two teams to agree a strategy in advance and all follow the same deterministic or probabilistic rules. If you had the opportunity to agree in advance what rules you're all going to follow, then as flyx points out you could just have agreed a backup meeting point. Restrictions that prevent the advance choice of a particular place or a particular leader are standard in the context of some problems with computer networks but distinctly un-natural for four friends planning to meet up. Therefore I will frame a strategy from the POV of only one team, assuming that there has been no prior discussion of the scenario between the two teams.
Note that it is not possible to be robust in the face of any strategy from the other team. The other team can always force a stalemate simply by adopting some pattern of movement that ensures those two will never meet again.
One of you sets out walking around the park. The other stands still, let us say at position X. This ensures that: (a) you will meet each other periodically at X, let us say every T seconds; and (b) for each member of the other team, no matter how they move around the perimeter of the park they must encounter at least one of your team at least every T seconds.
Now you have communication among all members of both groups, and (given sufficient time and passing-on of messages from one person to another) the problem resolves to the same problem as if your mobile phones were working. Choosing a leader by random number is one way to solve it as others have suggested. Note that there are still two issues: the first is a two-generals problem with communication, and I suppose you might feel that a mobile phone conversation allows for the generation of common knowledge whereas these relayed notes do not. The second is the possibility that the other team refuses to co-operate and you cannot agree a meeting point no matter what.
Notwithstanding the above problems, the question supposes that if they had communication that the groups would be able to agree a meeting-point. You have communication: agree a meeting point!
As for how to agree a meeting point, I think it requires some appeal to reason or good intention on the part of the other team. If they are due to meet again, then they will be very reluctant to take any action that results in them breaking their commitment to their partner. Therefore suggest to them both that after their next meeting, when all commitments can be forgiven, they proceed together to X by the shortest route. Listen to their counter-proposal and try to find some common solution.
To help reach a solution, you could pre-agree with your team-mate some variations you'd be willing to make to your plan, provided that they remain within some restrictions that ensure you will meet your team-mate again. For example, if the stationary team-mate agrees that they could be persuaded to set out clockwise, and the moving team-mate sets out anti-clockwise and agrees that they can be persuaded to do something different but not to cross point X in a clockwise direction, then you're guaranteed to meet again and so you can accept certain suggestions from the other team.
Just as an example, if a team following this strategy meets a team (unwisely) following your strategy, then one of my team will agree to go along with the one of your team they meet, and the other will refuse (since it would require them to make the forbidden movement above). This means that when your team meet together, they'll have one of my team with them for a group of three. The loose member of my team is on a collision course with that group of three provided your team doesn't do anything perverse.
I think forming any group of three is a win, so each member should do anything they can to attend a meeting of the other team, subject to the constraints they agreed to guarantee they'll meet up with their own team member again. A group of 3, once formed, should follow whatever agreement is in place to meet the loose member (and if the team of two contained within that 3 refuses to do this then they're saboteurs, there is no good reason for them to refuse). Within these restrictions, any kind of symmetry-breaking will allow the team following these principles to persuade/follow the other team into a 3-way and then a 4-way meeting.
In general some symmetry-breaking is required, if only because both teams might be following my strategy and therefore both have a stationary member at different points.
Assume the park is a circle. (for the sake of clarity)
Group A
Person A.1
Person A.2
Group B
Person B.1
Person B.2
We (group A) are currently at the bottom of the circle (90 degrees). We agree to go towards 0 degrees in opposite directions. I'm person A.1 and I go clockwise. I send Person A.2. counterclockwise.
In any possible scenario (B splits, B doesn't split, B has the same scheme, B has some elaborate scheme), each group might have conflicting information. So unless Group A has a gun to force Group B into submission, the new groups might make conflicting choices upon meeting.
Say for instance, A.1. meets B.1, and A.2. meets B.2. What do we (A.1 and B.1) do if B has the same scheme? Since the new groups can't know what the other group decides (whether to go with A's scheme, or B's scheme), each group might make a different decision.
And we'll end up where we started... (i.e. two people at 0 degrees, and two people at 90 degrees). Let's call this checkpoint "First Iteration".
We might account for this and say that we'll come up with a scheme for the "Second Iteration". But then the same thing happens again. And for the third iteration, fourth iteration, ad infinitum.
Each iteration has a 50% chance of not working out.
Which means that after x iterations, your chances of still not having met up at a common point are 0.5^x (so your chances of having met are 1-(0.5^x)).
N.B. I thought about a bunch of scenarios, such as Group A agreeing to come back to their initial point, and communicating with each other what Group B plans to do. But no cigar, turns out even with very clever schemes the conflicting information issue always arises.
An interesting problem indeed. I'd like to suggest my version of the solution:
0: Every group picks a leader.
1: Leader and followers go in opposite directions.
2: They meet other groups' leaders or followers.
3: They keep going in the same direction as before, for another 90 degrees.
4: By this time, all groups have made a half-circle around the perimeter, and have invariably met leaders again, theirs or others'.
5: All leaders change the direction of the next step to that of the followers around them, and order them to follow.
6: Units from all groups meet at one point.
Refer to the attached file for an in-depth explanation. You will need MS Office Powerpoint 2007 or newer to view it. In case you don't have one, use pptx. viewer (Powerpoint viewer) as a free alternative.
Animated Solution (.pptx)
EDIT: I made a typo in the first slide. It reads "Yellow and red are selected", while it must be "Blue and red" instead.
Each group will split in two parts, and each part will go around the circle in the opposite direction (clockwise and counterclockwise).
Before they start, each group chooses some kind of random number (from a range large enough that two groups are effectively never going to pick the same number, like a GUID, a globally unique identifier, in computer science). So there is one unique number per group.
If people of the same group meet first (the two parts meet), they are alone, so probably the other groups (if any) gave up.
If two groups meet, they follow the rule that says the biggest number leads the way. So when they meet, they continue in the direction that the people with the biggest number were going.
At the end, the direction of the biggest number will lead them all to one point.
If they have no computer to choose this number, each group could use the full names of the people of the group merged together.
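In code, the group number and the "biggest number leads" rule could look something like the sketch below. It uses Python's standard `uuid` module for the GUID; the helper names are made up, and sorting the names before merging them is an extra assumption so that every member of a group derives the same string:

```python
import uuid
from typing import Iterable, List


def new_group_id() -> uuid.UUID:
    """One random 128-bit identifier per group."""
    return uuid.uuid4()


def leading_group(ids: Iterable[uuid.UUID]) -> uuid.UUID:
    """The group whose identifier is numerically biggest leads the way."""
    return max(ids, key=lambda group_id: group_id.int)


def name_based_id(full_names: List[str]) -> str:
    """Fallback without a computer: merge the members' full names."""
    return "".join(sorted(full_names))


a, b = new_group_id(), new_group_id()
print(leading_group([a, b]) in (a, b))                # True
print(name_based_id(["Bob Jones", "Alice Smith"]))    # Alice SmithBob Jones
```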
Edit: sorry, I just saw that this is very close to Patrick Roberts' solution.
Another edit: what if each group has its own deterministic strategy?
In the solution above, all works well if all the groups have the same strategy. But in a real-life problem this is not the case (as they can't communicate).
If one group has a deterministic strategy and the others have none, they can agree to follow the deterministic approach and all is OK.
But what if two groups have different deterministic approaches (for instance, the same as above, but one group follows the biggest number and the other group follows the lowest number)?
Is there a solution to that?

What is Youtube comment system sorting / ranking algorithm? [closed]

YouTube provides two sorting options: Newest first and Top comments. "Newest first" is pretty simple: we just sort the comments by their post date. But "Top comments" seems to be a lot more complex than just sorting by "thumb up"s.
After a little research, I found out that the order of comments depends on these things:
Number of "thumb up"s and "thumb down"s
Post date
Number of replies to that comment
But I don't know how Youtube uses this information to decide the order, like what information is more important and what is less important.
Is there any article about this topic that I could refer to?
Thanks!
I have the answer to your question.
After searching the internet for the answer to this, I never found precisely what I was looking for. So, my colleagues and I decided to experiment with the YouTube comment system ourselves.
First of all, we sorted what we believed to be popular videos into one section, average videos into another, and less popular into the last. There were 200 videos in each section, and after days of examining we started to notice a pattern. We found that you were right about the three things required, but we also dove a little deeper and found an additional variable.
The Youtube comment system depends on four things:
1) Time it was posted,
2) Like/dislike ratio of a comment,
3) Number of replies,
4) And, believe it or not, WHO posted it.
The average like/dislike ratio of every public comment you've ever posted feeds into it, as (we surmised) they believe that those with low like/dislike ratios tend to post comments that many people do not like or simply disagree with.
There is an algorithm to it, and it is much simpler than you might think. Basically there are these things that we called "module points," and a comment gets a certain number of them based on these four factors. First, here are the things you need to know about module point conversion for TWO of the factors:
For the like/dislike ratio on the comment, multiply that number by ten.
For each reply (NOT from the original poster) that the comment has, there are two module points.
These are the two basic factors that tell the amount of module points the comment has.
For example, if a comment had 27 likes and 8 dislikes, then the ratio would be 3.375. Multiplying by 10, you would then have 33.75 module points. For the next factor, the number of replies, let's say this comment has 4 direct replies. Multiplying 2 by 4, we get 8. Adding 8 to the running total gives 41.75 module points.
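The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch of the rule as the answer describes it, not anything published by YouTube; the function name `module_points` and the handling of zero dislikes are assumptions made for illustration.

```python
def module_points(likes, dislikes, replies):
    """Base module points as described in the answer (a reverse-engineered guess)."""
    # Assumption: the answer does not say what happens with 0 dislikes,
    # so this sketch refuses to guess instead of inventing a rule.
    if dislikes == 0:
        raise ValueError("like/dislike ratio is undefined for 0 dislikes")
    ratio = likes / dislikes
    # 10 points per unit of ratio, plus 2 points per direct reply.
    return ratio * 10 + replies * 2


print(module_points(27, 8, 4))  # 41.75
```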
But we're not done here; this is where it gets tricky.
Using the average like/dislike ratio of a person's total comments that they've ever posted publicly, we found that the formula added onto the accumulative module points is this:
C = MP(R/3) + (MP/10)
where C = Comment Position Variable; MP = Module Points; R = Person's total like/dislike ratio
Trust me, we spent DAYS just on this part, which was probably the most frustrating. Even though the 3 and the 10 in this equation seem random and unnecessary, every comment we tested this equation on passed the test, and none passed when those two constants were removed. The result of this equation is a number that we named the Position Variable.
However, we are not done yet; we still haven't talked about time.
I was actually quite surprised that this part didn't take as long as I expected, but it sure was a pain doing this equation every single time for every comment we tested. At first, when testing it, we figured that the time was just there to break ties when 2 comments had equal Position Variables.
In fact, I almost called it a wrap on the experiment at that point, but upon further inspection, we found out there was more to do. Some comments with the same Position Variable outranked each other, but the timing seemed to be random! After a few days of inspection, here is where the final result comes in:
There is yet ANOTHER equation that we must find before applying the 4th variable. Using another separate equation, here's what our algebraic deductions came down to:
X = 1/3(S/10 + A) × |A - 3S|
where X = Timing Variable; S = How long ago the video was posted in minutes; A = How long ago the comment was posted in minutes
I wish I were making this up, but unfortunately this is how complicated the system is. There are mathematical reasons behind the other variables, but they are far too complex to explain here; it would probably take at least three paragraphs. We tested this equation on more than 150 comments, and all of them checked out.
Once you find X, which is what we called the Timing Variable, all you have to do from here is apply it to this equation:
N = X(C/4 + 1)
where X = Timing Variable; C = Position Variable
N is the answer to all your problems.
This is the final equation, the final answer. The simple conclusion: the higher N, the higher up the comment is.
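Putting the three equations together, a rough Python sketch might look like the following. It only transcribes the formulas given in this answer, which are the answerer's own experimental guess rather than a documented YouTube algorithm. The function names are made up for illustration, and the timing formula is read here as (1/3) × (S/10 + A) × |A - 3S|, which is an assumption about how the plain-text equation should be parsed.

```python
def position_variable(mp, author_ratio):
    # C = MP(R/3) + (MP/10)
    return mp * (author_ratio / 3) + mp / 10


def timing_variable(video_age_min, comment_age_min):
    # X = 1/3(S/10 + A) x |A - 3S|
    # Assumption: the leading 1/3 multiplies the whole product.
    s, a = video_age_min, comment_age_min
    return (s / 10 + a) * abs(a - 3 * s) / 3


def rank_score(mp, author_ratio, video_age_min, comment_age_min):
    # N = X(C/4 + 1); according to the answer, a higher N ranks higher.
    c = position_variable(mp, author_ratio)
    x = timing_variable(video_age_min, comment_age_min)
    return x * (c / 4 + 1)


# Hypothetical comment: 30 module points, author ratio 3,
# video posted 100 minutes ago, comment posted 50 minutes ago.
print(rank_score(30, 3, 100, 50))
```

Sorting a list of comments by `rank_score` in descending order would then give the ordering this answer predicts.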
Note: Special thanks to my colleagues: David Mattison, Josh Williams, Diego Mendieta, Steven Orsette, and Kyle Shropshire. I could never have figured this out without them and the work they put into it.

Information Retrieval - Adjacency Matrix Graph Sketch, Teleportation Probability, Calculate PageRank

I am studying Information Retrieval and have an exam coming up, and I am absolutely clueless. First off, could anyone recommend the shortest and best possible description of what PageRank actually is in Information Retrieval? Maybe even a good short video or your own description. I know Google uses it, or used to.
I know there are a lot of questions here but I could use as MUCH help as possible in a short length of time.
So my first question (taken from past papers, and making my own examples):
I want to take a table such as:
  A B C
A 0 1 0
B 1 0 1
C 0 0 0
And create a graph. I believe this is correct but am unsure (I could use a "yes, that is correct" or a "no"):
And if I was given a graph such as:
The table would be:
  A B C
A 0 1 0
B 0 0 1
C 0 0 0
Is that correct? If not, could I please get help and get it described somewhere? The lecture I am reading is not great at explaining and my lecturer isn't great at helping either.
Next I will probably be asked to use the teleportation probability on the first table. This I desperately need help with. If the probability α = 1/2, does this mean multiplying everything in the table by it, including the 0's (0 x 1/2) as well as the 1's (1 x 1/2)? This is for the matrix of transition probabilities.
Next, how can I calculate PageRank from the above matrix using matrix multiplication? An answer in words or in pseudocode is fine.
Another question: will a user's PageRank on Twitter increase if they follow another user? I was assuming this would be a no, because the other user is not following them back.
Does a user's PageRank depend on how frequently you find said user if you start at a random user and keep clicking through to another random person until you find them? I assume this one is definitely not true, because they might not be following said user.
I know this is a lot to ask. Does anyone have tutorials I can follow for either that are not complicated and I can look at and get it mastered today?
Thanks I really appreciate all your help. I know not one person can answer them all but can help provide assistance for some.
here's my stab at answering your questions:
good learning resource:
http://en.wikipedia.org/wiki/PageRank#Simplified_algorithm (no doubt you've seen it already, but it's a pretty good one). Start there, understand the algorithm first, then do the implementation.
this might be a good simple method to implement?
http://pr.efactory.de/e-pagerank-algorithm.shtml
or this:
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
I'm guessing you can program in Python (common school language), in that case you might be interested in a package for handling graphs which has pagerank calculations: http://networkx.lanl.gov/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html. If you have to write your own pagerank algorithm (very doable), you could use that to check the results.
For the matrix -> graph conversion question: your professor needs to specify how directionality is encoded in the matrix. Does a 1 at B,C specify a link from B to C or from C to B? My guess would be B to C. If that's true, your first graph is wrong there, but the second graph is ok. Directionality is very important in PageRank.
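To make the direction convention concrete, here is a small Python sketch that lists the edges encoded by each table. It assumes, as guessed above, that a 1 in row B, column C means a link from B to C; if your course uses the opposite convention, swap the pair. The helper name `edges_from_matrix` is just for illustration.

```python
def edges_from_matrix(labels, matrix):
    """List directed edges, assuming matrix[i][j] == 1 means row i links to column j."""
    return [
        (labels[i], labels[j])
        for i, row in enumerate(matrix)
        for j, cell in enumerate(row)
        if cell
    ]


labels = ["A", "B", "C"]
first_table = [[0, 1, 0], [1, 0, 1], [0, 0, 0]]
second_table = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]

print(edges_from_matrix(labels, first_table))   # [('A', 'B'), ('B', 'A'), ('B', 'C')]
print(edges_from_matrix(labels, second_table))  # [('A', 'B'), ('B', 'C')]
```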
I believe the Teleportation probability is the probability that a random walker executing a new step will jump to a random node in the graph. It's in the wikipedia page under "damping factor". I don't know how it ties into multiplying numbers in your matrix.
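As a hedged illustration of how the teleportation probability can enter the matrix, here is a sketch of one common textbook construction; your lecture notes may define it differently, so check them. Under this construction you do not simply multiply every entry by α. Each row with out-links is divided by its number of 1s, scaled by (1 - α), and then α/N is added to every entry; a row with no out-links becomes uniform (1/N everywhere). PageRank is then approximated by repeatedly multiplying a row vector by that matrix (power iteration). The function names and the iteration count are assumptions for this example.

```python
def transition_matrix(adjacency, alpha):
    """Row-stochastic transition matrix with teleportation probability alpha."""
    n = len(adjacency)
    matrix = []
    for row in adjacency:
        out_links = sum(row)
        if out_links == 0:
            # Dangling node: the walker jumps to a uniformly random node.
            matrix.append([1 / n] * n)
        else:
            # Follow a link with probability (1 - alpha), teleport with alpha.
            matrix.append([(1 - alpha) * cell / out_links + alpha / n for cell in row])
    return matrix


def pagerank(adjacency, alpha=0.5, iterations=100):
    """Power iteration: multiply a row vector by the transition matrix repeatedly."""
    n = len(adjacency)
    p = transition_matrix(adjacency, alpha)
    rank = [1 / n] * n
    for _ in range(iterations):
        rank = [sum(rank[i] * p[i][j] for i in range(n)) for j in range(n)]
    return rank


first_table = [[0, 1, 0], [1, 0, 1], [0, 0, 0]]
for row in transition_matrix(first_table, 0.5):
    print(row)
print(pagerank(first_table, 0.5))  # approaches [0.3125, 0.375, 0.3125]
```

If you have networkx available, its pagerank function can be used to cross-check results, keeping in mind that its `alpha` parameter is the damping factor (the probability of following a link), not the teleportation probability.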
For the Twitter question - yes, I think you have it right. Linking to (or presumably following) a second person does nothing directly to the first person's pagerank, but it likely increases the second person's pagerank. In practice, there could be secondary effects, like the second person noticing that the first person is interesting and following them back.
second to last question - yes, one formulation of the pagerank algorithm is as a random walk along links with the frequency of encountering a node (page) going into the pagerank.
good luck!

Find most recent & closest posts, limit 20

I saw a question here recently and bookmarked it for further thought. This is the question. What I can't determine myself is whether this question is really interesting or nothing special.
The reason is that it looked to me like it had a really simple answer: sort by lowest distance*time product. Or am I missing something obvious?
I can explain the reason why it looked simple to me:
Distance is always somewhat constant no matter when or where the query is run. That is, if my home is at point A and there is a post at point B and another post at point C, then no matter when I run the query I will always get the same values, say 5km & 7km.
The time offset since the post looks like it's also somewhat constant in a sense that it grows equally for all posts. Meaning that if post B is from 2004 and post C is from 2009, now they will be 7 years and 2 years ago respectively. So next year it will be 8 and 3 years ago and so on.
Adding weight value(s) to 'tweak' the distance & time is not helpful (not needed) since (taking the values from the two posts above) 5*7*alpha will always be more than 2*7*alpha, hence no matter when we run the query, post C (2*7*alpha) will always be the 'closest most recent'.
Also, adding a weight constant to 'tweak' the results seems like it's no longer going to produce the closest and most recent, but will favor one or the other, in which case I may as well sort by most recent and then by closest, or vice versa. But that is no longer 'closest and most recent'; it is either closest then most recent, or most recent then closest, and both of those questions are trivial, I believe. So this is why I think tweaking is not a good idea no matter what units are chosen to represent the time offset and distance.
Addition doesn't work as well as multiplication I think but distance*time seems to be sufficient to always get the correct result.
So this is what I was thinking but then I thought, no that can't be that simple. So what am I missing here?
The best way to determine the desired sorting expression would be to let some human beings sort some items manually and deduce the expressions from their answers. It may well be that different persons would give different answers, so that one single expression can't accommodate everyone.
There are other useful polynomial expressions such as t*d + A*t + B*d, where t and d are time and distance. Maybe more precise results can be achieved if we introduce one more polynomial degree, so that expression becomes t*d + A*t*t + B*d*d + C*t + D*d. Only from answers of real humans can you devise this formula.
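As a rough sketch of how such an expression could drive the "limit 20" query, the snippet below scores each post with t*d + A*t + B*d and keeps the lowest-scoring ones. The coefficients default to 0 here, which reduces to the plain distance*time product from the question; real values for A and B would have to come from the human-sorted data described above. The tuple layout and function names are assumptions for illustration.

```python
def score(t, d, a=0.0, b=0.0):
    """Polynomial ranking expression t*d + A*t + B*d (lower is better)."""
    return t * d + a * t + b * d


def top_posts(posts, limit=20, a=0.0, b=0.0):
    """posts: iterable of (name, age, distance); returns the `limit` best by score."""
    return sorted(posts, key=lambda p: score(p[1], p[2], a, b))[:limit]


# The two posts from the question: (name, years since posting, km away).
posts = [("B", 7, 5), ("C", 2, 7)]
print([name for name, _, _ in top_posts(posts)])  # ['C', 'B']
```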

What class of algorithms can be used to solve this?

EDIT: Just to make sure someone is not breaking their head on the problem... I am not looking for the best optimal algorithm. Some heuristic that makes sense is fine.
I made a previous attempt at formulating this and realized I did not do a great job at it so I removed that question. I have taken another shot at formulating my problem. Please feel free to provide any constructive criticism that can help me improve this.
Input:
N people
k announcements that I can make
Distance that my voice can be heard (say 5 meters) i.e. I may decide to announce or not depending on the number of people within these 5 meters
Goal:
Maximize the total number of people who have heard my k announcements and (optionally) minimize the time in which I can finish announcing all k announcements
Constraints:
Once a person hears my announcement, he is removed from the total, i.e. if he had heard my first announcement, I do not count him even if he hears my second announcement
I can see the same person as well as the same set of people within my proximity
Example:
Let us consider 10 people numbered from 1 to 10 and the following pattern of arrival:
Time slot 1: 1 (payoff = 1)
Time slot 2: 2 3 4 5 (payoff = 4)
Time slot 3: 5 6 7 8 (payoff = 4 if no announcement was made previously in time slot 2, 3 if an announcement was made in time slot 2)
Time slot 4: 9 10 (payoff = 2)
and I am given 2 announcements to make. Now if I were an oracle, I would choose time slots 2 and 3, because then 7 people would have heard (person 5 already heard my announcement in time slot 2, so I do not count him again). I am looking for an online algorithm that will help me decide whether or not to make an announcement, and if so, based on what factors. Does anyone have any ideas on what algorithms can be used to solve this or a simpler version of this problem?
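For reference, the oracle's choice in this example can be checked with a brute-force search over every set of k time slots. This is only an offline baseline (it needs the full arrival pattern in advance, so it is not the online algorithm being asked for), and it is practical only for small inputs. The function name `best_slots` is made up for this sketch.

```python
from itertools import combinations


def best_slots(arrivals, k):
    """Offline baseline: try every choice of k slots, keep the one reaching most people."""
    best_choice, best_reached = (), set()
    for choice in combinations(range(len(arrivals)), k):
        # People who hear at least one of the chosen announcements.
        reached = set().union(*(arrivals[i] for i in choice))
        if len(reached) > len(best_reached):
            best_choice, best_reached = choice, reached
    # Report 1-based slot numbers to match the example.
    return [i + 1 for i in best_choice], len(best_reached)


arrivals = [{1}, {2, 3, 4, 5}, {5, 6, 7, 8}, {9, 10}]
print(best_slots(arrivals, 2))  # ([2, 3], 7)
```

An online strategy can then be judged by how close it gets to this offline optimum on sample data.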
There should be an approach relying upon a max-flow algorithm. In essence, you're trying to push the maximum amount of messages from start->end. Though it would be multidimensional, you could have a super-source, which connects to each value of t, then have each value of t connect to the people you can reach at this time, and then have those people connect to a super-sink. This way, you simply have to compute a max-flow (with the added constraint of no more than k shouts, which should be solvable with a bit of dynamic programming). It's a terrifically dirty way to solve it, but it should get the job done deterministically and without the use of heuristics.
I don't know that there is really a way to solve this or an algorithm to do it the way you have formulated it.
It seems like basically you are trying to reach the maximum number of people with exactly 2 announcements. But without knowing any information about the groups of people in advance, you can't really make any kind of intelligent decision about whether or not to use your first announcement. Your second one at least has the benefit of knowing when not to be used (i.e. if the group has no new members then you know it's not worth wasting the announcement). But it still has basically the same problem.
The only real way to solve this is to use knowledge about the type of data or the desired outcome to make guesses. If you know that groups average 100 people with a standard deviation of 10, then you could just refuse to announce if less than 90 people are present. Or, if you know you need to reach at least 100 people with two announcements, you could choose never to announce to less than 50 at once. Obviously those approaches risk never announcing at all if the actual data does not meet what you would expect. But that's always going to be a risk, since you could get 1 person in the first group and then 0 in all of the rest, no matter what you do.
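A threshold rule of this kind is easy to sketch. The version below is a small variation on the idea: it counts only people who have not yet heard an announcement (matching the question's payoff definition) rather than everyone present, and the threshold itself is an assumed input that would have to come from prior knowledge of the data.

```python
def announce_with_threshold(arrivals, k, threshold):
    """Online rule: announce when at least `threshold` not-yet-reached people are present."""
    heard, used_slots = set(), []
    for slot, present in enumerate(arrivals, start=1):
        if len(used_slots) == k:
            break  # all announcements spent
        new_people = set(present) - heard
        if len(new_people) >= threshold:
            heard |= new_people
            used_slots.append(slot)
    return used_slots, len(heard)


arrivals = [{1}, {2, 3, 4, 5}, {5, 6, 7, 8}, {9, 10}]
print(announce_with_threshold(arrivals, k=2, threshold=3))  # ([2, 3], 7)
print(announce_with_threshold(arrivals, k=2, threshold=4))  # ([2], 4)
```

The second call shows the risk mentioned above: a threshold that is too strict can leave an announcement unused.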
Or, you could try defining the problem more clearly; I have a hard time figuring out how to relate this to computers.
Let's start by trying to solve the simplest possible variant of the problem: let's assume N people and K timeslots, but only one possible announcement. Let's also assume that each person will only ever stay for one timeslot and that each person who hasn't yet shown up is equally likely to show up at any future timeslot.
Given these simplifications, at each timeslot you look at the payoff of announcing at the current timeslot and compare it to the chance of a future timeslot having a higher payoff. E.g., let's assume 4 people and 3 timeslots:
Timeslot 1: Person 1 shows up, so you know you could get a payoff of 1 by announcing, but then you have 3 people to show up in 2 remaining timeslots, so at least one of those timeslots is guaranteed to have 2 people, so don't announce.
So at each timeslot, you can calculate the chance that a later timeslot will have a higher payoff than the current one by treating the remaining (N) people and (K) timeslots as N independent random numbers, each from 1..K, and calculating the chance of at least one value being hit at least current-payoff times (similar to the Birthday problem, but for more than 1 collision). Then you need to decide how much to discount based on expected variances (bird in the hand, etc.).
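Rather than deriving that probability in closed form, one option is to estimate it by simulation. The sketch below is a Monte Carlo estimate under the simplifying assumptions stated above (each remaining person independently picks one of the remaining timeslots uniformly at random). It uses a strict "more than" comparison; change `>` to `>=` for the "at least as good" variant. The function name, trial count, and seed are all choices made for this example.

```python
import random


def chance_of_better_slot(remaining_people, remaining_slots, current_payoff,
                          trials=10_000, seed=0):
    """Estimate P(some future slot gets more than current_payoff arrivals)."""
    if remaining_slots == 0:
        return 0.0
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        counts = [0] * remaining_slots
        # Each remaining person lands in one uniformly random future slot.
        for _ in range(remaining_people):
            counts[rng.randrange(remaining_slots)] += 1
        if max(counts) > current_payoff:
            hits += 1
    return hits / trials


# Timeslot 1 of the example: payoff 1 now, 3 people left for 2 slots.
print(chance_of_better_slot(3, 2, 1))  # 1.0, matching the pigeonhole argument
```

A simple decision rule would be to announce only when this estimate falls below some cutoff, with the cutoff playing the role of the "bird in the hand" discount.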
Generalization of this solution to the original problem is left as an exercise for the reader.

Resources