Interview stumper: friends of friends of friends - algorithm

Suppose you have a social network with a billion users. On each user's page, you want to display the number of that user's friends, friends of friends, and so on, out to five degrees. Friendships are reciprocal. The counts don't need to update right away, but they should be precise.
I read up on graphs, but I didn't find anything that suggested a scalable approach to this problem. Anything I could think of would take way too much time, way too much space, or both. This is driving me nuts!

One interesting approach is to translate the friend graph into an adjacency matrix, and then raise the matrix to the 5th power. This gives you a matrix whose entries count the number of paths of length 5 between each pair of nodes.
Note that you'll want a matrix multiplication algorithm that can take advantage of sparse matrices, since the friendship adjacency matrix, and its first couple of powers, are likely to be sparse. Luckily, people have done a lot of work on how to multiply huge matrices (especially sparse ones) efficiently.
Here's a video where Twitter's Oscar Boykin mentions this approach for computing followers of followers at Twitter.
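Here's a toy sketch of that idea in Python with scipy's sparse matrices (my own choice of tooling, not anything from the video); at a billion users you'd shard the multiplication across a cluster, but the mechanics are the same:

# Toy friendship graph: one undirected edge per friendship.
import numpy as np
from scipy import sparse

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
n = 6
rows = [i for i, j in edges] + [j for i, j in edges]
cols = [j for i, j in edges] + [i for i, j in edges]
A = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

# A^k counts walks of length k. Note these are walks (which may revisit
# people), not distinct users at distance k, so you still need to dedupe
# if you want exact friends-of-friends counts.
power = A.copy()
for k in range(2, 6):
    power = power @ A                      # sparse-sparse multiplication
    print(f"walks of length {k} from user 0:", power[0].toarray().ravel())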

It seems to me that the problem really comes down to how we hash/track 1 billion users as we count the friends at each level. (Note that we only need to count them, NOT store them.)
If we assume that for each person, their friends and friends-of-friends are relatively few (say <1,000 and <100,000), it seems practical to keep these stored in database tables for each user. It only requires two manageable passes over the entire database, and then straightforward additions to the tables when a new relationship is created.
If we have 1st- and 2nd-degree friends stored in a user's tables, we can leverage those to extend as far as we need to.
E.g.: to COUNT 3rd-degree friends we need to hash and track the 1st-degree friends of all the 2nd-degree friends (for 4th degree you take the 2nd-degree friends of the 2nd-degree friends; for higher degrees you build the 4th and then extend appropriately to the 5th or 6th).
So, at that point (5th- and 6th-degree friends), the number of people that you need to track, hash and count starts to approach 1 billion.
I'm thinking that the problem then becomes: what is the most efficient way to hash 1 billion record IDs as you "count" the friends in the higher-order relationships?
How you do that, I don't know - any thoughts?
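To make the level-by-level idea concrete, here's a minimal in-memory sketch (a plain Python set stands in for whatever hash/bitset/Bloom-filter structure you'd actually use at billion-user scale):

from collections import defaultdict

def degree_counts(adjacency, user, max_degree=5):
    """adjacency: dict mapping user id -> set of friend ids (reciprocal)."""
    seen = {user}                  # everyone already counted at a lower degree
    frontier = {user}
    counts = []
    for _ in range(max_degree):
        nxt = set()
        for u in frontier:
            nxt |= adjacency[u]
        nxt -= seen                # only people first reached at this degree
        counts.append(len(nxt))
        seen |= nxt
        frontier = nxt
    return counts

adjacency = defaultdict(set)
for a, b in [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]:
    adjacency[a].add(b)
    adjacency[b].add(a)
print(degree_counts(adjacency, 1))   # [1, 1, 1, 1, 1] for this chain graph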

Related

Algorithm for matching people together based on likes and dislikes

I have a group of about 75 people. Each user has liked or disliked the other 74 users. These people need to be divided into about 15 groups of various sizes (4 to 8 people). They need to be grouped together so that the groups consist only of people who all liked each other, or at least as much as possible.
I'm not sure what the best algorithm is to tackle this problem. Any pointers or pseudo code much appreciated!
This isn't formed quite well enough to suggest a particular algorithm. I suggest clustering and "clique" algorithms, but you'll still need to define your "best grouping" metric. "as much as possible", in the face of trade-offs and undefined desires, is meaningless. Your clustering algorithm will need this metric to form your groups.
Data representation is simple: you need a directed graph. An edge from A to B means that A likes B; lack of an edge means A doesn't like B. That will encode the "likes" information in a form tractable to your algorithm. You have 75 nodes and one edge for every "like".
Start by researching clique algorithms; a "clique" is a set in which every member likes every other member. These will likely form the basis of your clustering.
Note, however, that you have to define your trade-offs. For instance, consider the case of 13 nodes consisting of two distinct cliques of 4 and 8 people, plus one person who likes one member of the 8-clique. There are no other "likes" in the graph.
How do you place that 13th person? Do you split the 8-clique and add them to the group with the person they like? If so, do you split off 3 or 4 people from the 8? Is it fair to break 15 or 16 "likes" to put that person with the one person they like -- who doesn't like them back? Is it better to add the 13th person to the mutually antagonistic clique of 4?
Your eval function must return a well-ordered metric for all of these situations. It will need to support adding to a group, splitting a large group, etc.
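As an illustration of what such a metric might look like (the +1/-1 weights here are arbitrary placeholders, not a recommendation), something like this gives a clustering algorithm a number to maximise:

def score_grouping(groups, likes):
    """groups: list of sets of people; likes: set of (a, b) pairs meaning a likes b."""
    score = 0
    for group in groups:
        for a in group:
            for b in group:
                if a == b:
                    continue
                score += 1 if (a, b) in likes else -1   # reward kept likes, penalise the rest
    return score

likes = {("ann", "bob"), ("bob", "ann"), ("bob", "cat")}
print(score_grouping([{"ann", "bob"}, {"cat"}], likes))   # 2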
It sounds like a clustering problem.
Each user is a node. If two users liked each other, there is an edge between their nodes.
If two users disliked each other, or one liked the other but not the other way around, then there is no edge between those nodes.
Once you process the like information into a graph, you will get a graph (some nodes may be isolated if no one likes that user). Now the question becomes how to cut that graph into clusters of 4-8 connected nodes, which is a well-studied problem with a lot of possible algorithms:
https://www.google.com/search?q=divide+connected+graph+into+clusters
If you want to differentiate between the case where two people dislike each other and the case where one person likes the other but that person dislikes the first, then you can also introduce weights on the edges: each like is +1 and each dislike is -1. Then the question becomes one of partitioning a weighted graph.
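A small sketch of that weighted encoding, using networkx (just one convenient library choice, any graph library would do):

import networkx as nx

ratings = {("ann", "bob"): "like", ("bob", "ann"): "like",
           ("ann", "cat"): "dislike", ("cat", "ann"): "like"}

G = nx.Graph()
for (a, b), opinion in ratings.items():
    delta = 1 if opinion == "like" else -1
    if G.has_edge(a, b):
        G[a][b]["weight"] += delta          # combine both directions of the pair
    else:
        G.add_edge(a, b, weight=delta)

print(G["ann"]["bob"]["weight"])   # 2: mutual like
print(G["ann"]["cat"]["weight"])   # 0: one like, one dislike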

Efficient Algorithm for finding closest match in a graph

Edit: to include a concrete explanation of my problem (as correctly deduced by Billiska):
"Set A is the set of users. set B is the set of products. each user rates one or more products. the rating is 1 to 10. you want to deduce for each user, who is the other user that has the most similar taste to him."
"The other half is choosing how exactly do you want to rank similarity of A-elements." - this is also part of my problem. I feel that users who have rated similarly across the most products have the closest affinity, but at the same time I want to avoid user1 and user2 with many mediocre matches being matched ahead of user1 and user3 who have just a few very good matches (perhaps I need a non-linear score).
Disclaimer: I have never used a graph database.
I have two sets of data, A and B. Each A has a relationship with zero to many Bs. Each relationship has a fixed value.
e.g.
A1--5-->B10
A1--1-->B1000
So my initial thought was "Yay, that's a graph, time to learn about graph databases!" but before I get too carried away... the only reason for doing this is so that I can answer the question...
For each A find the set of As that are most similar based on their weights, where I want to take in to consideration
the difference in weights (assuming 1 to 10), so that 10 and 10 is scored higher than 10 and 1; but then I have an issue with how to handle the case where there is no pairing (or do I? I am just not sure)
the number of vertices (ignoring weights) that two sets have in common. The intention is to rank two As with lots of edges to the same Bs higher than two As that have just a single matching vertex.
What would the best approach be to doing this?
(Supplementary - as I realise this may count as a second question): How would that approach change if the set of As was in the millions and Bs in the hundreds of thousands, and I needed real-time answers?
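For what it's worth, here is one purely illustrative way to combine the two criteria above (the formula is arbitrary; the point is that near-matches count for much more than distant ones, so many mediocre overlaps don't automatically beat a few very close ones):

def similarity(ratings_a, ratings_b):
    """ratings_*: dict mapping B-id -> weight (1..10)."""
    shared = ratings_a.keys() & ratings_b.keys()
    score = 0.0
    for b in shared:
        diff = abs(ratings_a[b] - ratings_b[b])
        score += (1.0 - diff / 9.0) ** 2    # non-linear: close ratings dominate
    return score

a1 = {"B10": 5, "B1000": 1, "B7": 9}
a2 = {"B10": 5, "B7": 8}
a3 = {"B10": 2, "B1000": 4, "B7": 3}
print(similarity(a1, a2), similarity(a1, a3))   # a2 scores higher despite fewer shared Bs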
Not a complete answer, and I don't fully understand the technique either, but I know it's very relevant.
If you view the data as a matrix, e.g. with the rows corresponding to set A, the columns corresponding to set B, and the entries being the weights, then it's a matrix with some missing values.
One technique used in recommender system (under the category of collaborative filtering) is low-rank approximation.
It's based on the assumption that the user-product rating matrix usually has low rank.
In a rough sense, the matrix has low rank if many users' rows can be expressed as linear combinations of other users' rows.
I hope this would give a start for further reading.
Yes, you can see on the low-rank approximation wiki page that the technique can be used to guess the missing entries (the missing ratings). I know it's a different problem, but related.
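As a starting point for that reading, here's a toy illustration of the low-rank idea with a plain truncated SVD (treating 0 as "missing", which real recommenders handle more carefully):

import numpy as np

ratings = np.array([
    [10, 8, 0, 1],    # user 0
    [9,  7, 0, 2],    # user 1: similar taste to user 0
    [1,  0, 9, 10],   # user 2: very different taste
], dtype=float)

U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
user_factors = U[:, :k] * s[:k]            # each row is a user's low-rank taste vector

# cosine similarity between users in the low-rank space
unit = user_factors / np.linalg.norm(user_factors, axis=1, keepdims=True)
sim = unit @ unit.T
np.fill_diagonal(sim, -np.inf)             # ignore self-similarity
print(sim.argmax(axis=1))                  # index of the most similar user for each user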

Optimally reordering cards in a wallet?

I was out buying groceries the other day and needed to search through my wallet to find my credit card, my customer rewards (loyalty) card, and my photo ID. My wallet has dozens of other cards in it (work ID, other credit cards, etc.), so it took me a while to find everything.
My wallet has six slots in it where I can put cards, with only the first card in each slot initially visible at any one time. If I want to find a specific card, I have to remember which slot it's in, then look at all the cards in that slot one at a time to find it. The closer it is to the front of a slot, the easier it is to find it.
It occurred to me that this is pretty much a data structures question. Suppose that you have a data structure consisting of k linked lists, each of which can store an arbitrary number of elements. You want to distribute elements into the linked lists in a way that minimizes looking up. You can use whatever system you want for distributing elements into the different lists, and can reorder lists whenever you'd like. Given this setup, is there an optimal way to order the lists, under any of the assumptions:
You are given the probabilities of accessing each element in advance and accesses are independent, or
You have no knowledge in advance what elements will be accessed when?
The informal system I use in my wallet is to "hash" cards into different slots based on use case (IDs, credit cards, loyalty cards, etc.), then keep elements within each slot roughly sorted by access frequency. However, maybe there's a better way to do this (for example, storing the k most frequently-used elements at the front of each slot regardless of their use case).
Is there a known system for solving this problem? Is this a well-known problem in data structures? If so, what's the optimal solution?
(In case this doesn't seem programming-related: I could imagine an application in which the user has several drop-down lists of commonly-used items, and wants to keep those items ordered in a way that minimizes the time required to find a particular item.)
Although not a full answer for general k, this 1985 paper by Sleator and Tarjan gives a helpful analysis of the amortised complexity of several dynamic list update algorithms for the case k=1. It turns out that move-to-front is very good: assuming fixed access probabilities for each item, it never requires more than twice the number of steps (moves and swaps) that would be required by the optimal (static) algorithm, in which all elements are listed in nonincreasing order of probability.
Interestingly, a couple of other plausible heuristics -- namely swapping with the previous element after finding the desired element, and maintaining order according to explicit frequency counts -- don't share this desirable property. OTOH, on p. 2 they mention that an earlier paper by Rivest showed that the expected amortised cost of any access under swap-with-previous is <= the corresponding cost under move-to-front.
I've only read the first few pages, but it looks relevant to me. Hope it helps!
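For concreteness, here's the move-to-front rule for a single slot (k = 1) in a few lines of Python:

def access(slot, card):
    """slot: list of cards, front first. Returns the search cost (cards inspected)."""
    cost = slot.index(card) + 1
    slot.remove(card)
    slot.insert(0, card)           # move the found card to the front
    return cost

slot = ["work ID", "credit card", "loyalty card", "photo ID"]
for card in ["photo ID", "photo ID", "credit card"]:
    print(card, "found after inspecting", access(slot, card), "cards")
print(slot)                        # frequently used cards drift toward the front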
You need to look at skip lists. There is a similar problem with arranging stations for a train system where there are express trains and regular trains. An express train stops only at express stations, while regular trains stop at both regular and express stations. Where should the express stops be placed so as to minimize the average number of stops when travelling from a start station to any other station?
The solution is to place the express stations at the triangular numbers (i.e., at 1, 3, 6, 10, etc., where T_n = n * (n + 1) / 2).
This is assuming all stops (or cards) are equally likely to be accessed.
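For reference, those stop positions are just the triangular numbers:

def triangular_stops(limit):
    """Positions T_n = n*(n+1)/2 up to limit."""
    stops, n = [], 1
    while n * (n + 1) // 2 <= limit:
        stops.append(n * (n + 1) // 2)
        n += 1
    return stops

print(triangular_stops(30))   # [1, 3, 6, 10, 15, 21, 28]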
If you know the access probabilities of your n cards in advance and you have k wallet slots and accesses are independent, isn't it fairly clear that the greedy solution is optimal? That is, the most frequently-accessed k cards go at the front of the pockets, next-most-frequently accessed k go immediately behind, and so forth? (You never want a lower-probability card ranked before a higher-probability card.)
If you don't know the access probabilities, but you do know they exist and that card accesses are independent, I imagine sorting the cards similarly, but by number-of-accesses-seen-so-far instead is asymptotically optimal. (Move-to-front is cool too, but I don't see an obvious reason to use it here.)
Perhaps the problem only gets interesting if you penalise card moves as well; otherwise, given any known probability distribution on card accesses, independent or not, I'd just greedily re-sort the cards every time I do an access.
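A sketch of that count-based greedy arrangement (card names and slot count are made up; the point is just "most-accessed cards at the fronts, next-most behind"):

from collections import Counter

def arrange(counts, cards, k):
    """Distribute cards into k slots, most frequently accessed at the fronts."""
    ordered = sorted(cards, key=lambda c: -counts[c])
    slots = [[] for _ in range(k)]
    for i, card in enumerate(ordered):
        slots[i % k].append(card)   # fill position 0 of every slot first, then position 1, ...
    return slots

counts = Counter({"credit card": 9, "photo ID": 7, "loyalty card": 4, "work ID": 1})
print(arrange(counts, list(counts), k=2))
# [['credit card', 'loyalty card'], ['photo ID', 'work ID']]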

Grouping individuals into families

We have a simulation program where we take a very large population of individual people and group them into families. Each family is then run through the simulation.
I am in charge of grouping the individuals into families, and I think it is a really cool problem.
Right now, my technique is pretty naive/simple. Each individual record has some characteristics, including married/single, age, gender, and income level. For married people I select an individual and loop through the population and look for a match based on a match function. For people/couples with children I essentially do the same thing, looking for a random number of children (selected according to an empirical distribution) and then loop through all of the children and pick them out and add them to the family based on a match function. After this, not everybody is matched, so I relax the restrictions in my match function and loop through again. I keep doing this, but I stop before my match function gets too ridiculous (marries 85-year-olds to 20-year-olds for example). Anyone who is leftover is written out as a single person.
This works well enough for our current purposes, and I'll probably never get time or permission to rework it, but I at least want to plan for the occasion or learn some cool stuff - even if I never use it. Also, I'm afraid the algorithm will not work very well for smaller sample sizes. Does anybody know what type of algorithms I can study that might relate to this problem or how I might go about formalizing it?
For reference, I'm comfortable with chapters 1-26 of CLRS, but I haven't really touched NP-Completeness or Approximation Algorithms. Not that you shouldn't bring up those topics, but if you do, maybe go easy on me because I probably won't understand everything you are talking about right away. :) I also don't really know anything about evolutionary algorithms.
Edit: I am specifically looking to improve the following:
Less ridiculous marriages.
Fewer single people at the end.
Perhaps what you are looking for is cluster analysis?
Let's try to think of your problem like this (starting by solving the spouse matching):
If you were to have a matrix where each row is a male and each column is a female, and every cell in that matrix is the match function's returned value, what you are now looking for is a selection of cells such that no row or column contains more than one selected cell, and the total sum of all selected cells is maximal. This is very similar to the N-Queens Problem, with the modification that each placement of a "queen" has a reward (which we should maximize).
You could solve this problem by using a graph where:
You have a root,
each of the first row's cells' values is an edge weight leading to a first-depth vertex,
each of the second row's cells' values is an edge weight leading to a second-depth vertex,
Etc.
(Notice that when you find a match for the first female, you shouldn't consider her any more, and likewise for every other female you find a match for.)
Then finding the maximum allocation can be done by BFS, or better still by A* (notice A* typically looks for minimum cost, so you'll have to modify it a bit).
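As an aside, the spouse-matching step described above is an instance of the assignment problem, so an off-the-shelf Hungarian-algorithm solver (e.g. scipy's linear_sum_assignment, offered here as a substitute for the BFS/A* search sketched above) can maximise the total match score directly:

import numpy as np
from scipy.optimize import linear_sum_assignment

# match_score[i, j] = match function value for pairing male i with female j (toy numbers)
match_score = np.array([
    [8, 2, 5],
    [3, 9, 1],
    [4, 6, 7],
])

rows, cols = linear_sum_assignment(match_score, maximize=True)
print([(int(r), int(c)) for r, c in zip(rows, cols)])   # [(0, 0), (1, 1), (2, 2)]
print(match_score[rows, cols].sum())                    # 24, the maximum total match score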
For matching between couples (or singles, more on that later..) and children, I think KNN with some modifications is your best bet, but you'll need to optimize it to your needs. But now I have to relate to your edit..
How do you measure your algorithm's efficiency?
You need a function that receives the expected distribution of all states (single, married with one child, single with two children, etc.) and the distribution of all states in your solution, and grades the solution accordingly. How do you calculate the expected distribution? That's quite a bit of statistics work..
First you need to know the distribution of all states (single, married.. as mentioned above) in the population,
then you need to know the distribution of ages and genders in the population,
and the last thing you need to know is the distribution of ages and genders in your population.
Only then, according to those three, can you calculate how many people you expect to be in each state.. And then you can measure the distance between what you expected and what you got... That is a lot of typing.. Sorry for the general parts...
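As a bare-bones illustration of such a grading function (the distance measure here is arbitrary), assuming you already have both distributions:

def grade(expected, observed):
    """expected/observed: dict mapping state name -> fraction of the population."""
    states = expected.keys() | observed.keys()
    # 1.0 means the distributions match exactly; lower means further apart
    return 1.0 - 0.5 * sum(abs(expected.get(s, 0.0) - observed.get(s, 0.0))
                           for s in states)

expected = {"single": 0.30, "married": 0.50, "married+kids": 0.20}
observed = {"single": 0.42, "married": 0.40, "married+kids": 0.18}
print(grade(expected, observed))   # 0.88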

What algorithm should I use for "genetic AI improvement"

First of all: This is not a question about how to make a program play Five in a Row. Been there, done that.
Introductory explanation
I have made a five-in-a-row-game as a framework to experiment with genetically improving AI (ouch, that sounds awfully pretentious). As with most turn-based games the best move is decided by assigning a score to every possible move, and then playing the move with the highest score. The function for assigning a score to a move (a square) goes something like this:
If the square already has a token, the score is 0 since it would be illegal to place a new token in the square.
Each square can be a part of up to 20 different winning rows (5 horizontal, 5 vertical, 10 diagonal). The score of the square is the sum of the score of each of these rows.
The score of a row depends on the number of friendly and enemy tokens already in the row. Examples:
A row with four friendly tokens should have infinite score, because if you place a token there you win the game.
The score for a row with four enemy tokens should be very high, since if you don't put a token there, the opponent will win on his next turn.
A row with both friendly and enemy tokens will score 0, since this row can never be part of a winning row.
Given this algorithm, I have declared a type called TBrain:
type
  TBrain = array[cFriendly..cEnemy, 0..4] of integer;
The values in the array indicate the score of a row with either N friendly tokens and 0 enemy tokens, or 0 friendly tokens and N enemy tokens. If there are 5 tokens in a row there's no score, since the row is full.
It's actually quite easy to decide which values should be in the array. vBrain[0,4] (four friendly tokens) should be "infinite"; let's call that 1,000,000. vBrain[1,4] should be very high, but not so high that the brain would prefer blocking several enemy wins rather than winning itself.
Consider the following (improbable) board:
0123456789
+----------
0|1...1...12
1|.1..1..1.2
2|..1.1.1..2
3|...111...2
4|1111.1111.
5|...111....
6|..1.1.1...
7|.1..1..1..
8|1...1...1.
Player 2 should place his token at (9,4), winning the game, not at (4,4), even though that would block 8 potential winning rows for player 1. Ergo, vBrain[1,4] should be (vBrain[0,4]/8)-1. Working like this we can find optimal values for the "brain" by hand, but again, this is not what I'm interested in. I want an algorithm to find the best values.
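To make the scoring rule concrete, here it is sketched in Python rather than Delphi (the table values are placeholders in the spirit described above, not tuned ones):

FRIENDLY, ENEMY = 0, 1
brain = [
    [1, 10, 100, 1000, 1_000_000],   # score for 0..4 friendly tokens, no enemy tokens
    [1, 10, 100, 1000, 124_999],     # score for 0..4 enemy tokens, no friendly tokens
]

def row_score(friendly, enemy):
    if friendly and enemy:
        return 0                      # mixed row can never become five in a row
    if enemy:
        return brain[ENEMY][enemy]
    return brain[FRIENDLY][friendly]

# A square's score is the sum of row_score over the up-to-20 rows through it.
print(row_score(4, 0), row_score(0, 4), row_score(2, 1))   # 1000000 124999 0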
I have implemented this framework so that it's totally deterministic. There's no random values added to the scores, and if several squares have the same score the top-left will be chosen.
Actual problem
That's it for the introduction, now to the interesting part (for me, at least)
I have two "brains", vBrain1 and vBrain2. How should I iteratively make these better? I imagine something like this:
Initialize vBrain1 and vBrain2 with random values.
Simulate a game between them.
Assign the values from the winner to the loser, then randomly change one of them slightly.
This doesn't seem to work. The brains don't get any smarter. Why?
Should the score-method add some small random values to the result, so that two games between the same two brains would be different? How much should the values change for each iteration? How should the "brains" be initialized? With constant values? With random values?
Also, does this have anything to do with AI or genetic algorithms at all?
PS: The question has nothing to do with Five in a Row. That's just something I chose because I can declare a very simple "Brain" to experiment on.
If you want to approach this problem like a genetic algorithm, you will need an entire population of "brains". Then evaluate them against each other, either every combination or use a tournament style. Then select the top X% of the population and use those as the parents of the next generation, where offspring are created via mutation (which you have) or genetic crossover (e.g., swap rows or columns between two "brains").
Also, if you do not see any evolutionary progress, you may need more than just win/loss: come up with some kind of point system so that you can rank the entire population more effectively, which makes selection easier.
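A bare-bones sketch of that loop (assuming you already have a per-brain fitness score, e.g. wins from a round-robin among the population; all the constants here are arbitrary):

import random

GENES = 10                     # e.g. a 2 x 5 score table, flattened

def random_brain():
    return [random.randint(0, 1000) for _ in range(GENES)]

def mutate(brain, rate=0.1):
    return [g + random.randint(-50, 50) if random.random() < rate else g
            for g in brain]

def crossover(a, b):
    cut = random.randrange(1, GENES)
    return a[:cut] + b[cut:]   # single-point crossover

def evolve(population, fitness, keep=0.25):
    """Keep the top fraction as parents and refill the rest with mutated offspring."""
    ranked = [b for _, b in sorted(zip(fitness, population), key=lambda p: -p[0])]
    parents = ranked[:max(2, int(len(ranked) * keep))]
    children = [mutate(crossover(*random.sample(parents, 2)))
                for _ in range(len(population) - len(parents))]
    return parents + children

population = [random_brain() for _ in range(20)]
# fitness = [games brain i won in a round-robin]  <- comes from your game framework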
Generally speaking, yes you can make a brain smarter by using genetic algorithms techniques.
Randomness, or mutation, plays a significant part in genetic programming.
I like this tutorial, Genetic Algorithms: Cool Name & Damn Simple.
(It uses Python for the examples but it's not difficult to understand them)
Take a look at NeuroEvolution of Augmenting Topologies (NEAT). A fancy acronym which basically means the evolution of neural nets - both their structure (topology) and connection weights. I wrote a .NET implementation called SharpNEAT that you may wish to look at. SharpNEAT V1 also has a Tic-Tac-Toe experiment.
http://sharpneat.sourceforge.net/
