Algorithm: find friends from a list of users

Scenario: in my app, users can follow a post, and they get notified whenever one of their friends likes it. The problem becomes nontrivial when there are thousands of users following and liking a post.
My current approach is simple: when a new user likes a post, iterate through all the users who follow it and check whether the new user exists in their friend list (say the average friend-list size is N). I indexed the friend lists, so each lookup is O(log N), which means each new like costs O(k log N) if k people follow the post; and since roughly k users end up liking it, the overall complexity becomes O(k^2 log N). Can I do better than this?
Note:
Notification does not have to be instant, nor does it have to happen 100% of the time
Posts are created by users
I am using Firestore, a NoSQL database, if that matters

What you need is a hybrid approach. Take advantage of the fact that the user's friend list might be shorter than the list of followers, or vice versa. There are two options:
Do what you do now, and check every follower against the new user's friends list. The time complexity reflects the number of followers.
Do the reverse, and check every friend of the user against the followers list for the post. The time complexity reflects the number of friends of the user.
Armed with these two tactics, we can design an algorithm that checks which of the two will give the better performance.
Keep an active count of the number of friends of each user and of the followers of each post. When someone likes a post, if they have fewer friends than the post has followers, it's faster to check whether each friend is in the follower list (use a self-balancing BST or hash table in the implementation). If there are fewer followers than the user has friends, the reverse is faster.
If there are N followers, K users liking the post, and F friends per user, checking friend->follower gives a running time of O(K*F*log(N)), and follower->friend gives O(K*N*log(F)). The worst case remains the same; however, if you are only concerned about theoretical time bounds, you could substitute your index with a hash table anyway, which makes each lookup O(1) instead of O(log n).
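Here is a minimal Python sketch of this "check the smaller side" idea. The names followers and friends_of are hypothetical placeholders; in the question's setup they would come from Firestore reads.

    def users_to_notify(new_liker, followers, friends_of):
        # Return the followers who are friends of the user who just liked the post.
        friends = friends_of(new_liker)
        follower_set = set(followers)
        if len(friends) < len(follower_set):
            # Few friends, many followers: scan the friend list.
            return [f for f in friends if f in follower_set]
        # Few followers, many friends: scan the follower list.
        friend_set = set(friends)
        return [f for f in follower_set if f in friend_set]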

I think it can be improved to O(N^2 + k log(N^2)) by using more memory. This problem is fundamentally one of finding the intersection of two sets (the set of friends of the user who just liked the post and the set of followers, OR the set of friends of the followers and the set of users who liked). Since lookup is cheap, we want the set being looked up to be as big as possible. So if we put all the friends of all the followers into one big set (more specifically, a map) of size N^2, each of the k likes costs O(log(N^2)), plus the initial O(N^2) iteration to build it.
An additional benefit of aggregating friends together is that many users have mutual friends, so the actual size may be smaller than N^2.
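As a rough illustration of that aggregated map, here is a hedged Python sketch. followers and friends_of are stand-ins for the Firestore data; the index is built once per post and reused for every like.

    from collections import defaultdict

    def build_friend_index(followers, friends_of):
        # Map each user to the set of followers who list them as a friend.
        # Building it costs one pass over every follower's friend list.
        index = defaultdict(set)
        for follower in followers:
            for friend in friends_of(follower):
                index[friend].add(follower)
        return index

    def followers_to_notify(index, new_liker):
        # One cheap lookup per like instead of scanning every follower.
        return index.get(new_liker, set())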

Related

Given N users with movies preferences, retrieve a list of movies preferred by at least K users

Given N users with movies preferences, retrieve a list of movies preferred by at least K users.
What's the most efficient [Run-time / Memory] algorithm to find that answer?
If N=K it's easy, since you could:
1. Intersection = first user's preferences.
2. For each of the remaining users:
3. Intersection = intersect(Intersection, user_i).
4. If the intersection is empty, there's no point in continuing.
Step (4) is the problematic part in any other case, since even if the running intersection is empty, a movie still has 'potential' to reach K users.
I thought of creating a hash map to count the number of intersections per movie preference, but it sounds pretty inefficient, especially if the set of movie preferences is huge.
Any ideas / hints? Thanks.
If you want to optimize for run time, your approach of creating a histogram is a good one. Basically, run over all the data and build a map: movie -> #users. Then a single iteration over the map gives you the list of movies liked by k+ users.
This is O(N+k) time and O(k) memory.
Note that this approach can be efficiently distributed using map-reduce.
map: (user, movie) -> (movie, 1)
reduce: (movie, list<int>) -> movie if sum(list) >= k else none
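A single-machine version of the same histogram, as a hedged Python sketch (preferences is hypothetical sample data mapping user -> list of movies):

    from collections import Counter

    def movies_liked_by_at_least(preferences, k):
        counts = Counter()
        for movies in preferences.values():
            counts.update(set(movies))           # dedupe per user before counting
        return [movie for movie, c in counts.items() if c >= k]

    prefs = {"u1": ["Alien", "Up"], "u2": ["Up"], "u3": ["Up", "Alien"]}
    print(movies_liked_by_at_least(prefs, 2))    # ['Alien', 'Up'] in some order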
If you want to do it with minimal added memory, you can use an in-place sorting algorithm on your data, sorted by movie name. Then iterate over the data and count how many times each movie repeats; if it's k or more, yield it.
This is O(N log N) run time, with minimal added memory.
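The sort-and-scan variant, as a minimal sketch over made-up (user, movie) entries:

    from itertools import groupby

    entries = [("u1", "Up"), ("u2", "Up"), ("u1", "Alien"), ("u3", "Up")]

    def movies_by_sorting(entries, k):
        entries.sort(key=lambda e: e[1])         # in-place sort by movie name
        return [movie for movie, grp in groupby(entries, key=lambda e: e[1])
                if sum(1 for _ in grp) >= k]     # count each run of equal movies

    print(movies_by_sorting(entries, 2))         # ['Up']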
In both solutions, N stands for the input size (number of entries), which is potentially O(n*k), but practically much less.

What data structure to use in maintaining k most frequently dialed numbers in a phone?

I was asked this question in an interview: "How do you maintain the k most frequently dialed numbers in a phone?" So what kind of data structure should be used in this case?
The tasks are:
Keep track of the #times each number is dialed;
Keep track of top counted k numbers.
So you'll have to use an augmented data structure. In this case, that is a HashSet plus a PriorityQueue (i.e., a heap) of size k with the least-dialed number at the top.
Since the number of times a number has been dialed can only increase, our job is a bit easier: you will never have to pull a number out of the heap because its count went down. Instead you only ever add a number that has been dialed and then remove the top of the heap, because the top is the least-dialed number.
The class PhoneNumber would contain:
the phone number;
the count of times it has been dialed; and,
a boolean telling whether it is currently among the top-k numbers or not.
General steps would be:
Whenever a number is dialed:
If it has never been dialed before, add it to the HashSet with a dial count of 1 and its heap-membership boolean set to false (it is not in the heap yet);
If it is already present in the HashSet, increase its dial count by 1, making sure the hash function is independent of the dial count (otherwise you will not be able to retrieve the number from the HashSet after updating it);
If the number is already in the heap (which you can tell from the boolean in the PhoneNumber object), its key has changed, so heapify() the heap again;
If the number is not in the heap, add it (setting its tracking boolean to true) and then remove the top, setting the tracking boolean of the removed number to false. This ensures that only the top-k dialed numbers are present in the heap;
Make sure you don't remove anything until the heap's size exceeds k.
Space complexity: O(n) for the n numbers dialed so far, stored in the HashSet and referenced from the heap.
Time complexity: O(k + log(k)) for each dial, because you may have to heapify on every new dial; the key changes for only one number in the worst case, so you rebuild over the k heap entries and then do O(log k) sifting work for exactly one number. Getting the top-k dialed numbers is O(1), since they are sitting right there in your heap.
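A compact Python sketch of that HashMap-plus-min-heap bookkeeping (illustrative names only; it rebuilds the whole k-sized heap when a key changes, which is the heapify() step described above):

    import heapq

    class TopKDialed:
        def __init__(self, k):
            self.k = k
            self.counts = {}         # phone number -> times dialed
            self.in_heap = set()     # numbers currently in the top-k heap
            self.heap = []           # min-heap of (count, number)

        def dial(self, number):
            self.counts[number] = self.counts.get(number, 0) + 1
            if number in self.in_heap:
                # Its key changed: refresh the counts and restore heap order, O(k).
                self.heap = [(self.counts[n], n) for _, n in self.heap]
                heapq.heapify(self.heap)
            else:
                heapq.heappush(self.heap, (self.counts[number], number))
                self.in_heap.add(number)
                if len(self.heap) > self.k:              # evict the least dialed
                    _, evicted = heapq.heappop(self.heap)
                    self.in_heap.discard(evicted)

        def top_k(self):
            return sorted(self.heap, reverse=True)       # largest count first

    t = TopKDialed(2)
    for n in ["111", "222", "111", "333", "111", "333"]:
        t.dial(n)
    print(t.top_k())   # [(3, '111'), (2, '333')]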
Priority Queue (typically implemented as a max heap)
A max heap, exposed as a PriorityQueue in many programming languages, can be used, where each entry is a <phone_number, count_of_dial> pair. The max heap is ordered by count_of_dial, and the top k items are the answer.
The purpose of this question is twofold:
To get you to ask questions.
To get you to talk through drawbacks and advantages of different approaches.
The interviewer isn't terribly interested in you getting the "right" answer, as much as he's interested in how you approach the problem. For example, the problem as stated is not well specified. You probably should have asked questions like:
Most frequent over what period? All time? Per month? Year to date?
How will this information be used?
How frequently will the information be queried?
How fast does response have to be?
How many phone numbers do you expect it to handle?
Is k constant? Or will users ask at one point for the top 10, and some other time for the top 100?
All of these questions are relevant to solving the problem.
Once you know all of the requirements, then you can start thinking about how to implement a solution. It could be as simple as maintaining a call count with every phone entry. Then, when the data is queried, you run a simple heap selection algorithm on the entire phone list, picking the top k items. That's a perfectly reasonable solution if queries are infrequent. The typical phone isn't going to have a huge number of called numbers, so this could work quite well.
Another possible solution would be to maintain the call count for each number, and then, after every call, run the heap selection algorithm and cache the result. The idea here is that the data can only update when a new call is made, and calls are very infrequent, in terms of computer time. If you could make a call every 15 seconds (quite unlikely), that's only 5,760 calls in a day. Even the slowest phone should be able to keep up with that.
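For the "select on demand" flavour described above, the whole query can be a few lines over a hypothetical call-count dictionary:

    import heapq

    call_counts = {"555-0101": 42, "555-0102": 7, "555-0103": 19}   # number -> calls made

    def top_k_called(call_counts, k):
        # heapq.nlargest is a heap-selection algorithm: O(n log k) over n numbers.
        return heapq.nlargest(k, call_counts.items(), key=lambda kv: kv[1])

    print(top_k_called(call_counts, 2))   # [('555-0101', 42), ('555-0103', 19)]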
There are other solutions, all of which have their advantages and disadvantages. Balancing memory use, CPU resources, simplicity, development time, and many other factors is a large part of what we do as software developers. The interviewer purposely under-specified a problem with a seemingly straightforward solution in order to see how you approach things.
If you did not do well on the interview, learn from it. Also understand that the interviewer probably thought you were technically competent, otherwise he wouldn't have moved on to see how well you approach problems. You're expected to ask questions. After all, anybody can regurgitate simple solutions to simple problems. But in the real world we don't get simple problems, and the people asking us to do things don't always fully specify the requirements. So we have to learn how to extract the requirements from them: usually by asking questions.
I'd use a structure that holds the number and how many times it was dialed, and put that in a B-tree organized by the number of times each number was dialed.
Add O(log(n)) [Balanced]
Add O(n) [NOT balanced]
Search O(log(n))
Balance O(log(n))
Add(not balanced) + balance O(log(n))
IN THE WORST CASE: searching + adding + balancing would be O(n). The average complexity of all operations in a B-tree is still O(log(n)).
A B-tree grows at the root, not at the leaves, so it is guaranteed to be balanced at all times; you keep it balanced on insertion by splitting full nodes and pushing the median values up.
This specific case, where I don't have to forget numbers that were once dialed, is even simpler.
The advantage is that the tree is always ordered, so what you are looking for would simply be the first k "nodes (key/value pairs)" of the tree (e.g., the first 50).
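Python's standard library has no B-tree, so this hedged sketch stands in with a sorted list of (dial_count, number) pairs maintained with bisect; removals and inserts here are O(n) rather than a B-tree's O(log n), but the idea of reading the answer straight off the ordered structure is the same:

    import bisect

    ordered = []        # kept sorted ascending by (count, number)
    counts = {}         # number -> current dial count

    def dialed(number):
        old = counts.get(number)
        if old is not None:
            ordered.pop(bisect.bisect_left(ordered, (old, number)))   # drop the stale entry
        counts[number] = (old or 0) + 1
        bisect.insort(ordered, (counts[number], number))

    def top(k):
        return list(reversed(ordered[-k:]))    # the k most dialed numbers

    for n in ["111", "222", "111", "333", "111"]:
        dialed(n)
    print(top(2))   # [(3, '111'), (1, '333')]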

Data Structure search record

The question was asked by an interview panel: in a web application, users can select their favorite sports, and one user can have several favorites. E.g., user John has Football, Soccer, and Tennis as favorite sports, and user Alen has Baseball and Basketball.
Considering millions of users, which data structure and algorithm would you use to find all users associated with Football or Soccer?
First I answered HashMap, but the interview panel told me it would cause memory issues; I then suggested a binary search tree, but they were not satisfied with that answer either.
Can anyone please explain a good way to get all users with a given favorite sport, and which data structure/algorithm to use?
The easiest solution is to use a HashMap mapping each user (key) to their list of sports (value), as you mentioned. It's very common to offer this solution first in an interview and see whether it satisfies the interviewer.
A better solution is to build a graph of users and sports with a directed edge from each sport node (e.g., football) to the user nodes (e.g., foo, bar). For a query on a sport such as football, we traverse the graph with football as the source node; every node reached is a user who has football among their favorite sports. This is more space efficient.
Regarding the time complexity of traversing the graph on each query: a traversal is O(E), where E can be all users in the worst case. So we can cache frequent query results in a HashMap, and to keep the cache's memory bounded we can turn it into an LRU cache.
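A tiny Python sketch of those sport -> user edges, stored as adjacency lists (the favourites data is made up):

    from collections import defaultdict

    favourites = {
        "John": ["Football", "Soccer", "Tennis"],
        "Alen": ["Baseball", "Basketball"],
        "Mia":  ["Soccer"],
    }

    by_sport = defaultdict(list)
    for user, sports in favourites.items():
        for sport in sports:
            by_sport[sport].append(user)     # one directed edge: sport -> user

    print(by_sport["Soccer"])                # ['John', 'Mia']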
Hope it helps!

Interview stumper: friends of friends of friends

Suppose you have a social network with a billion users. On each user's page, you want to display the number of that user's friends, friends of friends, and so on, out to five degrees. Friendships are reciprocal. The counts don't need to update right away, but they should be precise.
I read up on graphs, but I didn't find anything that suggested a scalable approach to this problem. Anything I could think of would take way too much time, way too much space, or both. This is driving me nuts!
One interesting approach is to translate the friend graph into an adjacency matrix and then raise the matrix to the 5th power. This gives you a matrix whose entries count the walks of length 5 between each pair of nodes.
Note that you'll want a matrix multiplication algorithm that can take advantage of sparse matrices, since the friend adjacency matrix is likely to be sparse for the first couple of levels. Lucky for you, people have done a lot of work on how to multiply huge matrices (especially sparse ones) efficiently.
Here's a video where Twitter's Oscar Boykin mentions this approach for computing followers of followers at Twitter.
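As a rough illustration (not the poster's code), here is a small SciPy sketch on a made-up six-user graph. Note it computes reachability within five hops; entries of a matrix power count walks, not distinct people, so a real system would still need to dedupe per level.

    import numpy as np
    from scipy import sparse

    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]      # hypothetical friendships
    n = 6

    rows = [a for a, b in edges] + [b for a, b in edges]  # friendships are reciprocal
    cols = [b for a, b in edges] + [a for a, b in edges]
    A = sparse.csr_matrix((np.ones(len(rows), dtype=np.int64), (rows, cols)), shape=(n, n))

    # (A + I)^5: entry (i, j) is nonzero iff j is within 5 hops of i.
    reach = (A + sparse.identity(n, dtype=np.int64, format="csr")) ** 5
    within_five = [(reach[i] != 0).sum() - 1 for i in range(n)]   # exclude self
    print(within_five)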
It seems to me that the problem really comes down to how we hash/track 1 Billion users as we are counting the friends at each level. (Note that we only need to count them, NOT store them)
If we assume that for each person, their friends and the friends of their friends are of very small order (say <1,000 and <100,000), it seems practical to keep these stored in database tables for each user. It only requires two manageable passes over the entire database, and then straightforward additions to the tables when a "new" relationship is created.
If we have 1st- and 2nd-degree friends stored in a user's tables, we can leverage those to extend as far as we need to:
E.g., to count 3rd-degree friends we need to hash and track the 1st-degree friends of all the 2nd-degree friends (for the 4th degree you take the 2nd-degree friends of the 2nd-degree friends; for higher degrees you build the 4th and then extend appropriately to the 5th or 6th).
So, at that point (5th- and 6th-degree friends) you are starting to approach 1 billion as the number of people you need to track, hash and count.
I'm thinking that the problem then becomes: what is the most efficient way to hash 1 billion record IDs as you "count" the friends in the higher-order relationships?
How you do that, I don't know - any thoughts?
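One way to picture the counting step is a level-by-level BFS where you only ever keep the set of ids seen so far, which is exactly the "hash and track" concern above. A toy sketch with an in-memory adjacency dict (real data obviously wouldn't fit in one process):

    def counts_by_degree(graph, start, max_degree=5):
        # graph: dict user -> set of friends (friendships are reciprocal)
        seen = {start}
        frontier = {start}
        counts = []
        for _ in range(max_degree):
            frontier = {v for u in frontier for v in graph[u]} - seen
            counts.append(len(frontier))   # number of people first reached at this degree
            seen |= frontier
        return counts

    friends = {
        "alice": {"bob", "carol"}, "bob": {"alice", "dave"},
        "carol": {"alice"}, "dave": {"bob", "erin"}, "erin": {"dave"},
    }
    print(counts_by_degree(friends, "alice"))   # [2, 1, 1, 0, 0]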

Optimally reordering cards in a wallet?

I was out buying groceries the other day and needed to search through my wallet to find my credit card, my customer rewards (loyalty) card, and my photo ID. My wallet has dozens of other cards in it (work ID, other credit cards, etc.), so it took me a while to find everything.
My wallet has six slots in it where I can put cards, with only the first card in each slot initially visible at any one time. If I want to find a specific card, I have to remember which slot it's in, then look at all the cards in that slot one at a time to find it. The closer it is to the front of a slot, the easier it is to find it.
It occurred to me that this is pretty much a data structures question. Suppose that you have a data structure consisting of k linked lists, each of which can store an arbitrary number of elements. You want to distribute elements into the linked lists in a way that minimizes the cost of lookups. You can use whatever system you want for distributing elements into the different lists, and you can reorder the lists whenever you'd like. Given this setup, is there an optimal way to order the lists under either of these assumptions:
You are given the probabilities of accessing each element in advance and accesses are independent, or
You have no knowledge in advance what elements will be accessed when?
The informal system I use in my wallet is to "hash" cards into different slots based on use case (IDs, credit cards, loyalty cards, etc.), then keep elements within each slot roughly sorted by access frequency. However, maybe there's a better way to do this (for example, storing the k most frequently-used elements at the front of each slot regardless of their use case).
Is there a known system for solving this problem? Is this a well-known problem in data structures? If so, what's the optimal solution?
(In case this doesn't seem programming-related: I could imagine an application in which the user has several drop-down lists of commonly-used items, and wants to keep those items ordered in a way that minimizes the time required to find a particular item.)
Although not a full answer for general k, this 1985 paper by Sleator and Tarjan gives a helpful analysis of the amortised complexity of several dynamic list update algorithms for the case k=1. It turns out that move-to-front is very good: assuming fixed access probabilities for each item, it never requires more than twice the number of steps (moves and swaps) that would be required by the optimal (static) algorithm, in which all elements are listed in nonincreasing order of probability.
Interestingly, a couple of other plausible heuristics -- namely swapping with the previous element after finding the desired element, and maintaining order according to explicit frequency counts -- don't share this desirable property. OTOH, on p. 2 they mention that an earlier paper by Rivest showed that the expected amortised cost of any access under swap-with-previous is <= the corresponding cost under move-to-front.
I've only read the first few pages, but it looks relevant to me. Hope it helps!
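For the curious, move-to-front itself is a few lines; the access cost here is the 1-based position at which the card is found (toy example, single slot):

    def access(cards, item):
        pos = cards.index(item) + 1   # cards inspected before finding it
        cards.remove(item)
        cards.insert(0, item)         # move the accessed card to the front
        return pos

    cards = ["photo ID", "credit card", "loyalty card", "work ID"]
    cost = sum(access(cards, c) for c in ["credit card", "credit card", "work ID"])
    print(cards, cost)                # frequently used cards drift to the front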
You need to look at skip lists. There is a similar problem in arranging stations for a train system where there are express trains and regular trains: an express train stops only at express stations, while regular trains stop at both regular and express stations. Where should the express stops be placed so as to minimize the average number of stops when travelling from a start station to any other station?
The solution is to place the express stations at the triangular numbers (i.e., at 1, 3, 6, 10, etc., where T_n = n * (n + 1) / 2).
This is assuming all stops (or cards) are equally likely to be accessed.
If you know the access probabilities of your n cards in advance and you have k wallet slots and accesses are independent, isn't it fairly clear that the greedy solution is optimal? That is, the most frequently-accessed k cards go at the front of the pockets, next-most-frequently accessed k go immediately behind, and so forth? (You never want a lower-probability card ranked before a higher-probability card.)
If you don't know the access probabilities, but you do know they exist and that card accesses are independent, I imagine sorting the cards similarly, but by number-of-accesses-seen-so-far instead is asymptotically optimal. (Move-to-front is cool too, but I don't see an obvious reason to use it here.)
Perhaps you get something interesting if you also penalise card moves; without that penalty, given any known probability distribution on card accesses, independent or not, I can just greedily re-sort the cards every time I do an access.
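As a sketch of the greedy layout for known probabilities (the numbers are invented): rank the cards by probability and deal them round-robin into the k slots, so each slot's i-th card is among the overall i-th most likely.

    probs = {"credit": 0.4, "photo ID": 0.25, "loyalty": 0.2,
             "work ID": 0.1, "gift card": 0.05}       # hypothetical access probabilities
    k = 2                                             # wallet slots

    ranked = sorted(probs, key=probs.get, reverse=True)
    slots = [ranked[i::k] for i in range(k)]          # deal round-robin into k slots
    print(slots)   # [['credit', 'loyalty', 'gift card'], ['photo ID', 'work ID']]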
