This question was posted on Leetcode here.
Design a data structure that can return the top trending keyword. The time complexity should be as low as possible; to be honest, I don't yet know how optimal the solution can be.
Given 2 parameters as input:
Parameter 1: String Username
Parameter 2: Array of String containing the keywords tweeted by the user.
Function declaration: function tweet(username, keywords[]){};
Example 1:
tweet("User1",["love","dog"])
tweet("User2",["cat"])
tweet("User3",["walk","cat"])
tweet("User2",["dog"])
tweet("User3",["like","dog"])
top trending keyword : Dog
Example 2:
a. tweet("User1",["Dog"])
b. tweet("User1",["like","Dog"])
c. tweet("User1",["love","Dog"])
d. tweet("User1",["walk","Dog"])
e. tweet("User1",["hate","Dog"])
f. tweet("User2",["like","cat"])
g. tweet("User3",["cat"])
top trending keyword: cat
explanation: Only consider the number of unique users who tweeted a particular keyword while calculating the top trending keyword.
For this question, I was able to come up with a solution (similar to one posted on Leetcode here) using:
1. Map, which holds the set of words for a given user,
2. Map which holds the word and its unique-user count.
3. Max Heap -> used to retrieve the top word based on the frequency.
However, for every word that is already in Map 2, before I can add it back to the PQ with its increased frequency, I need to do a remove operation, which is O(n).
E.g. in example 2 above, up to operation e
After a, Map1- [<User1,[Dog]>], Map2- [<Dog,1>], PQ- [1-Dog]
After b, Map1- [<User1,[Dog,like]>], Map2- [<Dog,1>,<like,1>], PQ- [1-Dog,1-like]
...
After e,
Map1- [<User1,[Dog,like,love,walk,hate]>],
Map2- [<Dog,1>,<like,1>,<love,1>,<walk,1>,<hate,1>],
**PQ-[1-Dog,1-like,1-love,1-walk,1-hate]**
After f,
Map1- [<User1,[Dog,like,love,walk,hate]>,<User2,[like,cat]>],
Map2- [<Dog,1>,<like,2>,<love,1>,<walk,1>,<hate,1>,<cat,1>],
**PQ- [2-like,1-Dog,1-love,1-walk,1-hate]**
My question is: after adding the entry "User2 - like,cat" in step f above, I need to re-balance the max heap, i.e. remove "like" and add it back so that it is now at the top of the heap.
Is this the optimal way, or can I optimize it further so that I don't incur the cost of remove() or re-balancing?
I tried with a TreeMap too, but couldn't figure out the right data structures.
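Not from the original post, but for concreteness, here is a minimal Java sketch of the Map/Map/heap structure described above, with the O(n) remove() call marked; the class and method names are my own:

    import java.util.*;

    class TrendingKeywords {
        private final Map<String, Set<String>> userWords = new HashMap<>(); // Map 1: user -> words tweeted
        private final Map<String, Integer> wordCount = new HashMap<>();     // Map 2: word -> unique-user count
        // Max heap ordered by the count stored in Map 2
        private final PriorityQueue<String> pq =
                new PriorityQueue<>((a, b) -> wordCount.get(b) - wordCount.get(a));

        public void tweet(String username, String[] keywords) {
            Set<String> seen = userWords.computeIfAbsent(username, u -> new HashSet<>());
            for (String word : keywords) {
                if (seen.add(word)) {                    // count each word at most once per user
                    wordCount.merge(word, 1, Integer::sum);
                    pq.remove(word);                     // O(n) removal -- the step in question
                    pq.add(word);                        // re-insert so the heap sees the new count
                }
            }
        }

        public String topTrending() {
            return pq.peek();
        }
    }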
We're given two strings that act as search queries. We need to determine if they're the same.
For example:
Query 1: stock price rate
Query 2: share cost rate
We're also given a list where each entry contains two words that are synonyms. Words can be repeated across entries, meaning a transitive relation exists. Something like this:
[
[cost,price]
[rate,price]
[share,equity]
]
The goal is to determine whether the queries mean the same thing.
I proposed a solution where I group words with similar meanings into lists, do an exhaustive search until I find the word from query 1, and then search its group for the word from query 2. But the interviewer wanted a more efficient approach, which I couldn't figure out. Is there a more efficient way to solve this problem?
Here is a solution that lets you tell whether two queries are similar in near-constant time (O(size of the queries)), with precomputation in O(number of words in the database).
Precomputation: we assume that you have a list of lists of synonyms, L.
    function build_hashmap(L):
        H <- new Hashmap()
        i <- 0
        for each synonyms_list in L do:
            for each word in synonyms_list do:
                H[word] <- i
            i <- i + 1
        return H
Now we can test if two words are synonyms using H
    function is_synonym(w1, w2, H):
        if H[w1] == H[w2]:
            return true
        else:
            return false
From there it should be rather easy to tell if two queries have the same meaning.
Edit:
A fast solution could be to implement the 'union-find' algorithm in order to build the hashmap.
Another way would be to first model the words as vertices of a graph and add edges for synonym relations.
Then you can build your hashmap by finding the connected components of the graph. Finding connected components in a graph can be done by traversing it.
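To illustrate the union-find idea (this is my own Java sketch, not the answerer's code; the class and method names are made up):

    import java.util.*;

    class SynonymIndex {
        private final Map<String, String> parent = new HashMap<>();

        // Find the representative of a word's group, with path compression.
        private String find(String w) {
            parent.putIfAbsent(w, w);
            String p = parent.get(w);
            if (!p.equals(w)) {
                p = find(p);
                parent.put(w, p);
            }
            return p;
        }

        // Merge the groups of two synonymous words.
        public void addSynonymPair(String a, String b) {
            parent.put(find(a), find(b));
        }

        public boolean isSynonym(String a, String b) {
            return find(a).equals(find(b));
        }
    }

With the example pairs, addSynonymPair("cost","price"), addSynonymPair("rate","price") and addSynonymPair("share","equity") make isSynonym("cost","rate") return true via the transitive link through "price"; comparing two queries then reduces to word-level isSynonym checks, as the answer above suggests.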
I have been trying to solve an optimization problem but have not been able to come up with an efficient solution.
Here's the problem
We are given data representing a sequence of bookings on a single car. Each booking consists of two points (start location, end location). Given two adjacent bookings b1, b2, we say a relocation is required between them if the end location of b1 is not equal to the start location of b2.
We have to design an algorithm that takes a sequence of bookings as input and outputs a single permutation of the input that minimizes the total number of relocations within the sequence.
Here's my approach
To me, it looks like one of the greedy scheduling problems, but I'm not able to derive any good heuristic for it from the existing scheduling problems. In the end, I thought of sorting the given sequence with insertion sort, on the basis of the difference between the end location of one booking and the start location of the next adjacent booking.
So, for our given problem
[(23, 42),(77, 45),(42, 77)] will get sorted to
[(23, 42),(42, 77),(77, 45)], thus matching each booking's end point with the next booking's start point.
Let's take another example
[(3,1),(1,3),(3,1),(2,2),(3,1),(2,3),(1,3),(1,1),(3,3),(3,2),(3,3)]
now after sorting till index 7 using insertion sort, our array will look like
[(3,1),(1,3),(3,1),(2,2),(2,3),(3,3),(3,1),(1,3),(3,3),(3,2),(3,3)]
Now for placing point (3,3) present at index 8 in the unsorted array we will do the following
The idea is to put each point in its correct location. For the point (3,3) at index 8, I will search the already sorted part of the array for the first entry whose end point matches 3, i.e. the starting point of this new point, subject to the condition that adding the point after that first found entry does not violate the invariant that the start of the next entry should match the end of this point. So we inserted (3,3) between (2,3) and (3,1). It looks like this
[(3,1),(1,3),(3,1),(2,2),(2,3),(3,3),(3,1),(1,3),(3,3),(3,2),(1,1)]
However, I'm not sure how I would prove whether this is optimal or not. Any pointer is highly appreciated. Is there a better way (I'm sure there is) that will help us solve this?
You can convert this easily into a graph problem.
[a, b] -> vertices a and b with an edge between a and b. Use DFS to find all connected components in this undirected graph and do some post processing.
It is linear in input size.
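The post-processing is not spelled out in this answer, but the connected-components step it describes could be sketched like this in Java (my own illustration; it assumes integer location IDs):

    import java.util.*;

    class BookingComponents {
        // Count connected components in the undirected graph whose edges are the bookings.
        static int countComponents(int[][] bookings) {
            Map<Integer, List<Integer>> adj = new HashMap<>();
            for (int[] b : bookings) {
                adj.computeIfAbsent(b[0], k -> new ArrayList<>()).add(b[1]);
                adj.computeIfAbsent(b[1], k -> new ArrayList<>()).add(b[0]);
            }
            Set<Integer> visited = new HashSet<>();
            int components = 0;
            for (Integer start : adj.keySet()) {
                if (visited.add(start)) {
                    components++;
                    Deque<Integer> stack = new ArrayDeque<>();   // iterative DFS
                    stack.push(start);
                    while (!stack.isEmpty()) {
                        int v = stack.pop();
                        for (int next : adj.getOrDefault(v, List.of())) {
                            if (visited.add(next)) stack.push(next);
                        }
                    }
                }
            }
            return components;
        }
    }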
I'm looking for a sorting algorithm that honors a min and max range for each element [1]. The problem domain is a recommendations engine that combines a set of business rules (the restrictions) with a recommendation score (the value). If we have a recommendation we want to promote (e.g. a special product or deal), or an announcement we want to appear near the top of the list (e.g. "This is super important, remember to verify your email address to participate in an upcoming promotion!") or near the bottom of the list (e.g. "If you liked these recommendations, click here for more..."), they will be curated with certain position restrictions in place. For example: this should always be in the top position, these should be in the top 10, or in the middle 5, etc. This curation step is done ahead of time, remains fixed for a given time period, and for business reasons must remain very flexible.
Please don't question the business purpose, UI or input validation. I'm just trying to implement the algorithm in the constraints I've been given. Please treat this as an academic question. I will endeavor to provide a rigorous problem statement, and feedback on all other aspects of the problem is very welcome.
So if we were sorting chars, our data would have a structure of
    struct {
        char value;
        Integer minPosition;
        Integer maxPosition;
    }
where minPosition and maxPosition may be null (unrestricted). If this algorithm were called on an input where all position restrictions were null, or all minPositions were 0 or less and all maxPositions were equal to or greater than the size of the list, then the output would just be the chars in ascending order.
This algorithm would only reorder two elements if the minPosition and maxPosition of both elements would not be violated by their new positions. An insertion-based algorithm which promotes items to the top of the list and reorders the rest has obvious problems in that every later element would have to be revalidated after each iteration; in my head, that rules out such algorithms for having O(n^3) complexity, but I won't rule out such algorithms without considering evidence to the contrary, if presented.
In the output list, certain elements will be out of order with regard to their value, if and only if the set of position constraints dictates it. These outputs are still valid.
A valid list is any list where all elements are in a position that does not conflict with their constraints.
An optimal list is a list which cannot be reordered to more closely match the natural order without violating one or more position constraints. An invalid list is never optimal. I don't have a strict definition I can spell out for 'more closely matching' between one ordering and another. However, I think it's fairly easy to let intuition guide you, or choose something similar to a distance metric.
Multiple optimal orderings may exist if multiple inputs have the same value. You could make an argument that the above paragraph is therefore incorrect, because either one can be reordered to the other without violating constraints and therefore neither can be optimal. However, any rigorous distance function would treat these lists as identical, with the same distance from the natural order and therefore reordering the identical elements is allowed (because it's a no-op).
I would call such outputs the correct, sorted order which respects the position constraints, but several commentators pointed out that we're not really returning a sorted list, so let's stick with 'optimal'.
For example, the following are input lists (in the form <char>(<minPosition>:<maxPosition>), where Z(1:1) indicates a Z that must be at the front of the list, M(-:-) indicates an M that may be in any position in the final list, and the natural order (sorted by value only) is A...M...Z) and their optimal orders.
Input order
A(1:1) D(-:-) C(-:-) E(-:-) B(-:-)
Optimal order
A B C D E
This is a trivial example to show that the natural order prevails in a list with no constraints.
Input order
E(1:1) D(2:2) C(3:3) B(4:4) A(5:5)
Optimal order
E D C B A
This example is to show that a fully constrained list is output in the same order it is given. The input is already a valid and optimal list. The algorithm should still run in O(n log n) time for such inputs. (Our initial solution is able to short-circuit any fully constrained list to run in linear time; I added the example both to drive home the definitions of optimal and valid and because some swap-based algorithms I considered handled this as the worst case.)
Input order
E(1:1) C(-:-) B(1:5) A(4:4) D(2:3)
Optimal Order
E B D A C
E is constrained to 1:1, so it is first in the list even though it has the lowest value. A is similarly constrained to 4:4, so it is also out of natural order. B has essentially identical constraints to C and may appear anywhere in the final list, but B will be before C because of value. D may be in positions 2 or 3, so it appears after B because of natural ordering but before C because of its constraints.
Note that the final order is correct despite being wildly different from the natural order (which is still A,B,C,D,E). As explained in the previous paragraph, nothing in this list can be reordered without violating the constraints of one or more items.
Input order
B(-:-) C(2:2) A(-:-) A(-:-)
Optimal order
A(-:-) C(2:2) A(-:-) B(-:-)
C remains unmoved because it is already in its only valid position. B is reordered to the end because its value is greater than both A's. In reality, there will be additional fields that differentiate the two A's, but from the standpoint of the algorithm, they are identical and preserving OR reversing their input ordering is an optimal solution.
Input order
A(1:1) B(1:1) C(3:4) D(3:4) E(3:4)
Undefined output
This input is invalid for two reasons: 1) A and B are both constrained to position 1 and 2) C, D, and E are constrained to a range that can only hold 2 elements. In other words, the ranges 1:1 and 3:4 are over-constrained. However, the consistency and legality of the constraints are enforced by UI validation, so it's officially not the algorithm's problem if they are incorrect, and the algorithm can return a best-effort ordering OR the original ordering in that case. Passing an input like this to the algorithm may be considered undefined behavior; anything can happen. So, for the rest of the question...
All input lists will have elements that are initially in valid positions.
The sorting algorithm itself can assume the constraints are valid and an optimal order exists. [2]
We've currently settled on a customized selection sort (with runtime complexity of O(n^2)) and reasonably proved that it works for all inputs whose position restrictions are valid and consistent (e.g. not overbooked for a given position or range of positions).
Is there a sorting algorithm that is guaranteed to return the optimal final order and run in better than O(n^2) time complexity? [3]
I feel that a library standard sorting algorithm could be modified to handle these constraints by providing a custom comparator that accepts the candidate destination position for each element. This would be equivalent to the current position of each element, so maybe modifying the value-holding class to include the current position of the element and do the extra accounting in the comparison (.equals()) and swap methods would be sufficient.
However, the more I think about it, an algorithm that runs in O(n log n) time could not work correctly with these restrictions. Intuitively, such algorithms are based on running n comparisons log n times. The log n is achieved by leveraging a divide and conquer mechanism, which only compares certain candidates for certain positions.
In other words, input lists with valid position constraints (i.e. counterexamples) exist for any O(n log n) sorting algorithm where a candidate element would be compared with an element (or range in the case of Quicksort and variants) with/to which it could not be swapped, and therefore would never move to the correct final position. If that's too vague, I can come up with a counter example for mergesort and quicksort.
In contrast, an O(n^2) sorting algorithm makes exhaustive comparisons and can always move an element to its correct final position.
To ask an actual question: Is my intuition correct when I reason that an O(n log n) sort is not guaranteed to find a valid order? If so, can you provide more concrete proof? If not, why not? Is there other existing research on this class of problem?
1: I've not been able to find a set of search terms that points me in the direction of any concrete classification of such sorting algorithm or constraints; that's why I'm asking some basic questions about the complexity. If there is a term for this type of problem, please post it up.
2: Validation is a separate problem, worthy of its own investigation and algorithm. I'm pretty sure that the existence of a valid order can be proven in linear time:
Allocate array of tuples of length equal to your list. Each tuple is an integer counter k and a double value v for the relative assignment weight.
Walk the list, adding the fractional value of each element's position constraint to the corresponding range and incrementing its counter by 1 (e.g. range 2:5 on a list of 10 adds 0.4 to each of 2, 3, 4, and 5 on our tuple list, incrementing the counter of each as well)
Walk the tuple list and
If no entry has value v greater than the sum of the series from 1 to k of 1/k, a valid order exists.
If there is such a tuple, the position it is in is over-constrained; throw an exception, log an error, use the doubles array to correct the problem elements etc.
Edit: This validation algorithm itself is actually O(n^2). In the worst case, every element has the constraints 1:n and you end up walking your list of n tuples n times. This is still irrelevant to the scope of the question, because in the real problem domain, the constraints are enforced once and don't change.
Determining that a given list is in valid order is even easier. Just check each element's current position against its constraints.
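That simpler check might look like this (my own sketch, assuming the struct from the question is a Java class named Element and that positions are 1-based as in the examples):

    // Returns true if every element's current (1-based) position satisfies its constraints.
    static boolean isValidOrder(List<Element> list) {
        for (int pos = 1; pos <= list.size(); pos++) {
            Element e = list.get(pos - 1);
            if (e.minPosition != null && pos < e.minPosition) return false;
            if (e.maxPosition != null && pos > e.maxPosition) return false;
        }
        return true;
    }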
3: This is admittedly a little bit premature optimization. Our initial use for this is for fairly small lists, but we're eyeing expansion to longer lists, so if we can optimize now we'd get small performance gains now and large performance gains later. And besides, my curiosity is piqued and if there is research out there on this topic, I would like to see it and (hopefully) learn from it.
On the existence of a solution: You can view this as a bipartite digraph with one set of vertices (U) being the k values, and the other set (V) the k ranks (1 to k), and an arc from each vertex in U to its valid ranks in V. Then the existence of a solution is equivalent to the maximum matching being a bijection. One way to check for this is to add a source vertex with an arc to each vertex in U, and a sink vertex with an arc from each vertex in V. Assign each edge a capacity of 1, then find the max flow. If it's k then there's a solution, otherwise not.
http://en.wikipedia.org/wiki/Maximum_flow_problem
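With unit capacities this max-flow check is just bipartite matching, so one concrete way to implement the existence test is Kuhn's augmenting-path algorithm. The sketch below is my own illustration of that idea, not code from this answer:

    import java.util.*;

    class MatchingExistence {
        // canGo.get(u) lists the ranks (0-based) that element u may occupy; assumes canGo.size() == k.
        // Returns true iff every element can be assigned a distinct legal rank (a perfect matching).
        static boolean assignmentExists(List<List<Integer>> canGo, int k) {
            int[] rankOwner = new int[k];
            Arrays.fill(rankOwner, -1);
            int matched = 0;
            for (int u = 0; u < canGo.size(); u++) {
                if (tryAssign(u, canGo, rankOwner, new boolean[k])) matched++;
            }
            return matched == k;
        }

        private static boolean tryAssign(int u, List<List<Integer>> canGo,
                                         int[] rankOwner, boolean[] visited) {
            for (int rank : canGo.get(u)) {
                if (visited[rank]) continue;
                visited[rank] = true;
                // Take a free rank, or evict the current owner if it can move elsewhere.
                if (rankOwner[rank] == -1 || tryAssign(rankOwner[rank], canGo, rankOwner, visited)) {
                    rankOwner[rank] = u;
                    return true;
                }
            }
            return false;
        }
    }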
--edit-- O(k^3) solution: First sort to find the sorted rank of each vertex (1-k). Next, consider your values and ranks as 2 sets of k vertices, U and V, with weighted edges from each vertex in U to all of its legal ranks in V. The weight to assign each edge is the distance from the vertex's rank in sorted order. E.g., if U is 10 to 20, then the natural rank of 10 is 1. An edge from value 10 to rank 1 would have a weight of zero, to rank 3 would have a weight of 2. Next, assume all missing edges exist and assign them infinite weight. Lastly, find the "MINIMUM WEIGHT PERFECT MATCHING" in O(k^3).
http://www-math.mit.edu/~goemans/18433S09/matching-notes.pdf
This does not take advantage of the fact that the legal ranks for each element in U are contiguous, which may help get the running time down to O(k^2).
Here is what a coworker and I have come up with. I think it's an O(n^2) solution that returns a valid, optimal order if one exists, and a closest-possible effort if the initial ranges were over-constrained. I just tweaked a few things about the implementation and we're still writing tests, so there's a chance it doesn't work as advertised. This over-constrained condition is detected fairly easily when it occurs.
To start, things are simplified if you normalize your inputs to have all non-null constraints. In linear time, that is:
    for each item in input
        if an item doesn't have a minimum position, set it to 1
        if an item doesn't have a maximum position, set it to the length of your list
The next goal is to construct a list of ranges, each containing all of the candidate elements that have that range, ordered by the remaining capacity of the range ascending (so ranges with the fewest remaining spots come first), then by start position of the range, then by end position of the range. This can be done by creating a set of such ranges, then sorting them in O(n log n) time with a simple comparator.
For the rest of this answer, a range will be a simple object like so
    class Range<T> implements Collection<T> {
        int startPosition;
        int endPosition;
        Collection<T> items;

        public int remainingCapacity() {
            return endPosition - startPosition + 1 - items.size();
        }

        // Implement the Collection<T> methods by passing through to the items collection.
        @Override
        public boolean add(T item) {
            // Validity checking here exposes some simple cases of over-constraining.
            // We'll catch these cases with the tricky stuff later anyway, so don't choke.
            return items.add(item);
        }
    }
If an element A has range 1:5, construct a range(1,5) object and add A to its elements. This range has remaining capacity of 5 - 1 + 1 - 1 (max - min + 1 - size) = 4. If an element B has range 1:5, add it to your existing range, which now has capacity 3.
Then it's a relatively simple matter of picking the best element that fits each position 1 => k in turn. Iterate your ranges in their sorted order, keeping track of the best eligible element, with the twist that you stop looking if you've reached a range whose remaining size can't fit into its remaining positions. This is equivalent to the simple calculation range.max - current position + 1 > range.size (which can probably be simplified, but I think it's most understandable in this form). Remove each element from its range as it is selected. Remove each range from your list as it is emptied (optional; iterating an empty range will yield no candidates). That's a poor explanation, so let's do one of our examples from the question; a code sketch of this selection loop also appears after the worked example. Note that C(-:-) has been updated to the sanitized C(1:5) as described above.
Input order
E(1:1) C(1:5) B(1:5) A(4:4) D(2:3)
Built ranges (min:max) <remaining capacity> [elements]
(1:1)0[E] (4:4)0[A] (2:3)1[D] (1:5)3[C,B]
Find best for 1
Consider (1:1), best element from its list is E
Consider further ranges?
range.max - current position + 1 > range.size ?
range.max = 1; current position = 1; range.size = 1;
1 - 1 + 1 > 1 = false; do not consider subsequent ranges
Remove E from range, add to output list
Find best for 2; current range list is:
(4:4)0[A] (2:3)1[D] (1:5)3[C,B]
Consider (4:4); skip it because it is not eligible for position 2
Consider (2:3); best element is D
Consider further ranges?
3 - 2 + 1 > 1 = true; check next range
Consider (1:5); best element is B
End of range list; remove B from range, add to output list
An added simplifying factor is that the capacities do not need to be updated or the ranges reordered. An item is only removed if the rest of the higher-sorted ranges would not be disturbed by doing so. The remaining capacity is never checked after the initial sort.
Find best for 3; output is now E, B; current range list is:
(4:4)0[A] (2:3)1[D] (1:5)3[C]
Consider (4:4); skip it because it is not eligible for position 3
Consider (2:3); best element is D
Consider further ranges?
same as previous check, but current position is now 3
3 - 3 + 1 > 1 = false; don't check next range
Remove D from range, add to output list
Find best for 4; output is now E, B, D; current range list is:
(4:4)0[A] (1:5)3[C]
Consider (4:4); best element is A
Consider further ranges?
4 - 4 + 1 > 1 = false; don't check next range
Remove A from range, add to output list
Output is now E, B, D, A and there is one element left to be checked, so it gets appended to the end. This is the output list we desired to have.
This build process is the longest part. At its core, it's a straightforward n^2 selection sorting algorithm. The range constraints only work to shorten the inner loop and there is no loopback or recursion; but the worst case (I think) is still the sum over i from 0 to n-1 of (n - i - 1), which is n^2/2 - n/2.
The detection step comes into play by not excluding a candidate range if the current position is beyond the end of that range's max position. You have to track the range your best candidate came from in order to remove it, so when you do the removal, just check if the position you're extracting the candidate for is greater than that range's endPosition.
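For concreteness, here is a rough, untested Java sketch of the whole procedure (normalization assumed already done, then range building and the selection loop traced above). It reflects my reading of the algorithm; the Item and Bucket classes (a simplified stand-in for the Range<T> shown earlier) and the use of Collections.min for "best element" are my own choices, and the over-constraint detection is only noted in a comment:

    import java.util.*;

    class ConstrainedSelection {
        static class Item {
            char value;
            int minPosition, maxPosition;   // normalized: 1-based, non-null
            Item(char value, int minPosition, int maxPosition) {
                this.value = value; this.minPosition = minPosition; this.maxPosition = maxPosition;
            }
        }

        static class Bucket {
            final int startPosition, endPosition;
            final List<Item> items = new ArrayList<>();
            Bucket(int s, int e) { startPosition = s; endPosition = e; }
            int remainingCapacity() { return endPosition - startPosition + 1 - items.size(); }
        }

        // Assumes a valid order exists, as the question stipulates.
        static List<Item> order(List<Item> input) {
            int n = input.size();
            // Group items into ranges keyed by (min, max).
            Map<Long, Bucket> byBounds = new LinkedHashMap<>();
            for (Item it : input) {
                long key = ((long) it.minPosition << 32) | it.maxPosition;
                byBounds.computeIfAbsent(key, k -> new Bucket(it.minPosition, it.maxPosition)).items.add(it);
            }
            // Sort ranges: fewest remaining spots first, then by start, then by end.
            List<Bucket> ranges = new ArrayList<>(byBounds.values());
            ranges.sort(Comparator.comparingInt(Bucket::remainingCapacity)
                    .thenComparingInt(r -> r.startPosition)
                    .thenComparingInt(r -> r.endPosition));

            List<Item> output = new ArrayList<>();
            for (int pos = 1; pos <= n; pos++) {
                Item best = null;
                Bucket bestRange = null;
                for (Bucket r : ranges) {
                    if (r.items.isEmpty() || r.startPosition > pos) continue; // empty or not yet eligible
                    Item candidate = Collections.min(r.items, Comparator.comparingInt((Item i) -> i.value));
                    if (best == null || candidate.value < best.value) { best = candidate; bestRange = r; }
                    // Stop once this range must be drained into its remaining positions.
                    if (r.endPosition - pos + 1 <= r.items.size()) break;
                }
                // If bestRange.endPosition < pos here, the input was over-constrained (detection step).
                bestRange.items.remove(best);
                output.add(best);
            }
            return output;
        }
    }

Running this on the example above reproduces the traced output E, B, D, A, C, and on the unconstrained and fully constrained examples it reproduces A B C D E and E D C B A respectively.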
I have several other counter-examples that foiled my earlier algorithms, including a nice example that shows several over-constraint detections on the same input list and also how the final output is as close to the optimal as the constraints will allow. In the meantime, please post any optimizations you can see and especially any counter-examples where this algorithm makes an objectively incorrect choice (i.e. arrives at an invalid or suboptimal output when one exists).
I'm not going to accept this answer, because I specifically asked if it could be done in better than O(n^2). I haven't wrapped my head around the constraint-satisfaction approach in #DaveGalvin's answer yet and I've never done a maximum flow problem, but I thought this might be helpful for others to look at.
Also, I discovered the best way to come up with valid test data is to start with a valid list and randomize it: for each position i, create a random value and constraints such that min < i < max. (Again, posting it because it took me longer than it should have to come up with and others might find it helpful.)
Not likely*. I assume you mean an average run time of O(n log n), in-place, non-stable, off-line. Most sorting algorithms that improve on bubble sort's average run time of O(n^2), like Timsort, rely on the assumption that comparing two elements in a subset will produce the same result in the superset. A slower variant of Quicksort would be a good approach for your range constraints. The worst case won't change, but the average case will likely decrease, and the algorithm will have the extra constraint that a valid sort exists.
Is ... O(n log n) sort is not guaranteed to find a valid order?
All popular sort algorithms I am aware of are guaranteed to find an order so long as their constraints are met. Formal analysis (concrete proof) is on each sort algorithm's Wikipedia page.
Is there other existing research on this class of problem?
Yes; there are many journals like IJCSEA with sorting research.
*but that depends on your average data set.
I have a question about a ranking algorithm that may not exist yet:
I have a list ordered by a score, for example the following list (denoted list-a):
Now I have new information telling me that the list should be ranked as follows (denoted list-b):
The question here is: how do we construct a new ranking for list-a following the restrictions in list-b?
We can say that the new list must:
It must follow the rank in list-b.
Try to have as few conflicts as possible with the rank in list-a (e.g. a conflict: list-a says a > b, but now we say b > a => conflict).
The problem here is that list-b doesn't have information about c, e, g (marked in red in list-a). Now we need to construct a new ranking for list-a following the restrictions in list-b.
My current solution:
We can certainly solve it using a brute-force strategy as follows: add the missing items c, e, g to list-b one by one and find the best place for each by:
Select one place for it in list-b (e.g. a > c > d > b > f)
Next, check the number of conflicts with list-a, then select the position that has the fewest conflicts.
For example, with c we can do as follows:
When we have an equal number of conflicts for different positions, we select the first position (I guess). Following this approach, we can add the items one by one up to the final item.
This is my "bad way" to do it, so do you have any better idea for this problem? Because my list is really long (about 1 million items), if follow this way, it must be too expensive for computation.
Looking forward to hearing your suggestion.
Interesting problem. I am assuming that list-b also has the updated scores. What you could do is make list-a into a dictionary where the item is the key and the score is the value. Then you could iterate through list-b, look up each item in constant time, and update its score. You could then turn the dictionary back into a collection and use the built-in sort. This would run in O(n log n). Hope this helps.
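Under that assumption (list-b carries the updated scores), a minimal Java sketch of this dictionary-then-sort idea could look like the following; the method and variable names are mine:

    import java.util.*;

    class Reranker {
        // listA: item -> original score;  listB: item -> updated score for the items it mentions
        static List<String> rerank(Map<String, Double> listA, Map<String, Double> listB) {
            Map<String, Double> scores = new HashMap<>(listA); // turn list-a into a dictionary: O(n)
            scores.putAll(listB);                              // constant-time lookup/update per list-b item
            List<String> result = new ArrayList<>(scores.keySet());
            // Built-in sort, descending by score: O(n log n) overall
            result.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
            return result;
        }
    }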
I have:
1 million university student names and
3 million bank customer names
I managed to convert the strings into numerical values based on hashing (similar strings have similar hash values). I would like to know how I can determine the correlation between these two sets to see if the values pair up at least 60%.
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution is normally straightforward, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution. So, if possible, you shouldn't use hashes, but rather the original strings.
Assuming using original strings is an option, then you would want to do something like this:
    List A (1M), List B (3M)

    // First, match the entities that match very well, and REMOVE them.
    for a in List A
        for b in List B
            if compare(a, b) >= MATCH_THRESHOLD   // This may be 90% etc.
                add (a, b) to matchedList
                remove a from List A
                remove b from List B

    // Now, match the entities that match well, and run bipartite matching.
    // Bipartite matching is required because each entity can match "acceptably well"
    // with more than one entity on the other side.
    for a in List A
        for b in List B
            compute compare(a, b)
            set edge(a, b) = compare(a, b)
            if compare(a, b) < THRESHOLD   // This seems to be 60%
                set edge(a, b) = 0

    // Now, run bipartite matcher and take results
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend upon your specific entity resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists in A and B that are partitioned by the first couple of characters of the last name, and just run this algorithm between corresponding lists. But it may very well be that the last name "Nuth" is supposed to match "Knuth", etc. So, some local knowledge of what your name comparison function is can help you divide and conquer this problem better.
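For the partitioning idea in the last paragraph, here is a rough Java sketch (my own illustration; the compare() function and the name format are placeholders, and as noted above it would miss pairs like "Nuth"/"Knuth"):

    import java.util.*;
    import java.util.function.BiFunction;

    class NameBlocking {
        // Partition list B by the first two characters of the last name, then only
        // compare names within the same block, instead of all |A| x |B| pairs.
        static List<String[]> candidatePairs(List<String> listA, List<String> listB,
                                             BiFunction<String, String, Double> compare,
                                             double threshold) {
            Map<String, List<String>> blocksB = new HashMap<>();
            for (String b : listB) blocksB.computeIfAbsent(blockKey(b), k -> new ArrayList<>()).add(b);

            List<String[]> pairs = new ArrayList<>();
            for (String a : listA) {
                for (String b : blocksB.getOrDefault(blockKey(a), List.of())) {
                    if (compare.apply(a, b) >= threshold) pairs.add(new String[] {a, b});
                }
            }
            return pairs;
        }

        private static String blockKey(String fullName) {
            String last = fullName.substring(fullName.lastIndexOf(' ') + 1).toLowerCase();
            return last.substring(0, Math.min(2, last.length()));
        }
    }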