Range search with KNN on two different dimensions - algorithm

I have a few million records (which are updated often) with two properties:
Timestamp
Popularity score
I'm looking for a data structure (maybe some metric tree?) that can do a fast range search on one dimension (e.g. all records with timestamp greater than some value) and locate the top K records within that range on the other dimension (popularity score). In other words, the query is "find the top K popular records with timestamp greater than T".
I currently have a naive implementation where I filter the N records in linear time and then identify the top K records using a partial sort. But this is not fast enough given the number of concurrent users we need to support.
I'm not super familiar with k-d trees, but I see that some popular implementations support both range searches and finding the K nearest neighbours. My requirements are a bit peculiar here, so I'm wondering if there is a way to do this faster, perhaps at the expense of some additional indexing overhead.
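For reference, a minimal sketch of that naive approach in Python (the record layout and names are illustrative):

import heapq

# records: list of (timestamp, popularity_score, record_id) tuples
def top_k_after_naive(records, t, k):
    in_range = [r for r in records if r[0] > t]                # O(N) filter
    return heapq.nlargest(k, in_range, key=lambda r: r[1])     # partial sort, O(N log k)

recs = [(1, 10, "a"), (3, 70, "b"), (5, 40, "c"), (8, 90, "d")]
print(top_k_after_naive(recs, 2, 2))   # [(8, 90, 'd'), (3, 70, 'b')]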

If you invest in initially sorting a list of (record_name, timestamp) tuples by timestamp, and create a dictionary with the record names as keys and (popularity_score, timestamp_list_idx) tuples as values, you will be able to:
Perform a binary search for a particular timestamp in O(log n)
Locate the greater-than values in O(1), since the array is sorted
Extract the matching popularity score in O(1), since it is stored in the dictionary
Update a record's popularity score in O(1), thanks to the dictionary
Update a particular timestamp in O(1), by pulling the record's index from the tuple in the dictionary value
Suppose you have m records within the wanted timestamp range. You can either:
Build a max-heap from them by popularity, which takes O(m), and then perform k pops from that heap at O(k log m), since we need to re-heapify after every pop. The whole query therefore takes O(m + k log m); assuming k << m, this runs in O(m).
Iterate over the m records while keeping a list of size k to track the top k popular records. After passing over all m records you will have the top k in the list. This takes O(m) as well.
Method 1 takes a little more time than method 2 in terms of complexity, but if you suddenly want to know the (k+1)-th most popular record, you can just pop another item from the heap instead of passing over all m records again with a (k+1)-long list.
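A minimal Python sketch of this scheme, assuming everything fits in memory; by_time, records and top_k_after are illustrative names, heapq.nlargest stands in for the heap-based selection step, and bisect's key parameter needs Python 3.10+:

import bisect, heapq

# sorted list of (timestamp, record_name), kept in timestamp order
by_time = [(1, "a"), (3, "b"), (5, "c"), (8, "d"), (9, "e")]
# record_name -> (popularity_score, index into by_time)
records = {"a": (10, 0), "b": (70, 1), "c": (40, 2), "d": (90, 3), "e": (20, 4)}

def top_k_after(t, k):
    # binary search for the first timestamp greater than t: O(log n)
    start = bisect.bisect_right(by_time, t, key=lambda rec: rec[0])
    candidates = by_time[start:]                 # the m records in range
    # partial selection of the k most popular (keeps a heap of size k: O(m log k))
    return heapq.nlargest(k, candidates, key=lambda rec: records[rec[1]][0])

print(top_k_after(3, 2))   # [(8, 'd'), (5, 'c')]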

Related

Finding tuple with maximum difference between its minimum and maximum first element

Given an array of elements of the form (a, b), where a is an integer and b is a string. The array is sorted by a, the first element. We have to find the string b which has the maximum difference between its lowest a and its highest a.
My thoughts:
A simple approach is to hash each string into a hash table, ensuring that no two different strings map to the same bucket. For any string b we need to store only two elements in its bucket: the maximum a encountered so far and the minimum a encountered so far. Once the hash table is populated we simply have to iterate over all strings and find the one with the maximum difference.
This could run in O(N) time.
But the only questionable assumption here is that different strings go into different buckets. That cannot be guaranteed by any implementation of a hash table while maintaining the average time complexity of insert, search and delete at Theta(1).
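A short Python sketch of this idea, using a dict as the hash table (the input format, a list of (a, b) pairs, is assumed):

def max_spread_string(pairs):
    buckets = {}                                  # b -> [min_a, max_a]
    for a, b in pairs:
        if b not in buckets:
            buckets[b] = [a, a]
        else:
            lo, hi = buckets[b]
            buckets[b] = [min(lo, a), max(hi, a)]
    # one pass over the distinct strings to find the maximum spread
    return max(buckets, key=lambda b: buckets[b][1] - buckets[b][0])

print(max_spread_string([(1, "x"), (2, "y"), (5, "x"), (9, "y")]))   # y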

Algorithm for top K stock in electronic exchange

You work in an electronic exchange. Throughout the day, you receive ticks (trading data), each consisting of a product name and its traded volume. E.g. {name: vodafone, volume: 20}
What data structure will you maintain if:
You have to report the top k products traded by volume at the end of the day.
You have to report the top k products traded by volume throughout the day.
What's the most efficient solution that you can think of?
The most efficient solution I could think of was to use a heap and a map for both situations:
a heap to store stocks by decreasing volume (updating: O(log n), getting the top k: O(k))
a map to track each stock's volume (updating: O(1))
What you're looking for is a kind of map or dictionary which supports the following queries:
Add(key, x): add x to the total for that key, creating a new entry if it doesn't already exist.
GetKLargest(k): return the keys/totals for the k largest entries.
Let's say Q is the number of queries, and n is the number of distinct keys. We should assume that Q is much larger than n; choosing the NYSE as an example, there are a few thousand stocks traded, and a few million trades per day.
In the first scenario we assume that there are a large number of Add queries followed by one GetKLargest query. Since the cost of the Add query dominates, we can use a hashtable so that Add takes O(1) time, and then at the end of the day we can do GetKLargest in O(n log k) time using a priority queue of size k; note that we don't need to sort the whole key-set in O(n log n) time just to find the k largest elements. The total cost of answering Q queries is O(Q + n log k).
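A rough sketch of this first scenario in Python (the tick format and names are assumptions):

import heapq
from collections import defaultdict

totals = defaultdict(int)                 # hash table: product -> total volume

def add(name, volume):                    # O(1) per tick
    totals[name] += volume

def get_k_largest(k):                     # O(n log k), run once at end of day
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

add("vodafone", 20); add("verizon", 50); add("vodafone", 40)
print(get_k_largest(1))                   # [('vodafone', 60)]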
In the second scenario, we assume there could be a large number of both kinds of query. The cost of either query could dominate. A good option is to use an order statistic tree, which supports Add in O(log n) time, and GetKLargest in O(k log n) time. To look up a company by name in the tree requires a separate index, which can be maintained as a hashtable. The total cost is O(Qk log n) in the worst case.
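A sketch of the second scenario, with sortedcontainers.SortedList (a third-party package) standing in for the order statistic tree:

from collections import defaultdict
from sortedcontainers import SortedList

totals = defaultdict(int)                 # product -> current total (the index)
by_volume = SortedList()                  # (total, product) pairs, kept sorted

def add(name, volume):                    # O(log n)
    old = totals[name]
    if old:
        by_volume.remove((old, name))     # drop the stale entry
    totals[name] = old + volume
    by_volume.add((totals[name], name))

def get_k_largest(k):                     # O(k + log n)
    return [(name, total) for total, name in reversed(by_volume[-k:])]

add("vodafone", 20); add("verizon", 50); add("vodafone", 40)
print(get_k_largest(2))                   # [('vodafone', 60), ('verizon', 50)]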
If k is fixed or has a fixed limit, we can do better: keep the totals in a hashtable, but also maintain a priority queue of the current top k elements alongside. The cost of the Add query is now O(log k) because of maintaining the priority queue; to do this efficiently we need the map to also store the current index of each company in the priority queue, if it's there, otherwise searching the priority queue for the right company is O(k). The cost of GetKLargest is O(k) since we just output the contents of the priority queue. (The problem doesn't say we need to output them in order. If we do, then we can use a sorted array instead of a heap for the priority queue, and Add takes O(k) time.)
In this case, the total cost of answering Q queries is O(Qk). Note that this only works if we know in advance the maximum value of k that could be queried, before the query arrives; otherwise we don't know how big to make the priority queue.

Most Efficient way to compute the 99th percentile of a data set

I have 100 integers in my database.
I sort them in ascending order.
Right now, for the 99th percentile, I take the 99th number after sorting.
After a given time t, a new number comes into the database and an older number gets discarded.
The current code just takes the 100 integers and sorts them all over again.
There are 99 numbers shared by the original set of 100 integers and the set of 100 integers after time t. Is there a more efficient way of calculating the 99th percentile, 95th percentile, 90th percentile, etc.?
PS: All of this is done in a MySQL database.
Let's call N the size of your array A (here N = 100) and you're looking for the K-th smallest element (after some modification requests).
The easiest solution is probably a kind of modified insertion sort: you keep a (sorted) array of the N-K+1 largest elements (let's call it B).
Discard an element e: walk through B (e.g. while B[i] < e) (*). If B[i] = e, shift all elements at indices < i one position to the right.
Insert an element e: find the lowest index i such that B[i] > e. Shift all elements at indices >= i one position to the right and set B[i] := e.
Get the K-th smallest element: return B[0].
Time complexity: O(N-K) per request.
(*) Actually you could speed up the search step using binary search, but it won't change the overall time complexity.
If N-K is very large, it would be interesting to use binary trees instead (with O(log(N-K)) time complexity per request). But given the actual size of your data sets (and your programming language) it won't pay off.
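As a small Python sketch of this kind of bookkeeping: for simplicity it keeps the whole window of N values sorted (rather than only the N-K+1 largest), which is still O(N) per request and more than fast enough for N = 100; the class and names are illustrative:

import bisect

class PercentileWindow:
    def __init__(self, values):
        self.a = sorted(values)                  # sorted once, then maintained

    def replace(self, old, new):
        # discard `old`, insert `new`; O(N) because of the shifts
        self.a.pop(bisect.bisect_left(self.a, old))
        bisect.insort(self.a, new)

    def kth(self, k):
        return self.a[k - 1]                     # K-th smallest element

w = PercentileWindow(range(1, 101))              # the 100 integers 1..100
w.replace(100, 250)
print(w.kth(99))                                 # 99  (the 99th percentile)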
If your data are randomly distributed, you could try guessing the position by assuming a linear distribution:
guessPosition = (newnumber - min) * 100 / (max - min)
Then do a galloping search outward from that point,
and when the right spot is found, insert the element at the correct position.
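One possible reading of that idea in Python (a sketch; the galloping details and names are my own):

def gallop_insert(arr, e):
    # guess the position assuming a roughly linear distribution of values
    lo, hi = arr[0], arr[-1]
    i = int((e - lo) * (len(arr) - 1) / (hi - lo)) if hi > lo else 0
    i = max(0, min(len(arr) - 1, i))
    step = 1
    if arr[i] <= e:                       # gallop right in growing steps
        while i < len(arr) and arr[i] <= e:
            i += step
            step *= 2
        i = min(i, len(arr))
        while i > 0 and arr[i - 1] > e:   # back off to the exact slot
            i -= 1
    else:                                 # gallop left in growing steps
        while i > 0 and arr[i] > e:
            i -= step
            step *= 2
        i = max(i, 0)
        while i < len(arr) and arr[i] <= e:
            i += 1
    arr.insert(i, e)                      # insert at the correct position

nums = [1, 5, 9, 12, 20, 33, 47]
gallop_insert(nums, 18)
print(nums)                               # [1, 5, 9, 12, 18, 20, 33, 47]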
So: insert into the normal table, and also add a trigger to insert into an extra, sorted table. Every time you insert into the extra table, add the new element; then, using the index, it should be fast to find the smallest (or largest) element. Drop that element. Now either re-compute the new percentile (if the number of items K is small), or keep the sum of the elements stored somewhere, subtract the discarded value and add the new one. Then you have the sum without iterating over the whole list, and the total number of elements should also be quick to get from the DB. This should take roughly O(log(N-K)) time. I think this was a Google interview question (minus the DB part).

Updating & Querying all elements in array >= X where X is variable fast

Formally, we are given an array with some initial values. Then we have 3 types of queries:
Point updates: increment by 1 at a given position
Range queries: count the number of elements >= x, where x is taken as input
Range updates: decrement by 1 all elements >= x, where x is given as input
N = 10^5, Q = 10^5 (number of elements in the array and number of queries, respectively)
I tried doing this with a segment tree, but operations 2 and 3 can be worse than O(n), since we don't know exactly which range is to be updated, so we may end up traversing the whole segment tree.
NOTE: I want to make clear that we need to do all 3 operations in logarithmic worst case, i.e. O(log n), because only then can we do this fast. A linear approach doesn't work, since Q = 10^5 and N = 10^5, so the worst case would be about 10^10 operations, which is clearly not feasible.
Given that you're talking about 10^5 items, and don't mention needing to add or remove items, it seems to me that the obvious data structure would be a simple sorted vector.
Operation complexities:
Point update: O(1) + O(m) (where m is the number of subsequent elements equal to the value before the update).
Range query: O(log n) + O(m) (where n is the number of elements and m is the number of elements in the range).
Range update: same as range query.
It's a little difficult to be sure what "fast" means to you, but the fastest theoretically possible for operation 1 is O(1), so we're already within some constant factor of optimal.
For operations 2 and 3, even if we could do the find with constant complexity, we're pretty much stuck with O(m) for the update. Since log2(100,000) ≈ 16.6, most of the time the O(m) term is going to dominate (i.e., the update part will involve as many operations as the search unless the given x is among the last 17 items in the collection).
I doubt there's any point for this small a collection, but if you might have to deal with a substantially larger collection and the items in the collection are reasonably predictably distributed, it might be worth considering an interpolating search instead of a binary search. With a predictable distribution this reduces the expected number of comparisons to approximately O(log log n); in this case, that would be roughly 4 (though normally with a higher constant factor). This might be a win for 10^5 items, but then again it might not. If you might have to deal with a collection of, say, 10^8 items or more, it would be much more likely to be a substantial win.
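A concrete illustration of those sorted-vector operations (a Python sketch; it tracks only the multiset of values, with positions assumed to be kept in the original array on the side):

import bisect

vals = sorted([3, 1, 4, 1, 5])          # the array's values, in sorted order

def point_update(old):
    # one element with value `old` is incremented by 1
    j = bisect.bisect_right(vals, old) - 1    # last slot of the run of `old`
    vals[j] = old + 1                         # still sorted: any later value is > old

def range_query(x):
    # how many elements are >= x?  O(log n)
    return len(vals) - bisect.bisect_left(vals, x)

def range_update(x):
    # decrement every element >= x by 1; O(log n + m), the suffix stays sorted
    for j in range(bisect.bisect_left(vals, x), len(vals)):
        vals[j] -= 1

print(range_query(3))   # 3   (the elements 3, 4, 5)
point_update(1)         # one of the 1s becomes 2
print(range_query(2))   # 4   (2, 3, 4, 5)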
The following may not be optimal, but is the best I could think of tonight.
Let's start by trying to turn the problem sideways. Instead of a map from indices to values, let's consider a map from values to sets of indices. A point update now involves removing an index from one set and adding it to another. A range update involves either simply moving an index set from one value to another or taking the union of two index sets. A range query involves folding over the sets corresponding to the values in range. A quick peek at Wikipedia suggests a traditional disjoint-set data structure is really great for set unions. Unfortunately, it's no good at all for removing an element from a set.
Fortunately, there is a newer data structure supporting union-find with constant time deletion! That takes care of both point updates and range updates quite naturally. Range queries, unfortunately, will require checking all array elements, even if very few elements are in range.

Find the N-th most frequent number in the array

Find the N-th most frequent number in an array.
(There is no limit on the range of the numbers)
I think we can:
(i) store the occurrence count of every element using a map in C++,
(ii) build a max-heap of the occurrence counts (frequencies) in linear time and then extract up to the N-th element; each extraction takes O(log n) time to re-heapify,
(iii) this gives us the frequency of the N-th most frequent number,
(iv) then we can linearly search through the map to find the element having this frequency.
Time: O(N log N)
Space: O(N)
Is there any better method ?
It can be done in linear time and space. Let T be the total number of elements in the input array from which we have to find the Nth most frequent number:
Count and store the frequency of every number from the input array in a map. Let M be the total number of distinct elements in the array, so the size of the map is M. -- O(T)
Find the N-th largest frequency in the map using a selection algorithm. -- O(M)
Total time = O(T) + O(M) = O(T)
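A sketch of this in Python, with a simple randomized quickselect standing in for the selection algorithm (expected O(M)); the helper names are illustrative:

import random
from collections import Counter

def kth_largest(xs, k):                       # k is 1-based
    pivot = random.choice(xs)
    greater = [x for x in xs if x > pivot]
    equal = [x for x in xs if x == pivot]
    if k <= len(greater):
        return kth_largest(greater, k)
    if k <= len(greater) + len(equal):
        return pivot
    return kth_largest([x for x in xs if x < pivot], k - len(greater) - len(equal))

def nth_most_frequent(arr, n):
    freq = Counter(arr)                            # step 1: frequencies, O(T)
    target = kth_largest(list(freq.values()), n)   # step 2: selection, expected O(M)
    return next(x for x, c in freq.items() if c == target)

print(nth_most_frequent([1, 1, 1, 2, 2, 3], 2))    # 2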
Your method is basically right. You would avoid the final hash search if you marked each vertex of the constructed heap with the number it represents. Moreover, it is possible to keep watching the N-th element of the heap as you build it, because at some point you can reach a situation where the outcome cannot change any more and the rest of the computation can be dropped. But this would probably not make the algorithm faster in the general case, and maybe not even in special cases. So you answered your own question correctly.
It depends on whether you want the most efficient method or the easiest one to write.
1) If you know that all numbers will be from 0 to 1000, you just make an array of 1001 zeros (occurrence counts), loop through your array and increment the right occurrence slot. Then you sort these counts and select the N-th largest value.
2) You keep a "bag" of unique items: you loop through your numbers and check whether each number is in the bag; if not, you add it, and if it is, you just increment its occurrence count. Then you pick the N-th most frequent number from it.
The bag can be a linear array, a BST or a dictionary (hash table).
The question is "N-th most frequent", so I think you cannot avoid sorting (or a clever data structure), so the best complexity cannot be better than O(n*log(n)).
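For the dictionary flavour of the "bag", Python's collections.Counter plus its most_common method does the sorting work for you (a sketch):

from collections import Counter

def nth_most_frequent(arr, n):
    # most_common(n) returns the n (value, count) pairs with the highest counts
    return Counter(arr).most_common(n)[-1][0]

print(nth_most_frequent([1, 1, 1, 2, 2, 3], 2))    # 2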
I've just written a method in Java 8. This is not an efficient solution:
Create a frequency map of the elements
Sort the map entries by value in reverse order
Skip the first (N-1) entries, then take the first remaining one
// n is the rank, 1-based: n = 1 returns the most frequent element
private static Integer findMostNthFrequentElement(int[] inputs, int n) {
    return Arrays.stream(inputs).boxed()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .skip(n - 1).findFirst().get().getKey();
}
