Algorithm for top K stock in electronic exchange - algorithm

You work in an electronic exchange. Throughout the day, you receive ticks (trading data), each consisting of a product name and its traded volume. E.g.: {name: vodafone, volume: 20}
What data structure will you maintain if:
You have to tell top k products traded by volume at end of day.
You have to tell top k products traded by volume throughout the day.
What's the most efficient solution that you can think of?
The most efficient solution I could think of was to use a heap and a map for both situations:
a heap to store stocks by decreasing volume (updating - O(log n), getTopK - O(k))
a map to track each stock's volume (updating - O(1))

What you're looking for is a kind of map or dictionary which supports the following queries:
Add(key, x): add x to the total for that key, creating a new entry if it doesn't already exist.
GetKLargest(k): return the keys/totals for the k largest entries.
Let's say Q is the number of queries, and n is the number of distinct keys. We should assume that Q is much larger than n; choosing the NYSE as an example, there are a few thousand stocks traded, and a few million trades per day.
In the first scenario we assume that there are a large number of Add queries followed by one GetKLargest query. Since the cost of the Add query dominates, we can use a hashtable so that Add takes O(1) time, and then at the end of the day we can do GetKLargest in O(n log k) time using a priority queue of size k; note that we don't need to sort the whole key-set in O(n log n) time just to find the k largest elements. The total cost of answering Q queries is O(Q + n log k).
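For the first scenario, here is a minimal Python sketch of the hashtable-plus-end-of-day-heap idea (function names such as add_tick and top_k_end_of_day are illustrative, not from the question):

import heapq
from collections import defaultdict

totals = defaultdict(int)            # product name -> cumulative traded volume

def add_tick(name, volume):
    # O(1) expected: just bump the running total in the hash map
    totals[name] += volume

def top_k_end_of_day(k):
    # O(n log k): heapq.nlargest keeps a bounded heap of size k internally
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

add_tick("vodafone", 20)
add_tick("vodafone", 35)
add_tick("bt", 10)
print(top_k_end_of_day(1))           # [('vodafone', 55)]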
In the second scenario, we assume there could be a large number of both kinds of query. The cost of either query could dominate. A good option is to use an order statistic tree, which supports Add in O(log n) time, and GetKLargest in O(k log n) time. To look up a company by name in the tree requires a separate index, which can be maintained as a hashtable. The total cost is O(Qk log n) in the worst case.
If k is fixed or has a fixed limit, we can do better: keep the totals in a hashtable, but also maintain a priority queue of the current top k elements alongside. The cost of the Add query is now O(log k) because of maintaining the priority queue; to do this efficiently we need the map to also store the current index of each company in the priority queue, if it's there, otherwise searching the priority queue for the right company is O(k). The cost of GetKLargest is O(k) since we just output the contents of the priority queue. (The problem doesn't say we need to output them in order. If we do, then we can use a sorted array instead of a heap for the priority queue, and Add takes O(k) time.)
In this case, the total cost of answering Q queries is O(Qk). Note that this only works if we know in advance the maximum value of k that could be queried, before the query arrives; otherwise we don't know how big to make the priority queue.
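If k is fixed, the sorted-array variant mentioned above can be sketched in a few lines; this is only a rough illustration (it re-sorts k+1 names per Add for brevity, where a true sorted insert would be O(k)):

from collections import defaultdict

class TopKTracker:
    def __init__(self, k):
        self.k = k
        self.totals = defaultdict(int)   # company -> running volume
        self.top = []                    # at most k company names, largest total first

    def add(self, name, volume):         # roughly O(k) per tick
        self.totals[name] += volume
        if name in self.top:
            self.top.remove(name)
        self.top.append(name)
        self.top.sort(key=lambda n: self.totals[n], reverse=True)
        del self.top[self.k:]            # keep only the k best

    def get_k_largest(self):             # O(k): just read off the maintained list
        return [(n, self.totals[n]) for n in self.top]

A company that falls out of the list can only re-enter when it receives another Add, at which point it is reconsidered, so the list always holds the true top k.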

Related

Range search with KNN on two different dimensions

I've a few million records (which are updated often) with 2 properties:
Timestamp
Popularity score
I'm looking for a data structure (maybe some metric tree?) that can do fast range search on 1 dimension (e.g. all records greater than a timestamp value), and locate top K records that fall within that range on the other dimension (i.e. popularity score). In other words, I can phrase this query as "Find top K popular records with timestamp greater than T".
I currently have a naive implementation where I filter the N records in linear time complexity and then identify the top K records using a partial sorting algorithm. But this is not fast enough given the number of concurrent users we need to support.
I'm not super familiar with KD trees, but I see that some popular implementations support both range searches and finding K nearest neighbors, but my requirements are a bit peculiar here -- so I'm wondering if there is a way to do this faster, at the expense of maybe additional indexing overhead.
If you invest in an initial sort of a list of (record_name, timestamp) tuples by timestamp, and create a dictionary with the record names as keys and (popularity_score, timestamp_list_idx) tuples as values, you will be able to:
Perform binary search for a particular timestamp in O(log n)
Extract the greater-than values in O(1) since the array is sorted
Extract the matching popularity score in O(1) since they are in a dictionary
Update a record's popularity score in O(1) due to the dictionary
Update a particular timestamp in O(1) by pulling the index of the record from the tuple in the dictionary value
Suppose you have m records within the wanted timestamp range; you can then either:
Generate a max heap from them by popularity, which takes O(m), and then perform k pops from that heap in O(k log m), since we need to repopulate the root after every pop. This means the query takes O(m + k log m); assuming k << m, this runs in O(m).
Iterate over the m records with a list of size k to keep track of the top k most popular records. After passing over all m records you will have the top k in the list. This takes O(m) as well.
Method 1 takes a little more time than method 2 in terms of complexity, but if you suddenly want to know the (k+1)-th most popular record, you can just pop another item from the heap instead of passing over the entire m records again with a (k+1)-long list.
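A hedged sketch of the second method, assuming the records are kept in two parallel lists sorted by timestamp (the names timestamps, payloads and top_k_after are made up for illustration):

import bisect
import heapq

timestamps = [1, 3, 7, 9]                       # kept sorted (by the initial sort)
payloads   = [(0.4, "a"), (0.9, "b"), (0.1, "c"), (0.7, "d")]   # (popularity, record_name)

def top_k_after(t, k):
    # O(log n) binary search for the first timestamp > t, then one O(m) pass
    # over the m matching records; heapq.nlargest keeps only a size-k heap
    start = bisect.bisect_right(timestamps, t)
    return heapq.nlargest(k, payloads[start:])  # the slice copy could be avoided with an iterator

print(top_k_after(2, 2))                        # [(0.9, 'b'), (0.7, 'd')]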

How to use balanced binary trees to solve this challenge?

            1(70)
           /     \
          /       \
      2(40)       5(10)
      /    \           \
     /      \           \
  3(60)    4(80)       6(20)
   /   \
  /     \
7(30)   8(50)
This is for an online challenge (not live contest). I don't need someone to solve for me, just to push in right direction. Trying to learn.
Each node has a unique ID, and no two people have the same salary. Person #1 has salary $70, person #7 has a $30 salary, for example. The tree structure denotes who supervises whom. The question is: who has the kth lowest salary among a person's subordinates?
For example I choose person #2. Who is 2nd lowest among subordinates? #2's subordinates are 3, 4, 7, 8. 2nd lowest salary is $50 belonging to person #8.
There are many queries, so the structure must be efficient.
I thought about this problem and researched data structures. A binary tree seems like a good idea, but I need help.
For example I think ideal structure look like, for person #2:
       2(40)
       /    \
      /      \
  7(30)     3(60)
            /    \
           /      \
       8(50)     4(80)
Every node is a subordinate of #2, and every left branch has a lower salary than the right. If I store how many children are under each node, I can get the kth lowest.
For example: from #2, the left branch has 1 node and the right branch has 3 nodes. So 2nd lowest minus 1 means I now want the 1st lowest in the right branch.
Moving to #3, the 1st lowest points to #8 with $50, which is correct.
My question:
Is this approach as I describe it a good one? Is it a valid approach?
I am having trouble figuring out how to construct this kind of tree. I think I can build them recursively, but it's hard to figure out how to gather all of a person's subordinates into a new tree sorted by salary. I need a little guidance.
Here's a solution that uses O(n log^2 n + q log n) time and O(n log^2 n) space (not the best on the latter count, but probably good enough given the limits).
Implement a purely functional sorted list (as an augmented binary search tree) with the following operations and some way to iterate.
EmptyList() -> returns the empty list
Insert(list, key) -> returns the list where |key| has been inserted into |list|
Length(list) -> returns the length of the list
Get(list, k) -> returns the element at index |k| in |list|
On top of these operations, implement an operation
Merge(list1, list2) -> returns the union of |list1| and |list2|
by inserting the elements of the shorter list into the longer.
Now do the obvious thing: traverse the employee hierarchy from leaves to root, setting the ordered list for each employee to the appropriate merge of her subordinate lists, and answer the queries.
Analysis (sketch)
Each query takes O(log n) time. The interesting part of the analysis pertains to the preprocessing.
The cost of preprocessing is dominated by the cost of calling Insert(), specifically from Merge(), since there are n other insertions. Each insertion takes O(log n) time and costs O(log n) space (measuring in words).
What keeps the preprocessing from being quadratic is an implicit heavy path decomposition. Every time we merge two lists, neither list is merged subsequently. Since the shorter list is inserted into the longer, every time a key is inserted into a list, that list is at least twice as long as the list into which that key was previously inserted. It follows that each key is the subject of at most lg n insertions, which suffices to establish a bound of O(n log n) insertions overall and thus the claimed resource bounds.
Here is one possible solution. For each node, we will construct an array of all children's values of that node and keep this in sorted order. The result we are looking for is a dictionary of the form
{ 1 : [10, 20, 30, 40, 50, 60, 80],
  2 : [30, 50, 60, 80],
  ...
}
Once we have this, to query for any node for the ith lowest salary, just take the ith element of the array. The total time to do all the queries is O( q ) where q is the number of queries.
How do we construct this? Assuming you have a pointer to the root node, you can recursively construct the sorted salaries for each child. Store those values in the result. Make a copy of each child's array, and insert each child's salary into the child's copied array. Use binary search to find the position, since each array is sorted. Now you have k sorted arrays, you merge them to get a sorted array. If you are merging two arrays, this can be done in linear time. Simply loop, picking the first element of the array that is smaller each time.
For the case where each node has 2 children, merging the two children's arrays is O(n) time. Inserting the salary of each node into its corresponding array is O(log(n)) per node since we use binary search. Copying the children's arrays is O(n), and there are n nodes, so we have O(n^2) total time for pre-processing.
Total run time is O(n^2 + q)
What if we cannot assume each node has at most 2 children? Then to merge the arrays, use a k-way merge with a heap. This runs in O(n log(k)) where k is the number of arrays to merge, since we pop from the heap once per element, and each heap operation takes O(log(k)) when there are k arrays. k <= n, so we can simplify this to O(n log(n)). So the total running time is unchanged.
The space complexity of this solution is O(n^2).
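A rough Python sketch of this pre-processing (the node shape and function names below are assumptions for illustration, not part of the challenge statement):

import heapq

# assumed node shape: {"id": int, "salary": int, "children": [child nodes]}
def build_sorted_subordinates(node, result):
    # post-order traversal: build each child's array first, then merge them
    child_arrays = []
    for child in node["children"]:
        build_sorted_subordinates(child, result)
        # the child's subordinates plus the child itself, still in sorted order
        # (the answer uses a binary-search insert here; a linear merge keeps the sketch short)
        child_arrays.append(list(heapq.merge(result[child["id"]], [child["salary"]])))
    result[node["id"]] = list(heapq.merge(*child_arrays))   # k-way merge of sorted arrays
    return result

def kth_lowest(result, person_id, k):
    return result[person_id][k - 1]      # O(1) per query after pre-processing

# a tiny hypothetical hierarchy consistent with the example query on person #2
tree = {"id": 2, "salary": 40, "children": [
    {"id": 3, "salary": 60, "children": [
        {"id": 7, "salary": 30, "children": []},
        {"id": 8, "salary": 50, "children": []}]},
    {"id": 4, "salary": 80, "children": []}]}
result = build_sorted_subordinates(tree, {})
print(kth_lowest(result, 2, 2))          # 50, i.e. person #8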
The question has two parts: first, finding the specified person, and then finding the kth subordinate.
Since the tree is not ordered by id, finding the specified person by id requires walking the whole tree until the specified id is found. To speed up this part, we can build a hash map that would allow us to find the person's node by id in O(1) time; this requires O(n) space and set-up time.
Then, to find the subordinate with the kth lowest salary, we need to search the subtree. Since it's not ordered by salary, we have to scan the whole subtree to find the kth lowest salary. This could be done using an array or a heap (putting the subtree nodes into the array or heap). This second part would take O(m log k) time using a heap to keep the lowest k items, where m is the number of subordinates, and requires O(k) space. This should be acceptable if m (the number of subordinates of the specified person) and k are small.
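A hedged sketch of that approach (same assumed node shape as above): a hash map for id lookup plus a bounded max-heap while scanning the subtree.

import heapq

def index_by_id(root):
    # one-time O(n) pass: id -> node, so queries need not search the whole tree
    index, stack = {}, [root]
    while stack:
        node = stack.pop()
        index[node["id"]] = node
        stack.extend(node["children"])
    return index

def kth_lowest_salary(index, person_id, k):
    # scan the m subtree nodes, keeping only the k smallest salaries in a
    # max-heap (values negated, since heapq is a min-heap): O(m log k) time, O(k) space
    heap, stack = [], list(index[person_id]["children"])
    while stack:
        node = stack.pop()
        stack.extend(node["children"])
        heapq.heappush(heap, -node["salary"])
        if len(heap) > k:
            heapq.heappop(heap)          # discard the largest of the k+1 kept so far
    return -heap[0] if len(heap) == k else None   # None if fewer than k subordinates

index = index_by_id(tree)                # reusing the small example tree from the sketch above
print(kth_lowest_salary(index, 2, 2))    # 50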

Design an algorithm to return number of unique users between given time interval

A Web Server can receive millions of user login request. A user can login multiple times. Design an optimal algorithm/data-structure in terms of time complexity to return total number of unique users between given time intervals.
For e.g : Count total number of unique users between interval t1 & t2 and t2 & t3. Also think about returning total count for overlapping intervals (t1 = 10am, t2 = 10:15am, t3 = 10:30am, return total number of users between 10:10am to 10:20am)
Below is what I am proposing; I would appreciate people's comments.
IMO combination of Hashmap & min heap would be good as an optimal solution.
Hashmap - the key is the user-id and the value is a pointer to the corresponding node in the min heap.
Min heap - last-logged-in time as the key and the user-id as the value. The root will be the user whose log-in time is the oldest. Also store at the root the count of the total number of nodes in the min heap so we can quickly return the count.
When a user logs in, look up the hashmap with the user-id as the key.
a) If there is no match, then insert the new user into the hashmap and a new node for the user into the min heap, and increment the count stored with the root node of the min heap.
b) Else it's an old user: update its last-logged-in value and do not increment the count in the root node of the min heap.
Whenever we want to find out the unique users logged in between t2-t1:
a. Extract the min (root) from the heap and check whether current time - last-logged-in time > t2-t1 mins. If it is greater, delete the value from the hashmap and the min heap.
b. Repeat the above step (a) until the min element of the heap satisfies current time - last-logged-in time <= t2-t1 mins.
c. Return the value of the count from the root node of the min heap.
But I am not able to nail down the algorithm for overlapping intervals.
I think there is a much easier way to do this. Consider storing all the data in a balanced binary search tree, where the keys are the login times and the values are the list of all people who logged in at that time (assuming that there can be multiple logins at exactly the same moment in time). From there, you can find all people who logged in during an interval between time T1 and T2 by finding the smallest node in the BST whose time is greater than or equal to T1, then continuously computing the inorder successor of that node until you arrive at a node that is at time strictly after time T2.
Doing a lookup in the BST to find the first node at time greater than or equal to T1 will take time O(log n) in a balanced BST, and computing the inorder successor many times will take time O(k), where k is the total number of matches you report. This takes a total of O(log n + k) time. Since you have to spend at least O(k) time reporting all matching logins in any algorithm, this has a very low overhead.
Alternatively, if you are getting the data in a stream from the server (i.e. new logins are always happening as time evolves), you can just use a standard array to hold all the requests. You can just append new requests to the end of the array. Since time always moves forward, this means that the array is always sorted, so you can use binary search to find the start point of the range. Assuming the data isn't pathologically constructed, you could also use interpolation search to make the lookup times expected O(log log n) rather than O(log n), giving expected O(log log n + k) lookup times when finding k total elements.
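A minimal sketch of that append-only variant (the function names and the use of a set to count distinct users are my additions, not part of the original proposal):

import bisect

login_times = []      # stays sorted because time only moves forward
login_users = []      # parallel array of user ids

def record_login(t, user_id):
    login_times.append(t)
    login_users.append(user_id)

def unique_users_between(t1, t2):
    # O(log n) binary search for the range [t1, t2], then count the
    # distinct users among the k logins that fall inside it
    lo = bisect.bisect_left(login_times, t1)
    hi = bisect.bisect_right(login_times, t2)
    return len(set(login_users[lo:hi]))

record_login(10.00, "alice")      # times shown as decimal hours for brevity
record_login(10.05, "bob")
record_login(10.20, "alice")
print(unique_users_between(10.00, 10.15))   # 2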
As for handling overlapping ranges - there are standard algorithms for taking a collection of ranges and merging them together into a minimal number of nonoverlapping ranges. You can always apply one of those techniques prior to doing lookups to handle this case.
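One of those standard techniques is simply sorting the ranges and sweeping, merging anything that overlaps, e.g.:

def merge_ranges(ranges):
    # collapse (start, end) ranges into a minimal set of non-overlapping ranges
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:     # overlaps the previous range
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_ranges([(10.00, 10.15), (10.10, 10.20), (10.30, 10.45)]))
# [(10.0, 10.2), (10.3, 10.45)]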
Hope this helps!

Sort object pairs into separate bins

I have N object pairs (master copy/slave copy) all with the same size. I wish to distribute the copies among M bins each with a different capacity so that no bin will include both the master and slave copy.
What's the most efficient algorithm? And more importantly what's the most efficient algorithm to find out if there is a possible solution for a given input (without actually generating the solution)?
Hard to imagine anything better than brute force: track the M bins in a priority queue by descending remaining capacity, and add each object pair to the first two bins in the queue; rebalance the queue and repeat. A solution exists if the total capacity of the M bins >= 2*N.
That would seem to be complexity O(N * log M)
Note: For exactly three bins, no solution exists for N > M1 + M2, where Mn is the capacity of bin n with bins indexed 0..M-1 in descending order of capacity, regardless of the capacity of M0.
Likewise for exactly 2 bins, solutions exist only for N <= M1.
A simple solution is:
Sort the M buckets in descending order according to their capacity: x1, x2, ..., xm
Pick the topmost two buckets, assign one master/slave pair to them (one copy in each), decrement the available capacities of the two buckets and rearrange the buckets. You can use a heap to keep track of the buckets, and the overall complexity is O(N log M).
Keep repeating until all the objects are allocated.
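A hedged sketch of that greedy step using Python's heapq (a min-heap, so remaining capacities are stored negated; all names are illustrative):

import heapq

def assign_pairs(capacities, n_pairs):
    # greedily place each master/slave pair into the two bins with the most
    # remaining capacity; returns (bin_for_master, bin_for_slave) per pair,
    # or None if some pair cannot be split across two distinct non-full bins
    heap = [(-cap, idx) for idx, cap in enumerate(capacities) if cap > 0]
    heapq.heapify(heap)
    placements = []
    for _ in range(n_pairs):
        if len(heap) < 2:
            return None
        cap1, bin1 = heapq.heappop(heap)
        cap2, bin2 = heapq.heappop(heap)
        placements.append((bin1, bin2))
        for cap, idx in ((cap1 + 1, bin1), (cap2 + 1, bin2)):
            if cap < 0:                   # the bin still has room; push it back
                heapq.heappush(heap, (cap, idx))
    return placements

print(assign_pairs([3, 2, 1], 3))         # [(0, 1), (0, 1), (0, 2)]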

Find the largest k numbers in k arrays stored across k machines

This is an interview question. I have K machines, each of which is connected to one central machine. Each of the K machines has an array of 4-byte numbers in a file. You can use any data structure to load those numbers into memory on those machines, and they fit. Numbers are not unique across the K machines. Find the K largest numbers in the union of the numbers across all K machines. What is the fastest way I can do this?
(This is an interesting problem because it involves parallelism. As I haven't encountered parallel algorithm optimization before, it's quite amusing: you can get away with some ridiculously high-complexity steps, because you can make up for it later. Anyway, onto the answer...)
> "What is the fastest I can do this?"
The best you can do is O(K). Below I illustrate both a simple O(K log(K)) algorithm, and the more complex O(K) algorithm.
First step:
Each computer needs enough time to read every element. This means that unless the elements are already in memory, one of the two bounds on the time is O(largest array size). If for example your largest array size varies as O(K log(K)) or O(K^2) or something, no amount of algorithmic trickery will let you go faster than that. Thus the actual best running time is O(max(K, largestArraySize)) technically.
Let us say the arrays have a maximum length of N. With the above caveat, we are allowed to assume N <= K: since each computer has to look at each of its elements at least once anyway (O(N) preprocessing per computer), each computer can pick out its largest K elements (this is known as finding order statistics, and there are linear-time selection algorithms for it). Furthermore, we can do so for free (since it is also O(N)).
Bounds and reasonable expectations:
Let's begin by thinking of some worst-case scenarios, and estimates for the minimum amount of work necessary.
One minimum-work-necessary estimate is O(K*N/K) = O(N), because we need to look at every element at the very least. But, if we're smart, we can distribute the work evenly across all K computers (hence the division by K).
Another minimum-work-necessary estimate is O(N): if one array is larger than all elements on all other computers, we return the set.
We must output all K elements; this is at least O(K) to print them out. We can avoid this if we are content merely knowing where the elements are, in which case the O(K) bound does not necessarily apply.
Can this bound of O(N) be achieved? Let's see...
Simple approach - O(NlogN + K) = O(KlogK):
For now let's come up with a simple approach, which achieves O(NlogN + K).
Consider the data arranged like so, where each column is a computer, and each row is a number in the array:
computer:  A    B    C    D    E    F    G
      10  (o)                 (o)
       9   o   (o)            (o)
       8   o        (o)
       7   x         x             (x)
       6   x         x        (x)
       5   x    ..........
       4   x    x     ..
       3   x    x    x    .    .
       2   x    x          .    .
       1   x    x            .
       0   x    x            .
You can also imagine this as a sweep-line algorithm from computational geometry, or an efficient variant of the 'merge' step from mergesort. The elements in parentheses represent the elements with which we'll initialize our potential "candidate solution" (in some central server). The algorithm will converge on the correct o answers by dumping the (x) entries in favour of the two unselected o's.
Algorithm:
All computers start as 'active'.
Each computer sorts its elements. (parallel O(N logN))
Repeat until all computers are inactive:
Each active computer finds the next-highest element (O(1) since sorted) and gives it to the central server.
The server smartly combines the new elements with the old K elements, and removes an equal number of the lowest elements from the combined set. To perform this step efficiently, we have a global priority queue of fixed size K. We insert the new potentially-better elements, and bad elements fall out of the set. Whenever an element falls out of the set, we tell the computer which sent that element to never send another one. (Justification: This always raises the smallest element of the candidate set.)
(sidenote: Adding a callback hook to falling out of a priority queue is an O(1) operation.)
We can see graphically that this will perform at most 2K*(findNextHighest_time + queueInsert_time) operations, and as we do so, elements will naturally fall out of the priority queue. findNextHighest_time is O(1) since we sorted the arrays, so to minimize 2K*queueInsert_time, we choose a priority queue with an O(1) insertion time (e.g. a Fibonacci-heap based priority queue). This gives us an O(log(queue_size)) extraction time (we cannot have O(1) insertion and extraction); however, we never need to use the extract operation! Once we are done, we merely dump the priority queue as an unordered set, which takes O(queue_size)=O(K) time.
We'd thus have O(N log(N) + K) total running time (parallel sorting, followed by O(K)*O(1) priority queue insertions). In the worst case of N=K, this is O(K log(K)).
The better approach - O(N+K) = O(K):
However I have come up with a better approach, which achieves O(K). It is based on the median-of-median selection algorithm, but parallelized. It goes like this:
We can eliminate a set of numbers if we know for sure that there are at least K (not strictly) larger numbers somewhere among all the computers.
Algorithm:
Each computer finds the sqrt(N)th highest element of its set, and splits the set into elements < and > it. This takes O(N) time in parallel.
The computers collaborate to combine those statistics into a new set, and find the K/sqrt(N)th highest element of that set (let's call it the 'superstatistic'), and note which computers have statistics < and > the superstatistic. This takes O(K) time.
Now consider all elements less than their computer's statistic, on computers whose statistic is less than the superstatistic. Those elements can be eliminated. This is because the elements greater than their computer's statistic, on computers whose statistic is larger than the superstatistic, are a set of K elements which are larger. (See the diagram below.)
Now, the computers with the uneliminated elements evenly redistribute their data to the computers who lost data.
Recurse: you still have K computers, but the value of N has decreased. Once N is less than a predetermined constant, use the previous algorithm I mentioned in "simple approach - O(NlogN + K)"; except in this case, it is now O(K). =)
It turns out that the reductions are O(N) total (amazingly not order K), except perhaps the final step, which might be O(K). Thus this algorithm is O(N+K) = O(K) total.
Analysis and simulation of O(K) running time below. The statistics allow us to divide the world into four unordered sets, represented here as a rectangle divided into four subboxes:
              ------N------
              N^.5
              ________________
     /       |      s         |   <- computer
     |       | #=K  s REDIST. |   <- computer
     |       |      s         |   <- computer
     | K/N^.5|------S---------|   <- computer
     |       |      s         |   <- computer
   K |       |      s         |   <- computer
     |       |      s ELIMIN. |   <- computer
     |       |      s         |   <- computer
     |       |      s         |   <- computer
     \       |______s_________|   <- computer
LEGEND:
s=statistic, S=superstatistic
#=K -- set of K largest elements
(I'd draw the relation between the unordered sets of rows and the s-column here, but it would clutter things up; see the addendum.)
For this analysis, we will consider N as it decreases.
At a given step, we are able to eliminate the elements labelled ELIMIN; this removes area from the rectangle representation above, reducing the problem size from K*N to K*N - (K - K/√N)(N - √N), which hilariously simplifies to K(2√N - 1).
Now, the computers with the uneliminated elements redistribute their data (REDIST rectangle above) to the computers with eliminated elements (ELIMIN). This is done in parallel, where the bandwidth bottleneck corresponds to the length of the short side of REDIST (because they are outnumbered by the ELIMIN computers which are waiting for their data). Therefore the data will take as long to transfer as the long side of the REDIST rectangle (another way of thinking about it: K/√N * (N-√N) is the area, divided by K/√N data-per-unit-time, resulting in O(N-√N) time).
Thus at each step of size N, we are able to reduce the problem size to K(2√N-1), at the cost of performing N + 3K + (N-√N) work. We now recurse. The recurrence relation which will tell us our performance is:
T(N) = 2N+3K-√N + T(2√N-1)
The decimation of the subproblem size is much faster than the normal geometric series (being √N rather than something like N/2 which you'd normally get from common divide-and-conquers). Unfortunately neither the Master Theorem nor the powerful Akra-Bazzi theorem work, but we can at least convince ourselves it is linear via a simulation:
>>> from math import sqrt
>>> def T(n, k=None):
...     return 1 if n < 10 else sqrt(n)*(2*sqrt(n)-1) + 3*k + T(2*sqrt(n)-1, k=k)
>>> f = (lambda x: x)
>>> (lambda n: T((10**5)*n, k=(10**5)*n)/f((10**5)*n) - T(n, k=n)/f(n))(10**30)
-3.552713678800501e-15
The function T(N) is, at large scales, a multiple of the linear function x, hence linear (doubling the input doubles the output). This method, therefore, almost certainly achieves the bound of O(N) we conjecture. Though see the addendum for an interesting possibility.
...
Addendum
One pitfall is accidentally sorting. If we do anything which accidentally sorts our elements, we will incur a log(N) penalty at the least. Thus it is better to think of the arrays as sets, to avoid the pitfall of thinking that they are sorted.
Also, we might initially think that the constant 3K of work at each step would force a total of about 3K·log(log(N)) work. But the -1 has a powerful role to play in the decimation of the problem size. It is very slightly possible that the running time is actually something above linear, but definitely much smaller than even N·log(log(log(log(N)))). For example it might be something like O(N*InverseAckermann(N)), but I hit the recursion limit when testing.
The O(K) is probably only due to the fact that we have to print them out; if we are content merely knowing where the data is, we might even be able to pull off an O(N) (e.g. if the arrays are of length O(log(K)) we might be able to achieve O(log(K)))... but that's another story.
The relation between the unordered sets is as follows (it would have cluttered things up in the explanation above):
           _
          / \
 (.....) > s > (.....)
           s
 (.....) > s > (.....)
           s
 (.....) > s > (.....)
          \_/
           v
           S
           v
           _
          / \
 (.....) > s > (.....)
           s
 (.....) > s > (.....)
           s
 (.....) > s > (.....)
          \_/
Find the k largest numbers on each machine. O(n*log(k))
Combine the results (on a centralized server, if k is not huge; otherwise you can merge them in a tree hierarchy across the server cluster).
Update: to make it clear, the combine step is not a sort. You just pick the top k numbers from the results. There are many ways to do this efficiently. You can use a heap, for example, pushing the head of each list. Then you remove the head from the heap and push the next element from the list the removed element belonged to. Doing this k times gives you the result. All this is O(k*log(k)).
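A small sketch of that combine step (assuming each machine has already sent its k largest numbers in descending order; combine_top_k is an illustrative name):

import heapq

def combine_top_k(per_machine_lists, k):
    # the heap holds one 'head' per list (negated, since heapq is a min-heap);
    # pop the largest, then push the next element from the same list: O(k log k)
    heap = [(-lst[0], i, 0) for i, lst in enumerate(per_machine_lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        neg_val, i, j = heapq.heappop(heap)
        result.append(-neg_val)
        if j + 1 < len(per_machine_lists[i]):
            heapq.heappush(heap, (-per_machine_lists[i][j + 1], i, j + 1))
    return result

print(combine_top_k([[9, 4, 1], [8, 7, 2], [6, 5, 3]], 3))   # [9, 8, 7]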
Maintain a min heap of size 'k' in the centralized server.
Initially insert first k elements into the min heap.
For the remaining elements
Check(peek) for the min element in the heap (O(1))
If the min element is less than the current element, then remove the min element from the heap and insert the current element.
Finally min heap will have 'k' largest elements
This would require O(n log k) time.
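The same idea in a few lines of Python on the centralized server (the incoming numbers are shown as a single stream for illustration):

import heapq

def k_largest(stream, k):
    # min-heap of size k: the root is always the smallest of the current top k,
    # so each new element costs one comparison plus at most O(log k) heap work
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:                 # beats the current k-th largest
            heapq.heapreplace(heap, x)    # pop the min and push x in one step
    return sorted(heap, reverse=True)

print(k_largest([5, 1, 9, 3, 14, 7, 2], 3))   # [14, 9, 7]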
I would suggest something like this:
take the k largest numbers on each machine in sorted order, O(Nk) where N is the number of elements on each machine
sort each of these arrays of k elements by their largest element (you will get k arrays of k elements sorted by largest element: a square k-by-k matrix)
take the "upper triangle" of the matrix made of these k arrays of k elements (the k largest elements will be in this upper triangle)
the central machine can now find the k largest of these k(k+1)/2 elements
1. Let the machines find their k largest elements, copy them into a data structure (a stack), sort it and pass it on to the central machine.
2. At the central machine, receive the stacks from all the machines. Find the greatest of the elements at the tops of the stacks.
3. Pop the greatest element from its stack and copy it to the 'TopK list'. Leave the other stacks intact.
4. Repeat steps 2 and 3, k times, to get the top K numbers.
1) sort the items on every machine
2) use a binary heap of size k on the central machine
a) populate the heap with the first (max) element from each machine
b) extract the top element, and put back into the heap the next element from the machine whose element you extracted (of course, heapify your heap after the element is added).
Sorting will be O(N log(N)) where N is the size of the largest array on the machines.
O(k) - to build the heap
O(k log(k)) - to extract and repopulate the heap k times.
Overall complexity is max(O(k log(k)), O(N log(N)))
I would think the MapReduce paradigm would be well suited to a task like this.
Every machine runs its own independent map task to find the maximum value in its array (depending on the language used), and this will probably be O(N) complexity for N numbers on each machine.
The reduce task compares the result from the individual machines' outputs to give you the largest k numbers.

Resources