Min-heap vs Self Balancing BST to maintain top N Elements - data-structures

We are using a Map<CompanyId, CompanyObject> to store companies, and each CompanyObject has a variable totalStocksTradedToday by which a min-Heap<CompanyObject> maintains the top k companies with the most trading in the market that day. Live input of stocks being traded arrives for companies, and we update the min-heap accordingly. The time complexity of one min-heap update is
O(k) = O(1) (find the min element) + O(k) (search whether the company already exists in the heap) + O(log k) (insert into the min-heap)
Is there a way to improve on this time complexity? What if we use a self-balancing BST in place of the min-heap? The time complexity would then be
O(log k) = O(log k) (find min) + O(log k) (search) + O(log k) (insertion)
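To make the BST idea concrete, here is a minimal sketch (names such as TopKCompanies and recordTrade are mine, not from the question) using std::set, a red-black tree, as the self-balancing BST. Ordering entries by (tradedToday, companyId) and keeping a side map of running totals makes each update O(log k), with no linear scan:

```cpp
#include <cassert>
#include <set>
#include <unordered_map>
#include <utility>

// Hypothetical sketch: std::set (a red-black tree) plays the role of the
// self-balancing BST. The side map pinpoints a company's current key, so we
// never have to scan the tree to find it.
struct TopKCompanies {
    int k;
    std::set<std::pair<long, int>> tree;      // top-k (tradedToday, companyId)
    std::unordered_map<int, long> traded;     // every company's running total

    explicit TopKCompanies(int k) : k(k) {}

    void recordTrade(int companyId, long shares) {
        long before = traded[companyId];
        long after = before + shares;
        traded[companyId] = after;
        if (tree.count({before, companyId})) {        // already in the top k
            tree.erase({before, companyId});          // O(log k)
            tree.insert({after, companyId});          // O(log k)
        } else if ((int)tree.size() < k) {
            tree.insert({after, companyId});
        } else if (after > tree.begin()->first) {     // beats current minimum
            tree.erase(tree.begin());                 // evict the minimum
            tree.insert({after, companyId});
        }
    }

    bool inTopK(int companyId) const {
        auto it = traded.find(companyId);
        return it != traded.end() && tree.count({it->second, companyId}) > 0;
    }
};
```

The side map is what removes the O(k) search term: instead of scanning for the company, we reconstruct its exact key and erase it in O(log k).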

Related

How to choose the least number of weights to get a total weight in O(n) time

If there are n unsorted weights and I need to find the least number of weights whose sum is at least W,
how do I find them in O(n)?
This problem has several solution methods:
Method 1 - Sorting - O(nlogn)
I guess the most trivial approach would be to sort in descending order and then take the first k elements that give a sum of at least W. The time complexity, though, will be O(nlogn).
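A minimal sketch of Method 1 (the function name is mine):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Method 1 as code: sort descending, then greedily take elements until the
// running sum reaches W. The sort dominates: O(n log n).
int minWeightsBySorting(std::vector<int> weights, int W) {
    std::sort(weights.begin(), weights.end(), std::greater<int>());
    int sum = 0, count = 0;
    for (int w : weights) {
        if (sum >= W) break;
        sum += w;
        ++count;
    }
    return count;
}
```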
Method 2 - Max Heap - O(n + klogn)
Another method would be to use a max-heap.
Creating the heap takes O(n); we then extract elements until we reach a total sum of at least W. Each extraction takes O(log n), so the total time complexity is O(n + k log n), where k is the number of elements we had to extract from the heap.
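A sketch of Method 2 (the function name is mine), using std::make_heap for the O(n) heap construction and std::pop_heap for the O(log n) extractions:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Method 2 as code: build a max-heap in O(n) with make_heap, then pop the
// largest elements (O(log n) each) until the running sum reaches W.
int minWeightsByMaxHeap(std::vector<int> weights, int W) {
    std::make_heap(weights.begin(), weights.end());    // O(n) heapify
    int sum = 0, count = 0;
    while (sum < W && !weights.empty()) {
        std::pop_heap(weights.begin(), weights.end()); // max moves to the back
        sum += weights.back();
        weights.pop_back();
        ++count;
    }
    return count;
}
```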
Method 3 - Using Min Heap - O(nlogk)
Adding this method that JimMischel suggested in the comments below.
Create a min-heap from the first elements of the list whose sum reaches at least W. Then iterate over the remaining elements, and whenever an element is greater than the minimum (the heap top), swap them.
At this point we might hold more elements than we actually need to reach W, so we simply extract minimums until dropping another would take us below our limit.
find_min_set(A, W)
    currentW = 0
    heap H                      // create an empty min-heap
    for each Elem in A
        if (currentW < W)
            H.add(Elem)
            currentW += Elem
        else if (Elem > H.top())
            currentW += (Elem - H.top())
            H.pop()
            H.add(Elem)
    while (currentW - H.top() >= W)
        currentW -= H.top()
        H.pop()
This method might be even faster in practice, depending on the relation between k and n. See when theory meets practice.
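The pseudocode above can be written directly with std::priority_queue (a sketch; the function name is mine):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Method 3 as runnable code: a min-heap of the currently selected elements.
// Grow it until the sum reaches W, swap in any larger element, then shed
// minimums that are no longer needed. O(n log k) overall.
int minWeightsByMinHeap(const std::vector<int>& A, int W) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> H;
    long currentW = 0;
    for (int elem : A) {
        if (currentW < W) {
            H.push(elem);
            currentW += elem;
        } else if (elem > H.top()) {
            currentW += elem - H.top();   // swap out the current minimum
            H.pop();
            H.push(elem);
        }
    }
    // Drop minimums while the remaining elements still sum to at least W.
    while (!H.empty() && currentW - H.top() >= W) {
        currentW -= H.top();
        H.pop();
    }
    return (int)H.size();
}
```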
Method 4 - O(n)
The best method I could think of uses a kind of quickselect, keeping track of the total weight and always partitioning with the median as the pivot.
First, let's define a few things:
sum(A) - The total sum of all elements in array A.
num(A) - The number of elements in array A.
med(A) - The median of the array A.
find_min_set(A, W, T)
    // partition A:
    // L contains all the elements of A that are less than med(A)
    // R contains all the elements of A that are greater than or equal to med(A)
    L, R = partition(A, med(A))
    if (sum(R) == W)
        return T + num(R)
    if (sum(R) > W)
        return find_min_set(R, W, T)
    if (sum(R) < W)
        return find_min_set(L, W - sum(R), num(R) + T)
Call this method as find_min_set(A, W, 0).
Runtime Complexity:
Finding median is O(n).
Partitioning is O(n).
Each recursive call is taking half of the size of the array.
Summing it all up, we get the recurrence T(n) = T(n/2) + O(n), which is the same as the average case of quickselect: O(n).
Note: when all values are unique, both the worst-case and average complexity are indeed O(n). With possible duplicate values, the average complexity is still O(n), but the worst case is O(nlogn) when using the median-of-medians method for selecting the pivot.
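A runnable sketch of Method 4. It deviates from the pseudocode in two labeled ways: std::nth_element stands in for the pivot selection (expected O(n) per level rather than the deterministic median-of-medians), and the partition is three-way (less than / equal to / greater than the median) plus explicit base cases, so that duplicate values cannot make the recursion stall:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sketch of Method 4: quickselect-style partitioning around the median while
// tracking the weight still needed. Assumes the total sum is at least W.
int findMinSet(std::vector<int> A, long W, int T = 0) {
    if (W <= 0 || A.empty()) return T;
    if (A.size() == 1) return T + 1;
    size_t mid = A.size() / 2;
    std::nth_element(A.begin(), A.begin() + mid, A.end()); // expected O(n)
    int med = A[mid];
    std::vector<int> L, G;           // strictly less / strictly greater
    long sumG = 0, numE = 0;         // numE counts elements equal to med
    for (int x : A) {
        if (x < med) L.push_back(x);
        else if (x > med) { G.push_back(x); sumG += x; }
        else ++numE;
    }
    if (sumG >= W)                   // the heavy side alone suffices
        return findMinSet(G, W, T);
    long need = W - sumG;            // take all of G, fill with med copies
    long copies = (need + med - 1) / med;
    if (copies <= numE)
        return T + (int)G.size() + (int)copies;
    return findMinSet(L, need - numE * (long)med,
                      T + (int)G.size() + (int)numE);
}
```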

Complexity of finding the median using 2 heaps

A way of finding the median of a given set of n numbers is to distribute them between 2 heaps: a max-heap containing the lower ceil(n/2) numbers and a min-heap containing the rest. If maintained in this way, the median is the max of the first heap (averaged with the min of the second heap if n is even). Here's my C++ code that does this:
#include <iostream>
#include <queue>
#include <vector>
using namespace std;

int main() {
    priority_queue<int, vector<int>> left;                 // max-heap: lower half
    priority_queue<int, vector<int>, greater<int>> right;  // min-heap: upper half
    int n, a;
    cin >> n;                                              // n = number of items
    for (int i = 0; i < n; i++) {
        cin >> a;
        if (left.empty())
            left.push(a);
        else if (left.size() <= right.size()) {
            if (a <= right.top())
                left.push(a);
            else {
                left.push(right.top());
                right.pop();
                right.push(a);
            }
        } else {
            if (a >= left.top())
                right.push(a);
            else {
                right.push(left.top());
                left.pop();
                left.push(a);
            }
        }
    }
}
We know that the heapify operation has linear complexity. Does this mean that if we insert numbers one by one into the two heaps as in the above code, we are finding the median in linear time?
Linear-time heapify applies to building a heap from an unsorted array as a batch operation, not to building a heap by inserting values one at a time.
Consider a min-heap into which you are inserting a stream of values in increasing order. The value at the top of the heap is the smallest, so each new value trickles all the way down to the bottom of the heap. Consider just the last half of the values inserted: at that point the heap has very nearly its full height, which is log(n), so each value trickles down log(n) slots, and the cost of inserting n/2 values is O(n log(n)).
If I present a stream of values in increasing order to your median-finding algorithm, one of the things it has to do is build a min-heap from that stream, so the cost of the median finding is O(n log(n)). In fact, the max-heap is going to be doing a lot of deletes as well as insertions, but this is just a constant factor on top, so the overall complexity is still O(n log(n)).
When there is one element, the complexity of the step is Log 1 because of a single element being in a single heap.
When there are two elements, the complexity of the step is Log 1 as we have one element in each heap.
When there are four elements, the complexity of the step is Log 2 as we have two elements in each heap.
So, when there are n elements, the complexity of a step is O(Log n), as we have n/2 elements in each heap and
adding an element, as well as
removing an element from one heap and adding it to another,
takes O(Log n/2) = O(Log n) time.
So keeping track of the median of n elements essentially comes down to performing
2 * ( Log 1 + Log 2 + Log 3 + ... + Log n/2 ) steps.
The factor of 2 comes from performing the same step in 2 heaps.
The above summation can be handled in two ways. One way gives a tighter bound, but it is encountered less frequently in general. Here it goes:
Log a + Log b = Log (a*b) (by the product property of logarithms)
So the summation is actually Log ((n/2)!) = O(Log n!).
The second way is:
Each of the values Log 1, Log 2, ..., Log n/2 is less than or equal to Log (n/2).
As there are n/2 terms in total, the summation is less than (n/2) * Log (n/2).
This implies the function is upper-bounded by (n/2) * Log (n/2);
or, the complexity is O(n * Log n).
The second bound is looser but better known.
This is a great question, especially since you can find the median of a list of numbers in O(N) time using Quickselect.
But the dual priority-queue approach gives you O(N log N) unfortunately.
Riffing on the binary heap wiki article here: heapify is a bottom-up operation. You have all the data in hand, and this allows you to be cunning and reduce the number of swaps/comparisons to O(N). You can build an optimal structure from the get-go.
Adding elements one at a time from the top, as you are doing here, requires reorganizing every time. That's expensive, so the whole operation ends up being O(N log N).

Data Structure algorithm

How do I arrange the data structures below in ascending order of the time complexity required for inserts in the average-case scenario?
1. Sorted Array
2. Hash Table
3. Binary Search Tree
4. B+ Tree
In this answer, I will give you a starters on each data structure, and let you complete the rest on your own.
Sorted Array: In a sorted array of size k, the problem with each insertion is that you first need to find the index i where the element should be inserted (easy), and then shift all elements i, i+1, ..., k to the right in order to "make room" for the new element. This takes O(k) time, and it's actually k/2 moves on average.
So, the average cost of inserting n elements into a sorted array is 1/2 + 2/2 + 3/2 + ... + n/2 = (1 + ... + n)/2.
Use sum of arithmetic progression to see what is its complexity.
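One sorted-array insertion as code (the function name is mine): binary search finds the slot in O(log k), but vector::insert still shifts the whole tail right, which is the O(k) cost discussed above:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One sorted-array insertion: lower_bound finds the position in O(log k);
// vector::insert then shifts every later element right, the O(k) part
// that dominates the cost.
void sortedInsert(std::vector<int>& arr, int value) {
    auto pos = std::lower_bound(arr.begin(), arr.end(), value);
    arr.insert(pos, value);
}
```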
A hash table offers O(1) amortized average-case performance for inserting elements. What happens when you do n operations, each O(1)? What will be the total complexity?
In a Binary Search Tree (BST), each operation is O(h), where h is the current height of the tree. Luckily, when adding elements in random order to a binary search tree (even a non-self-balancing one), its average height is still O(logn).
So, to get the complexity of adding all elements, you need to sum Some_Const*(log(1) + log(2) + ... + log(n)).
See the hint at the end.
Similarly to a BST, a B+ tree also takes O(h) time per insertion. The difference is that h is bounded to be logarithmic even in the worst case. So the time complexity calculation remains Some_Other_Const*(log(1) + log(2) + ... + log(n)) for the average case.
Hints:
log(x) + log(y) = log(x*y)
log(n!) is in O(nlogn)

Time complexity for finding the diameter of a binary tree

I have seen various posts here that computes the diameter of a binary tree. One such solution can be found here (Look at the accepted solution, NOT the code highlighted in the problem).
I'm confused about why the time complexity of the code would be O(n^2). I don't see how traversing the nodes of a tree twice (once for the height, via getHeight(), and once for the diameter, via getDiameter()) would be n^2 instead of n + n, which is 2n. Any help would be appreciated.
As you mentioned, the time complexity of getHeight() is O(n).
For each node, the function getHeight() is called. So the complexity for a single node is O(n). Hence the complexity for the entire algorithm (for all nodes) is O(n*n).
It should be O(N): to calculate the height of every subtree rooted at every node, you only have to traverse the tree one time, using a post-order traversal (children before parent).
struct Node { int height; Node* lChild; Node* rChild; };

int treeHeight(Node* root)
{
    if (root == nullptr) return -1;
    root->height = max(treeHeight(root->rChild), treeHeight(root->lChild)) + 1;
    return root->height;
}
This will visit each node 1 time, so has order O(N).
Combine this with the result from the linked source, and you will be able to determine which 2 nodes have the longest path between in at worst another traversal.
Indeed this describes the way to do it in O(N)
The difference between this solution (the optimized one) and the referenced one is that the referenced solution re-computes the tree height each time after shrinking the problem by only 1 node (the root). Thus, from the above, the complexity will be O(N + (N - 1) + ... + 1).
The sum
1 + 2 + ... + N
is equal to
= N(N + 1)/2
And so the complexity of sum of all the operations from the repeated calls to getHeight will be O(N^2)
For completeness' sake: conversely, in the optimized solution, getHeight() has complexity O(1) after the precomputation, because each node stores its height as a data member.
All subtree heights may be precalculated (using O(n) time), so the total time complexity of finding the diameter is O(n).
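For completeness, a sketch of the single-traversal O(n) version with its own minimal Node: at each node, the candidate diameter is left height + right height + 2 edges, computed bottom-up in one post-order pass (empty trees have height -1, matching treeHeight above):

```cpp
#include <algorithm>
#include <cassert>

struct Node {
    int val;
    Node* lChild;
    Node* rChild;
};

// Returns the height of root's subtree; updates best with the longest
// path (in edges) passing through any node seen so far.
int heightAndDiameter(Node* root, int& best) {
    if (root == nullptr) return -1;
    int hl = heightAndDiameter(root->lChild, best);
    int hr = heightAndDiameter(root->rChild, best);
    best = std::max(best, hl + hr + 2);   // path through this node
    return std::max(hl, hr) + 1;
}

// Diameter in edges, computed in a single O(n) traversal.
int diameter(Node* root) {
    int best = 0;
    heightAndDiameter(root, best);
    return best;
}
```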

Sorting By:Binary Search Tree

I am a little bit confused regarding worst-case and average-case time complexity. My source of confusion is here.
My aim is to sort data in increasing order, and I chose a BST to accomplish this task. Here is what I am doing to print the data in increasing order:
1) Construct a binary search tree from the given input.
Time complexity: average case O(log n);
worst case O(H), where H is the height of the tree (here we can assume the height equals the number of nodes, H = n).
2) After finishing the first step, I traverse the BST in order to print the data in increasing order.
Time complexity: O(n), where n is the number of nodes in the tree.
Now I analyzed the total complexity to get my desired result (data in increasing order) as, for the average case: T(n) = O(log n) + O(n) = max(log n, n) = O(n),
and for the worst case: T(n) = O(n) + O(n) = max(n, n) = O(n).
This understanding differs from the concept in the link above. I know I am making some wrong interpretation; please correct me. I would appreciate your suggestions and thoughts.
Please refer to the title under the slide which I have mentioned:
In (1) you give the time per element; you need to multiply by the number of elements.
The time complexity needed to construct the binary tree is n times the complexity you suggest, since you need to insert each node.
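As code, the two steps look like this (a sketch; the function name is mine, and std::multiset, a balanced BST, stands in for the tree): n insertions at O(log n) each, then an in-order walk in O(n), for O(n log n) total rather than O(n):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Sketch of the two-step BST sort: std::multiset absorbs n insertions at
// O(log n) each; iterating it is the in-order traversal, yielding the
// values in increasing order in O(n).
std::vector<int> bstSort(const std::vector<int>& input) {
    std::multiset<int> tree;
    for (int x : input)
        tree.insert(x);                                   // n * O(log n)
    return std::vector<int>(tree.begin(), tree.end());    // in-order, O(n)
}
```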

Resources