How do I arrange the data structures below in ascending order of the average-case time complexity of inserts?
1. Sorted Array
2. Hash Table
3. Binary Search Tree
4. B+ Tree
In this answer, I will give you a starting point for each data structure and let you complete the rest on your own.
Sorted Array: In a sorted array of size k, each insertion first has to find the index i where the element belongs (easy), and then shift every element from index i onward one place to the right in order to "make place" for the new element. The shift takes O(k) time, and it's actually about k/2 moves on average.
So, the total cost of inserting n elements one by one into a sorted array is about 1/2 + 2/2 + 3/2 + ... + n/2 = (1+...+n)/2.
Use the sum of an arithmetic progression to see what complexity that gives.
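To make the shifting cost concrete, here is a minimal C++ sketch of a single sorted-array insert. The function name insertSorted and the choice to return the number of shifted elements are my own, for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Inserts value into an already sorted vector and returns how many
// existing elements had to move one slot to the right.
std::size_t insertSorted(std::vector<int>& sorted, int value) {
    // Finding the position is the cheap part: O(log k) comparisons.
    auto pos = std::upper_bound(sorted.begin(), sorted.end(), value);
    // Everything from pos to the end must shift: O(k) in the worst case.
    std::size_t shifted = static_cast<std::size_t>(sorted.end() - pos);
    sorted.insert(pos, value);
    return shifted;
}
```

Summing the returned counts over n inserts gives the total number of moves that the arithmetic progression above estimates.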
A hash table offers O(1) average (amortized) performance for inserting an element. What happens when you do n operations, each O(1)? What will the total complexity be?
In a Binary Search Tree (BST), each operation is O(h), where h is the current height of the tree. Luckily, when elements are added in random order to a binary search tree (even a non-self-balancing one), its average height is still O(log n).
So, to get the complexity of adding all elements, you need to sum Some_Const*(log(1) + log(2) + ... + log(n))
See the hints at the end.
Similarly to a BST, a B+ tree also takes O(h) time per insertion. The difference is that h is guaranteed to be logarithmic even in the worst case. So, the average-case total is again Some_Other_Const*(log(1) + log(2) + ... + log(n)).
Hints:
log(x) + log(y) = log(x*y)
log(n!) is in O(nlogn)
According to Wikipedia, partition-based selection algorithms such as quickselect have runtime of O(n), but I am not convinced by it. Can anyone explain why it is O(n)?
In the normal quick-sort, the runtime is O(n log n). Every time we partition the branch into two branches (greater than the pivot and lesser than the pivot), we need to continue the process in both branches, whereas quickselect only needs to process one branch. I totally understand these points.
However, if you think in the Binary Search algorithm, after we chose the middle element, we are also searching only one side of the branch. So does that make the algorithm O(1)? No, of course, the Binary Search Algorithm is still O(log N) instead of O(1). This is also the same thing as the search element in a Binary Search Tree. We only search for one side, but we still consider O(log n) instead of O(1).
Can someone explain why in quickselect, if we continue the search in one side of pivot, it is considered O(1) instead of O(log n)? I consider the algorithm to be O(n log n), O(N) for the partitioning, and O(log n) for the number of times to continue finding.
There are several different selection algorithms, from the much simpler quickselect (expected O(n), worst-case O(n^2)) to the more complex median-of-medians algorithm (Θ(n)). Both of these algorithms work by using a quicksort partitioning step (time O(n)) to rearrange the elements and position one element into its proper position. If that element is at the index in question, we're done and can just return that element. Otherwise, we determine which side to recurse on and recurse there.
Let's now make a very strong assumption - suppose that we're using quickselect (pick the pivot randomly) and on each iteration we manage to guess the exact middle of the array. In that case, our algorithm will work like this: we do a partition step, throw away half of the array, then recursively process one half of the array. This means that on each recursive call we end up doing work proportional to the length of the array at that level, but that length keeps decreasing by a factor of two on each iteration. If we work out the math (ignoring constant factors, etc.) we end up getting the following time:
Work at the first level: n
Work after one recursive call: n / 2
Work after two recursive calls: n / 4
Work after three recursive calls: n / 8
...
This means that the total work done is given by
n + n / 2 + n / 4 + n / 8 + n / 16 + ... = n (1 + 1/2 + 1/4 + 1/8 + ...)
Notice that this last term is n times the sum of 1, 1/2, 1/4, 1/8, etc. If you work out this infinite sum, despite the fact that there are infinitely many terms, the total sum is exactly 2. This means that the total work is
n + n / 2 + n / 4 + n / 8 + n / 16 + ... = n (1 + 1/2 + 1/4 + 1/8 + ...) = 2n
This may seem weird, but the idea is that if we do linear work on each level but keep cutting the array in half, we end up doing only roughly 2n work.
An important detail here is that there are indeed O(log n) different iterations here, but not all of them are doing an equal amount of work. Indeed, each iteration does half as much work as the previous iteration. If we ignore the fact that the work is decreasing, you can conclude that the work is O(n log n), which is correct but not a tight bound. This more precise analysis, which uses the fact that the work done keeps decreasing on each iteration, gives the O(n) runtime.
Of course, this is a very optimistic assumption - we almost never get a 50/50 split! - but using a more powerful version of this analysis, you can say that if you can guarantee any constant factor split, the total work done is only some constant multiple of n. If we pick a totally random element on each iteration (as we do in quickselect), then on expectation we only need to pick two elements before we end up picking some pivot element in the middle 50% of the array, which means that, on expectation, only two rounds of picking a pivot are required before we end up picking something that gives a 25/75 split. This is where the expected runtime of O(n) for quickselect comes from.
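For reference, here is a minimal sketch of quickselect with a random pivot, written iteratively. The function name quickselect, the fixed seed, and the partition scheme (pivot moved to the end, then a single left-to-right scan) are my own choices for illustration; other partition schemes work equally well. It assumes a non-empty input and k < a.size().

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Returns the k-th smallest element (k is 0-based) of a.
// Takes the vector by value so the caller's data is left untouched.
int quickselect(std::vector<int> a, std::size_t k) {
    static std::mt19937 rng(12345);  // fixed seed, only to keep the sketch reproducible
    std::size_t lo = 0, hi = a.size() - 1;
    while (lo < hi) {
        // Pick a random pivot in [lo, hi] and park it at the end of the range.
        std::uniform_int_distribution<std::size_t> pick(lo, hi);
        std::swap(a[pick(rng)], a[hi]);
        int pivot = a[hi];
        // Partition: O(hi - lo) work, linear in the size of the current range.
        std::size_t store = lo;
        for (std::size_t i = lo; i < hi; ++i) {
            if (a[i] < pivot) std::swap(a[i], a[store++]);
        }
        std::swap(a[store], a[hi]);  // the pivot is now in its final sorted position
        if (k == store) return a[k];
        // Continue on one side only; the other side is thrown away.
        if (k < store) hi = store - 1;
        else lo = store + 1;
    }
    return a[lo];
}
```

The loop body does work proportional to the current range, and the range shrinks every round, which is exactly the n + n/2 + n/4 + ... pattern when the pivots are reasonably central.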
A formal analysis of the median-of-medians algorithm is much harder because its recurrence is not easy to analyze. Intuitively, the algorithm works by doing a small amount of work to guarantee a good pivot is chosen. However, because there are two different recursive calls made, an analysis like the above won't work correctly. You can either use an advanced result called the Akra-Bazzi theorem, or use the formal definition of big-O to explicitly prove that the runtime is O(n). For a more detailed analysis, check out "Introduction to Algorithms, Third Edition" by Cormen, Leiserson, Rivest, and Stein.
Let me try to explain the difference between selection & binary search.
Binary search does O(1) work in each step. There are log(N) steps in total, which makes it O(log(N)).
A selection algorithm does O(n) work in each step, but this 'n' is cut in half each time (assuming good pivots). There are log(N) steps in total.
This makes it N + N/2 + N/4 + ... + 1 (log(N) terms) < 2N = O(N)
For binary search it is 1 + 1 + ... (log(N) times) = O(logN)
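The two sums can be checked with a tiny C++ sketch. Both function names are hypothetical; they model only the cost pattern (range halves each step), not the actual algorithms.

```cpp
#include <cstdint>

// Selection-like pattern: each step scans the whole current range, then halves it.
// Returns the total number of elements touched.
std::uint64_t halvingScanWork(std::uint64_t n) {
    std::uint64_t total = 0;
    for (; n >= 1; n /= 2) total += n;
    return total;
}

// Binary-search-like pattern: each step does one unit of work, then halves the range.
// Returns the number of steps.
std::uint64_t halvingSteps(std::uint64_t n) {
    std::uint64_t steps = 0;
    for (; n >= 1; n /= 2) ++steps;
    return steps;
}
```

For n = 1024, halvingScanWork returns 2047 (just under 2n) while halvingSteps returns 11 (log2(n) + 1). Both loops run the same number of times; only the work per step differs.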
In Quicksort, the recursion tree is lg(N) levels deep (with good pivots) and each of these levels requires O(N) work. So the total running time is O(N lg N).
In Quickselect, the recursion is lg(N) levels deep and each level requires only half the work of the level above it. This produces the following:
N * (1/1 + 1/2 + 1/4 + 1/8 + ...)
or
N * Summation(1/2^i)
0 <= i <= lgN
The important thing to note here is that i goes from 0 to lgN, not from 0 to N. Even if it ran to infinity, the summation would only reach 2.
The summation is therefore less than 2. Hence Quickselect = O(2N) = O(N).
Quicksort does not have a big-O of nlogn - its worst case runtime is n^2.
I assume you're asking about Hoare's Selection Algorithm (or quickselect), not the naive selection algorithm that is O(kn). Like quicksort, quickselect has a worst case runtime of O(n^2) (if bad pivots are chosen), not O(n). It can run in expected O(n) time because it only recurses into one side, as you point out.
Because for selection, you're not sorting, necessarily. You can simply count how many items there are which have any given value. So an O(n) median can be performed by counting how many times each value comes up, and picking the value that has 50% of items above and below it. It's 1 pass through the array, simply incrementing a counter for each element in the array, so it's O(n).
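For example, if you have an array "a" of 8 bit numbers, you can do the following:
int histogram[256];
int i, sum, median;

/* Clear the counters. */
for (i = 0; i < 256; i++)
{
    histogram[i] = 0;
}
/* Count how often each value occurs ("a" holds unsigned 8-bit values). */
for (i = 0; i < numItems; i++)
{
    histogram[a[i]]++;
}
/* Walk the counts until more than half of the items are covered.
   Assumes numItems > 0. */
median = 0;
sum = histogram[0];
while (sum <= (numItems / 2))
{
    median++;
    sum += histogram[median];
}
At the end, the variable "median" will contain the 8-bit value of the median (the upper of the two middle values when numItems is even). It takes one pass through the entire array "a" to count the values, plus at most one pass through the 256-entry histogram to get the final value.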
A way of finding the median of a given set of n numbers is to distribute them among 2 heaps: a max-heap containing the lower ceil(n/2) numbers and a min-heap containing the rest. If maintained in this way, the median is the max of the first heap (along with the min of the second heap if n is even). Here's my C++ code that does this:
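#include <functional>
#include <iostream>
#include <queue>
#include <vector>
using namespace std;

int main() {
    priority_queue<int, vector<int> > left;                 // max-heap: lower half
    priority_queue<int, vector<int>, greater<int> > right;  // min-heap: upper half
    int n, a;
    cin >> n;  // n = number of items
    for (int i = 0; i < n; i++) {
        cin >> a;
        if (left.empty())
            left.push(a);
        else if (left.size() <= right.size()) {
            // left has to grow by one
            if (a <= right.top())
                left.push(a);
            else {
                // a belongs in the upper half: move the smallest upper value down
                left.push(right.top());
                right.pop();
                right.push(a);
            }
        }
        else {
            // right has to grow by one
            if (a >= left.top())
                right.push(a);
            else {
                // a belongs in the lower half: move the largest lower value up
                right.push(left.top());
                left.pop();
                left.push(a);
            }
        }
    }
}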
We know that the heapify operation has linear complexity. Does this mean that if we insert numbers one by one into the two heaps as in the above code, we are finding the median in linear time?
Linear time heapify is for the cost of building a heap from an unsorted array as a batch operation, not for building a heap by inserting values one at a time.
Consider a min heap into which you insert a stream of values in decreasing order. Each new value is appended at the bottom of the heap and, being smaller than everything already there, sifts all the way up to the top. Consider just the last half of the values inserted. By then the heap has very nearly its full height, which is log(n), so each value moves up about log(n) levels, and the cost of inserting those n/2 values is O(n log(n)).
If I present a stream of values in decreasing order to your median-finding algorithm, one of the things it has to do is build a min heap from values that arrive in decreasing order, so the cost of the median finding is O(n log(n)). In fact, the max heap is going to be doing a lot of deletes as well as insertions, but this is just a constant factor on top, so I think the overall complexity is still O(n log(n)).
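The gap between batch building and one-at-a-time insertion can be measured by counting comparisons. The sketch below is only an illustration: the function names are made up, and it uses the standard std::make_heap and std::push_heap with a counting comparator. The input is in decreasing order, which for a min heap forces every pushed value to sift up to the root.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Comparisons used when a min heap is built by pushing values one at a time.
std::size_t comparisonsOneAtATime(const std::vector<int>& values) {
    std::size_t count = 0;
    auto greaterCounted = [&count](int x, int y) { ++count; return x > y; };
    std::vector<int> heap;
    for (int v : values) {
        heap.push_back(v);
        std::push_heap(heap.begin(), heap.end(), greaterCounted);  // sift-up
    }
    return count;
}

// Comparisons used when the same min heap is built in one batch.
std::size_t comparisonsBatch(std::vector<int> values) {
    std::size_t count = 0;
    auto greaterCounted = [&count](int x, int y) { ++count; return x > y; };
    std::make_heap(values.begin(), values.end(), greaterCounted);  // bottom-up heapify
    return count;
}

// Helper producing n, n-1, ..., 1.
std::vector<int> decreasingValues(int n) {
    std::vector<int> v;
    for (int i = n; i >= 1; --i) v.push_back(i);
    return v;
}
```

The C++ standard bounds std::make_heap at 3N comparisons, while the push-based count on this input grows like N log N, so the second number pulls ahead as N increases.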
When there is one element, the complexity of the step is Log 1 because of a single element being in a single heap.
When there are two elements, the complexity of the step is Log 1 as we have one element in each heap.
When there are four elements, the complexity of the step is Log 2 as we have two elements in each heap.
So, when there are n elements, the complexity of the step is Log n, as we have n/2 elements in each heap and
adding an element, as well as
removing an element from one heap and adding it to another,
takes O(Log n/2) = O(Log n) time.
So keeping track of the median of n elements essentially takes
2 * ( Log 1 + Log 2 + Log 3 + ... + Log n/2 ) steps.
The factor of 2 comes from performing the same step in 2 heaps.
The above summation can be handled in two ways. One way gives a tighter bound but it is encountered less frequently in general. Here it goes:
Log a + Log b = Log a*b (By property of logarithms)
So, the summation is actually Log ((n/2)!) = O(Log n!).
The second way is:
Each of the values Log 1, Log 2, ... Log n/2 is less than or equal to Log n/2
As there are a total n/2 terms, the summation is less than (n/2) * Log (n/2)
This implies the function is bounded above by (n/2) * Log (n/2)
Or, the complexity is O(n * Log n).
The second bound is looser but more well known.
This is a great question, especially since you can find the median of a list of numbers in O(N) time using Quickselect.
But the dual priority-queue approach gives you O(N log N) unfortunately.
Riffing on the binary heap Wikipedia article, heapify is a bottom-up operation. You have all the data in hand and this allows you to be cunning and reduce the number of swaps/comparisons to O(N). You can build an optimal structure from the get-go.
Adding elements one at a time, as you are doing here, requires reorganizing the heap every time. That's expensive, so the whole operation ends up being O(N log N).
The Quicksort algorithm is:
Quicksort(A,p,r)
    if p<r then
        q <- partition(A,p,r)
        Quicksort(A,p,q-1)
        Quicksort(A,q+1,r)
According to my notes, the cost of Quicksort(A,1,n) is T(n)=T(q)+T(n-q)+ cost of partition.
Why is the cost like that and not T(n)=T(q-1)+T(n-q)+cost of partition?
And also, why is the cost of the worst case T(n)=T(n-1)+Θ(n)?
I'm more confident about the answer to your second question.
In the worst case, the pivot can always turn out to be the lowest number (or the highest number) in the array. In that case, the divided arrays shall be of length n-1 and 0 respectively. Hence the recurrence relation shall be:
T(n)= T(n-1)+T(0) + Work done for partition
= T(n-1) + 0 + O(n)
For example, the worst case occurs if the array is originally sorted in ascending order and you always choose the 1st element as the pivot.
Initial Array: {1, 2, 3, 4, 5}
Pivot Element: 1.
Partitioned arrays: {} and {2,3,4,5}
Next pivot element: 2
Partitioned arrays: {} {3,4,5}
...
Here you can see that at each partition, the size of problem decreases by just 1 and not by a factor of half.
Hence T(n) = T(n-1) + Work done for partitioning( O(n) )
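The recurrence can be observed directly by counting comparisons. Below is a minimal sketch of quicksort that always takes the first element as the pivot. The function name and the particular partition scheme are my own assumptions, not the partition from the question's notes.

```cpp
#include <utility>
#include <vector>

// Sorts a[lo..hi] in place using the first element as the pivot and
// returns the number of element comparisons performed.
long long quicksortFirstPivot(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return 0;
    int pivot = a[lo];
    int store = lo;  // last index of the "smaller than pivot" region
    long long comparisons = 0;
    for (int i = lo + 1; i <= hi; ++i) {
        ++comparisons;  // one comparison per non-pivot element: the Θ(n) partition cost
        if (a[i] < pivot) {
            ++store;
            std::swap(a[store], a[i]);
        }
    }
    std::swap(a[lo], a[store]);  // pivot lands at its final position
    comparisons += quicksortFirstPivot(a, lo, store - 1);
    comparisons += quicksortFirstPivot(a, store + 1, hi);
    return comparisons;
}
```

On an already sorted array of n elements the left part is always empty, so the count is (n-1) + (n-2) + ... + 1 = n(n-1)/2, which is what T(n) = T(n-1) + Θ(n) solves to.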
Only the highest-order terms are considered when performing time complexity analysis, because only they remain relevant as the input gets larger. For example: O(0.0001n^3 + 0.002n^2 + 0.1n + 1000000) = O(n^3). By the same reasoning, T(q-1) and T(q) can be treated as interchangeable here, since the -1 does not change the asymptotic result for large values of q.
I am not sure if your note is entirely accurate. user1990169 has kindly answered why the general Quicksort has the worst case time complexity of O(n^2), but it's actually possible to spend O(n) time to determine the median in an unsorted array of n elements, meaning we can always pick the median value (the best value) for the pivot in each iteration. The time complexity of T(n)=T(n-1)+Θ(n) may result from an array where all elements have the same value, in which case, depending on implementation, all elements other than the pivot may get put into the LEFT partition or the RIGHT partition. However, even this can be avoided with some clever implementation. Thus the complexity analysis of T(n)=T(n-1)+Θ(n) may be from a specific implementation of Quicksort, rather than an optimal one.
I am a little bit confused regarding worst-case and average-case time complexity. My source of confusion is here
My aim is to sort data in increasing order. I chose a BST to accomplish my task of sorting. Here is what I am doing to print the data in increasing order.
1) Construct a binary search tree for the given input.
Time complexity: Avg Case O(log n)
Worst Case O(H) {H is the height of the tree; here we can assume the height is equal to the number of nodes, H = n}
2) After finishing the first step, I traverse the BST in inorder to print the data in increasing order.
Time complexity: O(n) {n is the number of nodes in the tree}
Now, my analysis of the total complexity to get my desired result (data in increasing order) is, for Avg Case: T(n) = O(log n) + O(n) = max(log n, n) = O(n)
For Worst Case: T(n) = O(n) + O(n) = max(n, n) = O(n)
The above is my understanding, which differs from the concept in the above link. I know I am making some wrong interpretation; please correct me. I would appreciate your suggestions and thoughts.
Please refer to this title under the slide which I have mentioned:
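In (1) you give the time per element; you need to multiply it by the number of elements.
The time needed to construct the binary tree is n times the complexity you suggest, as you need to insert each node. That makes the construction O(n log n) on average and O(n^2) in the worst case, and the construction, not the O(n) traversal, dominates the total.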
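Here is a minimal sketch of the two steps (build the tree, then traverse it inorder), assuming a plain unbalanced BST. The names Node, insert, inorder and treeSort are hypothetical, chosen just for this example.

```cpp
#include <memory>
#include <vector>

struct Node {
    int key;
    std::unique_ptr<Node> left, right;
    explicit Node(int k) : key(k) {}
};

// One insertion walks from the root to a leaf: O(h) for a tree of height h.
void insert(std::unique_ptr<Node>& root, int key) {
    std::unique_ptr<Node>* cur = &root;
    while (*cur) {
        cur = key < (*cur)->key ? &(*cur)->left : &(*cur)->right;
    }
    cur->reset(new Node(key));
}

// Inorder traversal visits every node once: O(n) in total.
void inorder(const Node* node, std::vector<int>& out) {
    if (!node) return;
    inorder(node->left.get(), out);
    out.push_back(node->key);
    inorder(node->right.get(), out);
}

// Step 1: n insertions, each O(h). Step 2: one O(n) traversal.
std::vector<int> treeSort(const std::vector<int>& input) {
    std::unique_ptr<Node> root;
    for (int x : input) insert(root, x);
    std::vector<int> out;
    inorder(root.get(), out);
    return out;
}
```

The loop in treeSort is where the factor of n comes from: the O(h) cost of insert is paid once per element, not once for the whole input.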