complexity of heap data structure

complexity of heap data structure - data-structures

I'm trying to count running time of build heap in heap sort algorithm
BUILD-HEAP(A)
heapsize := size(A);
for i := floor(heapsize/2) downto 1
do HEAPIFY(A, i);
end for
END
The basic idea behind why the time is linear is due to the fact that the time complexity of heapify depends on where it is within the heap. It takes O(1) time when the node is a leaf node (which makes up at least half of the nodes) and O(logn) time when it’s at the root.
The O(n) time can be proven by solving the following:
image by HostMath
what I understand here O(h) means worst case for heapify for each node, so height=ln n if the node is in the root for example to heapify node 2,1,3 it takes ln_2 3 =1.5 the height of root node 2 is 1, so the call to HEAPIFY is ln_2 n=height = O(h)
BUILD-HEAP(A)
heapsize := size(A);
for i := floor(heapsize/2) downto 1
do HEAPIFY(A, i);
end for
END
suppose this is the tree
4 .. height2
/ \
2 6 .. height 1
/\ /\
1 3 5 7 .. height 0
A quick look over the above algorithm suggests that the running time is O(nlg(n)), since each call to Heapify costs O(lg(n)) and Build-Heap makes O(n) such calls.
This upper bound, though correct, is not asymptotically tight.
the Time complexity for Building a Binary Heap is O(n).
im trying to understand, the heapsize/2 means for loop only call HEAPIFY heapsize/2 times. in tree above, heapsize=7, 7/2= 3 so root will be {1,2,6} so n/2
and every call to HEAPIFY will call HEAPIFY again until reach the last leaf of every root,
for example 2 will call heapify 1 times, 6 will call heapify 1 times, and 1 will call heapify 2 times. so it is the height of the tree which is ln n. am i right?
then the compelxity will be O(n/2 * ln n) = O(n ln n)
which one is right? O(n ln n) or O(n)?
and how can i get O(n)?
im reading this as reference , please correct me if im wrong thanks!
https://www.growingwiththeweb.com/data-structures/binary-heap/build-heap-proof/
this is the reference i used, and also i read about this in CLRS book
https://www.hostmath.com/Show.aspx?Code=ln_2%203

The complexity is O(n) here is why. Let's assume that the tree has n nodes. Since a heap is a nearly complete binary tree (according to CLRS), the second half of nodes are all leaves; so, there is no need to heapify them. Now for the remaining half. We start from node at position n/2 and go backwards. In heapifying, a node can only move downwards so, as you mentioned, it takes at most as much as the height of the node swap operations to complete the heapify for that node.
With n nodes, we have at most log n levels, where level 0 has the root and level 1 has at most 2 nodes and so on:
level 0: x
. / \
level 1: x x
.
level log n: x x x x x x x x
So, we have the following:
All nodes at level logn-1 need at most 1 swap for being heapified. (at most n/2 nodes here)
All nodes at level logn-2 need at most 2 swaps for being heapified. (at most n/4 nodes here)
....
All nodes at level 0 need at most logn swaps for being heapified. (at most 1 node here, i.e, the root)
So, the sum can be written as follows:
(1 x n/2 + 2 x n/4 + 3 x n/8 + ... + log n x n/2^logn)
Let's factor out n, we get:
n x (1/2 + 2/4 + 3/8 + ... + log n/2^logn)
Now the sum (1/2 + 2/4 + 3/8 + ... + log n/2^logn) is always <= 2 (see Sigma i over 2^i); therefore, the aforementioned sum we're interested in is always <= 2 x n. So, the complexity is O(n).

Related

What is the time complexity of this BFS algorithm?

I looked at LeetCode question 270. Perfext Squares:
Given an integer n, return the least number of perfect square numbers that sum to n.
A perfect square is an integer that is the square of an integer; in other words, it is the product of some integer with itself. For example, 1, 4, 9, and 16 are perfect squares while 3 and 11 are not.>
Example 1:
Input: n = 12
Output: 3
Explanation: 12 = 4 + 4 + 4.
I solved it using the following algorithm:
def numSquares(n):
squares = [i**2 for i in range(1, int(n**0.5)+1)]
step = 1
queue = {n}
while queue:
tempQueue = set()
for node in queue:
for square in squares:
if node-square == 0:
return step
if node < square:
break
tempQueue.add(node-square)
queue = tempQueue
step += 1
It basically tries to go from goal number to 0 by subtracting each possible number, which are : [1 , 4, 9, .. sqrt(n)] and then does the same work for each of the numbers obtained.
Question
What is the time complexity of this algorithm? The branching in every level is sqrt(n) times, but some branches are destined to end early... which makes me wonder how to derive the time complexity.

If you think about what you're doing, you can imagine that you're doing a breadth-first search over a graph with n + 1 nodes (all the natural numbers between 0 and n, inclusive) and some number of edges m, which we'll determine later on. Your graph is essentially represented as an adjacency list, since at each point you iterate over all the outgoing edges (squares less than or equal to your number) and stop as soon as you consider a square that's too large. As a result, the runtime will be O(n + m), and all we have to do now is work out what m is.
(There's another cost here in computing all the square roots up to and including n, but that takes time O(n1/2), which is dominated by the O(n) term.)
If you think about it, the number of outgoing edges from each number k will be given by the number of perfect squares less than or equal to k. That value is equal to ⌊√k⌋ (check this for a few examples - it works!). This means that the total number of edges is upper-bounded by
√0 + √1 + √2 + ... + √n
We can show that this sum is Θ(n3/2). First, we'll upper-bound this sum at O(n3/2), which we can do by noting that
√0 + √1 + √2 + ... + √n
≤ √n + √n + √ n + ... + √n (n+1) times
= (n + 1)√n
= O(n3/2).
To lower-bound this at Ω(n3/2), notice that
√0 + √1 + √2 + ... + √ n
≥ √(n/2) + √(n/2 + 1) + ... + √(n) (drop the first half of the terms)
≥ √(n/2) + √(n/2) + ... + √(n/2)
= (n / 2)√(n / 2)
= Ω(n3/2).
So overall, the number of edges is Θ(n3/2), so using a regular analysis of breadth-first search we can see that the runtime will be O(n3/2).
This bound is likely not tight, because this assumes that you visit every single node and every single edge, which isn't going to happen. However, I'm not sure how to tighten things much beyond this.
As a note - this would be a great place to use A* search instead of breadth-first search, since you can fairly easily come up with heuristics to underestimate the remaining total distance (say, take the number and divide it by the largest perfect square less than it). That would cause the search to focus on extremely promising paths that jump rapidly toward 0 before less-good paths, like, say, always taking steps of size one.
Hope this helps!

Some observations:
The number of squares up to n is √n (floored to the nearest integer)
After the first iteration of the while loop, tempQueue will have √n entries
tempQueue can never have more than n entries, since all these values are positive, less than n and unique.
Every natural number can be written as the sum of four integer squares. So that means your BFS algorithm's while loop will iterate at the most 4 times. If the return statement did not get executed during any of the first 3 iterations, it is guaranteed it will in the 4th.
Every statement (except for the initialisation of squares) runs in constant time, even the call to .add().
The initialisation of squares has a list comprehension loop that has √n iterations, and range runs in constant time, so that initialisation has a time complexity of O(√n).
Now we can set a ceiling to the number of times the if node-square == 0 statement is executed (or any other statement in the innermost loop's body):
1⋅√n + √n⋅√n + n⋅√n + n⋅√n
Each of the 4 terms corresponds to an iteration of the while loop. The left factor of each product corresponds to the maximum size of queue in that particular iteration, and the factor at the right corresponds to the size of squares (always the same). This simplifies to:
√n + n + 2n3⁄2
In terms of time complexity this is:
O(n3⁄2)
This is the worst case time complexity. When the while loop only has to iterate twice, it is O(n), and when only once (when n is a square), it is O(√n).

Heap-sort time complexity deep understanding

When I studied the Data Structures course in the university, I learned the following axioms:
Insertion of a new number to the heap takes O(logn) in worst case (depending on how high in the tree it reaches when inserted as a leaf)
Building a heap of n nodes, using n insertions, starting from an empty heap, is summed to O(n) time, using amortized analysis
Removal of the minimum takes O(logn) time in worst case (depending on how low the new top node reaches, after it was swapped with the last leaf)
Removal of all the minimums one by one, until the heap is empty, takes O(nlogn) time complexity
Reminder: The steps of "heapsort" algorithm are:
Add all the array values to a heap: summed to O(n) time complexity using the amortized-analysis trick
Pop the minimum out of the heap n times and place the i-th value in the i-th index of the array : O(nlogn) time complexity, as the amortized-analysis trick does not work when popping the minimum
My question is: Why the amortized-analysis trick does not work when emptying the heap, causing heap-sort algorithm to take O(nlogn) time and not O(n) time?
Edit/Answer
When the heap is stored in an array (rather than dynamic tree nodes with pointers), then we can build the heap bottom up, i.e., starting from the leaves and up to the root, then using amortized-analysis we can get total time complexity of O(n), whereas we cannot empty the heap minima's bottom up.

Assuming you're only allowed to learn about the relative ranking of two objects by comparing them, then there's no way to dequeue all elements from a binary heap in time O(n). If you could do this, then you could sort a list in time O(n) by building a heap in time O(n) and then dequeuing everything in time O(n). However, the sorting lower bound says that comparison sorts, in order to be correct, must have a runtime of Ω(n log n) on average. In other words, you can't dequeue from a heap too quickly or you'd break the sorting barrier.
There's also the question about why dequeuing n elements from a binary heap takes time O(n log n) and not something faster. This is a bit tricky to show, but here's the basic idea. Consider the first half of the dequeues you make on the heap. Look at the values that actually got dequeued and think about where they were in the heap to begin with. Excluding the ones on the bottom row, everything else that was dequeued had to percolate up to the top of the heap one swap at a time in order to be removed. You can show that there are enough elements in the heap to guarantee that this alone takes time Ω(n log n) because roughly half of those nodes will be deep in the tree. This explains why the amortized argument doesn't work - you're constantly pulling deep nodes up the heap, so the total distance the nodes have to travel is large. Compare this to the heapify operation, where most nodes travel very little distance.

Let me show you "mathematically" how we can compute the complexity of transforming an arbitrary array into an heap (let me call this "heap build") and then sorting it with heapsort.
Heap build time analysis
In order to transform the array into an heap, we have to look at each node with children and "heapify" (sink) that node. You should ask yourself how many compares we perform; if you think about it, you see that (h = tree height):
For each node at level i, we make h-i compares: #comparesOneNode(i) = h-i
At level i, we have 2^i nodes: #nodes(i) = 2^i
So, generally T(n,i) = #nodes(i) * #comparesOneNode(i) = 2^i *(h-i), is the time spent for "compares" at level "i"
Let's make an example. Suppose to have an array of 15 elements, i.e., the height of the tree would be h = log2(15) = 3:
At level i=3, we have 2^3=8 nodes and we make 3-3 compares for each node: correct, since at level 3 we have only nodes without children, i.e., leaves. T(n, 3) = 2^3*(3-3) = 0
At level i=2, we have 2^2=4 nodes and we make 3-2 compares for each node: correct, since at level 2 we have only level 3 with which we can compare. T(n, 2) = 2^2*(3-2) = 4 * 1
At level i=1, we have 2^1=2 nodes and we make 3-1 compares for each node: T(n, 1) = 2^1*(3-1) = 2 * 2
At level i=0, we have 2^0=1 node, the root, and we make 3-0 compares: T(n, 0) = 2^0*(3-0) = 1 * 3
Ok, generally:
T(n) = sum(i=0 to h) 2^i * (h-i)
but if you remember that h = log2(n), we have
T(n) = sum(i=0 to log2(n)) 2^i * (log2(n) - i) =~ 2n
Heapsort time analysis
Now, here the analysis is really similar. Every time we "remove" the max element (root), we move to root the last leaf in the tree, heapify it and repeat till the end. So, how many compares do we perform here?
At level i, we have 2^i nodes: #nodes(i) = 2^i
For each node at level "i", heapify, in the worst case, will always do the same number of compares that is exactly equal to the level "i" (we take one node from level i, move it to root, call heapify, and heapify in worst case will bring back the node to level i, performing"i" compares): #comparesOneNode(i) = i
So, generally T(n,i) = #nodes(i) * #comparesOneNode(i) = 2^i*i, is the time spent for removing the first 2^i roots and bring back to the correct position the temporary roots.
Let's make an example. Suppose to have an array of 15 elements, i.e., the height of the tree would be h = log2(15) = 3:
At level i=3, we have 2^3=8 nodes and we need to move each one of them to the root place and then heapify each of them. Each heapify will perform in worst case "i" compares, because the root could sink down to the still existent level "i". T(n, 3) = 2^3 * 3 = 8*3
At level i=2, we have 2^2=4 nodes and we make 2 compares for each node: T(n, 2) = 2^2*2 = 4 * 2
At level i=1, we have 2^1=2 nodes and we make 1 compare for each node: T(n, 1) = 2^1*1 = 2 * 1
At level i=0, we have 2^0=1 node, the root, and we make 0 compares: T(n, 0) = 0
Ok, generally:
T(n) = sum(i=0 to h) 2^i * i
but if you remember that h = log2(n), we have
T(n) = sum(i=0 to log2(n)) 2^i * i =~ 2nlogn
Heap build VS heapsort
Intuitively, you can see that heapsort is not able to "amortise" his cost because every time we increase the number of nodes, more compares we have to do, while we have exactly the opposite in the heap build functionality! You can see here:
Heap build: T(n, i) ~ 2^i * (h-i), if i increases, #nodes increases, but #compares decreases
Heapsort: T(n, i) ~ 2^i * i, if i increases, #nodes increases and #compares increases
So:
Level i=3, #nodes(3)=8, Heap build does 0 compares, heapsort does 8*3 = 24 compares
Level i=2, #nodes(2)=4, Heap build does 4 compares, heapsort does 4*2 = 8 compares
Level i=1, #nodes(1)=2, Heap build does 4 compares, heapsort does 2*1 = 2 compares
Level i=0, #nodes(0)=1, Heap build does 3 compares, heapsort does 1*0 = 1 compares

How come the height of recursion tree in merge sort lg(n)+1

I read few questions with answers as suggested by stackoveflow. I am following Introduction to Algorithm by cormen book for my self study. Everything is been explained clearly in that book but the only thing that is not explained is how to calculate height of tree in merge sort analysis.
I am still on chapter 2 haven't gone far if that is explained in later chapters.
I want to ask if the top most node is divided 2 times and so on. It gives me a height of ln(n) which is log2(n) what if i divide the main problem in five subproblems. Would it have been log5(n) then? Please explain how is this expressed in logarithm as well why not in some linear term?
Thanks

Recursion trees represent self-calls in recursive procedures. Typical mergsort calls itself twice, each call sorting half the input, so the recursion tree is a complete binary tree.
Observe that complete binary trees of increasing height display a pattern in their numbers of nodes:
height new "layer" total nodes
(h) of nodes (N)
1 1 1
2 2 3
3 4 7
4 8 15
...
Each new layer at level L has 2^L nodes (where level 0 is the root). So it's easy to see that total nodes N as a function of h is just
N = 2^h - 1
Now solve for h:
h = log_2 (N + 1)
If you have a 5-way split and merge, then each node in the tree has 5 children, not two. The table becomes:
height new "layer" total nodes
(h) of nodes (N)
1 1 1
2 5 6
3 25 31
4 125 156
...
Here we have N = (5^h - 1) / 4. Solving for h,
h = log_5 (4N + 1)
In general for a K-way merge, the tree has N = (K^h - 1) / (K - 1), so the height is given by
h = log_K ((K - 1)N + 1) = O(log N) [the log's base doesn't matter to big-O]
However, be careful. In K-way mergesort, selecting each element to merge requires Theta(log K) time. You can't ignore this cost either theoretically or in practice!

Proof that the height of a balanced binary-search tree is log(n)

The binary-search algorithm takes log(n) time, because of the fact that the height of the tree (with n nodes) would be log(n).
How would you prove this?

Now here I am not giving mathematical proof. Try to understand the problem using log to the base 2. Log2 is the normal meaning of log in computer science.
First, understand it is binary logarithm (log2n) (logarithm to the base 2).
For example,
the binary logarithm of 1 is 0
the binary logarithm of 2 is 1
the binary logarithm of 3 is 1
the binary logarithm of 4 is 2
the binary logarithm of 5, 6, 7 is 2
the binary logarithm of 8-15 is 3
the binary logarithm of 16-31 is 4 and so on.
For each height the number of nodes in a fully balanced tree are
Height Nodes Log calculation
0 1 log21 = 0
1 3 log23 = 1
2 7 log27 = 2
3 15 log215 = 3
Consider a balanced tree with between 8 and 15 nodes (any number, let's say 10). It is always going to be height 3 because log2 of any number from 8 to 15 is 3.
In a balanced binary tree the size of the problem to be solved is halved with every iteration. Thus roughly log2n iterations are needed to obtain a problem of size 1.
I hope this helps.

Let's assume at first that the tree is complete - it has 2^N leaf nodes. We try to prove that you need N recursive steps for a binary search.
With each recursion step you cut the number of candidate leaf nodes exactly by half (because our tree is complete). This means that after N halving operations there is exactly one candidate node left.
As each recursion step in our binary search algorithm corresponds to exactly one height level the height is exactly N.
Generalization to all balanced binary trees: If the tree has less nodes than 2^N we for sure don't need more halvings. We might need less or the same amount but never more.

Assuming that we have a complete tree to work with, we can say that at depth k, there are 2k nodes. You can prove this using simple induction, based on the intuition that adding an extra level to the tree will increase the number of nodes in the entire tree by the number of nodes that were in the previous level times two.
The height k of the tree is log(N), where N is the number of nodes. This can be stated as
log2(N) = k,
and it is equivalent to
N = 2k
To understand this, here's an example:
16 = 24 => log2(16) = 4
The height of the tree and the number of nodes are related exponentially. Taking the log of the number of nodes just allows you to work backwards to find the height.

Just look up the rigorous proof in Knuth, Volume 3 - Searching and Sorting Algorithms ... He does it far more rigorously than anyone else I can think of.
http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming
You can find it in any good Computer Science library and on the bookshelves of many (very) old geeks.

Why is the height of a balanced binary tree equal to ceil(log2N) for N nodes?
w = width of base (maximum number of leaves)
h = height of tree (maximum number of edges from root to leaf)
Divide w by 2 (h times) to get to 1, which counts the single root node at top.
N = w + w/2 + ... + 1
N = 2h + ... + 21 + 20
= (1-2h+1) / (1-2) = 2h+1-1
log2(N+1) = h+1
Check: if N=1, h=0. If h=1, N=3.
This formula is for if the bottom level is full. N will not always be so great, but would still have the same height, h. So we must take the log's ceiling.

What does Logn actually mean?

I am just studying for my class in Algorithms and have been looking over QuickSort. I understand the algorithm and how it works, but not how to get the number of comparisons it does, or what logn actually means, at the end of the day.
I understand the basics, to the extent of :
x=logb(Y) then
b^x = Y
But what does this mean in terms of algorithm performance? It's the number of comparisons you need to do, I understand that...the whole idea just seems so unintelligible though. Like, for QuickSort, each level K invocation involves 2^k invocations each involving sublists of length n/2^K.
So, summing to find the number of comparisons :
log n
Σ 2^k. 2(n/2^k) = 2n(1+logn)
k=0
Why are we summing up to log n ? Where did 2n(1+logn) come from? Sorry for the vagueness of my descriptions, I am just so confused.

If you consider a full, balanced binary tree, then layer by layer you have 1 + 2 + 4 + 8 + ... vertices. If the total number of vertices in the tree is 2^n - 1 then you have 1 + 2 + 4 + 8 + ... + 2^(n-1) vertices, counting layer by layer. Now, let N = 2^n (the size of the tree), then the height of the tree is n, and n = log2(N) (the height of the tree). That's what the log(n) means in these Big O expressions.

below is a sample tree:
1
/ \
2 3
/ \ / \
4 5 6 7
number of nodes in tree is 7 but high of tree is log 7 = 3, log comes when you have divide and conquer methods, in quick sort you divide list into 2 sublist, and continue this until rich small lists, divisions takes logn time (in average case), because the high of division is log n, partitioning in each level takes O(n), because in each level in average you partition N numbers, (may be there are too many list for partitioning, but average number of numbers is N in each level, in fact some of count of lists is N). So for simple observation if you have balanced partition tree you have log n time for partitioning, which means high of tree.

1 forget about b-trees for sec
here' math : log2 N = k is same 2^k=N .. its the definion of log
, it could be natural log(e) N = k aka e^k = n,, or decimal log10 N = k is 10^k = n
2 see perfect , balanced tree
1
1+ 1
1 + 1 + 1+ 1
8 ones
16 ones
etc
how many elements? 1+2+4+8..etc , so for 2 level b-tree there are 2^2-1 elements, for 3 level tree 2^3-1 and etc.. SO HERE'S MAGIC FORMULA: N_TREE_ELEMENTS= number OF levels^ 2 -1 ,or using definition of log : log2 number OF levels= number_of_tree_elements (Can forget about -1 )
3 lets say there's a task to find element in N elements b-tree, w/ K levels (aka height)
where how b-tree is constructed log2 height = number_of_tree elements
MOST IMPORTANT
so by how b-tree is constructed you need no more then 'height' OPERATIONS to find element in all N elements , or less.. so WHAT IS HEIGHT equals : log2 number_of_tree_elements..
so you need log2 N_number_of_tree_elements.. or log(N) for shorter

To understand what O(log(n)) means you might wanna read up on Big O notaion. In shot it means, that if your data set gets 1024 times bigger you runtime will only be 10 times longer (or less)(for base 2).
MergeSort runs in O(n*log(n)), which means it will take 10 240 times longer. Bubble sort runs in O(n^2), which means it will take 1024^2 = 1 048 576 times longer. So there are really some time to safe :)
To understand your sum, you must look at the mergesort algorithm as a tree:
sort(3,1,2,4)
/ \
sort(3,1) sort(2,4)
/ \ / \
sort(3) sort(1) sort(2) sort(4)
The sum iterates over each level of the tree. k=0 it the top, k= log(n) is the buttom. The tree will always be of height log2(n) (as it a balanced binary tree).
To do a little math:
Σ 2^k * 2(n/2^k) =
2 * Σ 2^k * (n/2^k) =
2 * Σ n*2^k/2^k =
2 * Σ n =
2 * n * (1+log(n)) //As there are log(n)+1 steps from 0 to log(n) inclusive
This is of course a lot of work to do, especially if you have more complex algoritms. In those situations you get really happy for the Master Theorem, but for the moment it might just get you more confused. It's very theoretical so don't worry if you don't understand it right away.

For me, to understand issues like this, this is a good way to think about it.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio