Insertion into a Binary Heap: Number of exchanges in worst case - algorithm

I was going through Cormen's 'Algorithms Unlocked'. In chapter 6 on shortest-path algorithms, on inserting data into a binary heap, I find this: "Since the path to the root has at most floor(lg(n)) edges, at most floor(lg(n))-1 exchanges occur, and so INSERT takes O(lg(n)) time." Now, I know the resulting complexity of insertion into a binary heap is as stated, but shouldn't the number of exchanges in the worst case be floor(lg(n)) rather than floor(lg(n))-1? The book's errata says nothing about this, so I was wondering if I missed something.

You can easily show that it's floor(lg(n)). Consider this binary heap:
  3
 / \
5   7
To insert the value 1, you first add it to the end of the heap:
      3
     / \
    5   7
   /
  1
So there are 4 items in the heap. It's going to take two swaps to move the item 1 to the root. floor(lg(4)) is equal to 2.

floor(lg(n)) is the correct expression for the maximum number of edges on a path between a leaf and the root, and when you do swaps, you may end up doing one swap for each edge. So floor(lg(n)) is the correct answer for the worst-case number of swaps. The author most likely confused the number of edges on the path with the number of VERTICES on the path. If there are V vertices on the path between the leaf and the root, then V-1 is the number of edges, so V-1 is the number of swaps you might do in the worst case.
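
To make the counting concrete, here is a minimal sketch of insertion into a 0-indexed array-based min-heap that counts the exchanges while sifting the new item up (the code and names are mine, not from the book):

    import math

    def insert(heap, value):
        """Append value to a 0-indexed array-based min-heap, sift it up,
        and return the number of exchanges performed."""
        heap.append(value)
        i = len(heap) - 1
        swaps = 0
        while i > 0:
            parent = (i - 1) // 2
            if heap[parent] <= heap[i]:
                break
            heap[parent], heap[i] = heap[i], heap[parent]
            i = parent
            swaps += 1
        return swaps

    heap = [3, 5, 7]                        # the example heap above
    swaps = insert(heap, 1)                 # 1 bubbles past 5 and then 3
    n = len(heap)                           # n = 4 after the insert
    print(swaps, math.floor(math.log2(n)))  # prints: 2 2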

Related

How do you find the minimal set of swaps that will sort a known list?

It is known that sorting an unknown list can't be done in fewer than N * log(N) steps on average. But what about the problem of finding the optimal sorting for a known list? That is, suppose you have the following list:
[1,3,2,7,4]
In that case, only 2 swaps leave it sorted:
swap 1 2
swap 3 4
Which is much less than 5 * log 5. How do I find the minimal set of swaps that will leave a specific list sorted?
Note: this question is very similar to my previous one, except without stack machines.
This question becomes much easier once you turn the permutation into a cycle decomposition.
For your example, the cycle decomposition using zero-based indices is (0)(2 1)(4 3). Each cycle of length k requires k-1 swaps to put its elements into the correct positions, so the size of the minimal set of swaps is the sum of (cycle length - 1) over all cycles. The swaps themselves are found by identifying the cycles and swapping each element in a cycle with the next one in the cycle.
The complexity of this approach is O(n log n) to find the rank of each element plus O(n) to find the cycle decomposition.
This answer assumes you can swap an arbitrary pair of elements.
If you can only swap adjacent elements, you need to count the number of inversions in the array; see counting inversions.
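
For the arbitrary-swap case described above, here is a short sketch of the cycle idea, assuming all elements are distinct (the function name and variables are mine):

    def minimal_swaps(a):
        """Return a minimal list of (i, j) index swaps that sorts a.
        Assumes distinct elements and that any pair of positions may be swapped."""
        a = list(a)
        pos = {v: i for i, v in enumerate(sorted(a))}  # value -> its sorted position
        swaps = []
        for i in range(len(a)):
            # Resolve the cycle containing position i: keep swapping until
            # position i holds the element that belongs there.
            while pos[a[i]] != i:
                j = pos[a[i]]
                swaps.append((i, j))
                a[i], a[j] = a[j], a[i]
        return swaps

    print(minimal_swaps([1, 3, 2, 7, 4]))  # prints: [(1, 2), (3, 4)]

Each swap places at least one element in its final position, so the total number of swaps comes out to n minus the number of cycles, i.e. the sum of (cycle length - 1) over all cycles.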

What is the space requirement of many trees?

I was asked the space requirements of my project and I wasn't sure about the answer, so I am asking here. Here is what I do:
I am building a number of perfect binary trees (let's say m).
Every leaf indexes one point (i.e. the data is stored at the leaves, and every tree indexes the points stored in it).
My thought is that the space requirement is O(n*d + m), where n is the number of points, d is the dimension of a point, and m is the number of trees, but I am saying this from experience and am not sure I fully understand it!
Can anyone help?
To be honest, every leaf contains a number of points, p, but I think I will be able to work out the result if I get an answer to my question above.
In a perfect binary tree with n leaves, the total number of nodes is 2n - 1, which is O(n). More generally, if you have a collection of m perfect binary trees with n total leaves, the total number of nodes will be 2n - m, which is still O(n). Therefore, if each of the n leaf nodes stores a d-dimensional point, the total space usage is O(nd).
The number of trees m here actually doesn't need to show up in the big-O space analysis. Think of it this way: if you have m trees, each nonempty, then each tree contributes at least one leaf, so m ≤ n and therefore m = O(n). Therefore, even if you account for the per-tree overhead, which totals O(m), the total space usage O(nd + m) is still O(nd).
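
As a quick sanity check of that count, here is a tiny sketch that tallies leaves and nodes for a hypothetical forest of perfect binary trees (the heights are arbitrary example values):

    # A perfect binary tree of height h has 2**h leaves and 2**(h + 1) - 1 nodes.
    def forest_counts(heights):
        leaves = sum(2 ** h for h in heights)
        nodes = sum(2 ** (h + 1) - 1 for h in heights)
        return leaves, nodes

    m = 3                                     # three trees, heights picked arbitrarily
    leaves, nodes = forest_counts([2, 3, 0])
    print(leaves, nodes, 2 * leaves - m)      # prints: 13 23 23, i.e. nodes = 2n - m = O(n)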
Hope this helps!

Huffman code: shortest and longest code for Fibonacci frequencies

What would be the length of the shortest code and the longest Huffman code for n characters with Fibonacci frequencies?
From what I understand - if we build the tree, it will look like one long branch with a single leaf hanging off each internal node, from the root down to the lowest leaf. When we have combined the n-2 smallest frequencies into one node, that node's frequency is F[n]-1 (since the sum of the first k Fibonacci numbers is F[k+2]-1), and F[n-1] < F[n]-1 < F[n]. So this combined node and F[n-1] are the two least remaining values and get merged next, which, by induction, applies to all the frequencies.
The tree we create is clearly an unbalanced tree, which, I assume, is not good.
If this is the optimal way to create a tree, what would be the length of the longest way to create it? If it is not the optimal way, then what would be the length of the shortest way?
I am new to computer science and I would really appreciate a good explanation.
The shortest code would be length 1 and the longest would be length n-1. There would be two symbols with length n-1, and there is one symbol for each length in 1..n-2.
There is only one optimal tree, and that's it. There is nothing bad about it being unbalanced. In fact it has to be that way to use the least number of bits to code those symbols with those frequencies.
I have no idea what you mean by the "shortest" or "longest" way.
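
If it helps, here is a small sketch that builds the Huffman tree with Python's heapq and reports the code length of each symbol; the particular Fibonacci frequencies and the tie-breaking scheme are just my choices for the example:

    import heapq

    def huffman_code_lengths(freqs):
        """Return the Huffman code length of each symbol for the given frequencies."""
        # Heap entries: (subtree weight, tie-breaker, symbols in that subtree).
        heap = [(f, i, [i]) for i, f in enumerate(freqs)]
        heapq.heapify(heap)
        depth = [0] * len(freqs)
        tie = len(freqs)
        while len(heap) > 1:
            w1, _, s1 = heapq.heappop(heap)
            w2, _, s2 = heapq.heappop(heap)
            for sym in s1 + s2:   # every symbol in the merged subtree sinks one level deeper
                depth[sym] += 1
            heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
            tie += 1
        return depth

    fib = [1, 1, 2, 3, 5, 8, 13, 21]        # Fibonacci frequencies, n = 8
    print(huffman_code_lengths(fib))        # prints: [7, 7, 6, 5, 4, 3, 2, 1]

The output matches the claim above: the shortest code has length 1, the longest has length n-1 = 7, two symbols share the longest length, and each length in between occurs exactly once.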

Why lookup in a Binary Search Tree is O(log(n))?

I can see how, when looking up a value in a BST, we leave out half the tree every time we compare a node with the value we are looking for.
However, I fail to see why the time complexity is O(log(n)). So, my question is:
If we have a tree of N elements, why is the time complexity of looking up the tree and checking whether a particular value exists O(log(n)), and how do we get that?
Your question seems to be well answered here, but to summarise in relation to your specific question, it might be better to think of it in reverse: "what happens to the BST solution time as the number of nodes goes up?"
Essentially, in a BST every time you double the number of nodes you only increase the number of steps to solution by one. To extend this, four times the nodes gives two extra steps. Eight times the nodes gives three extra steps. Sixteen times the nodes gives four extra steps. And so on.
The base 2 log of the first number in these pairs is the second number in these pairs. It's base 2 log because this is a binary search (you halve the problem space each step).
For me the easiest way was to look at a graph of log2(n), where n is the number of nodes in the binary tree. As a table this looks like:
log2(n)  = d
log2(1)  = 0
log2(2)  = 1
log2(4)  = 2
log2(8)  = 3
log2(16) = 4
log2(32) = 5
log2(64) = 6
and then I draw a little binary tree, this one goes from depth d=0 to d=3:
d=0            O
           /       \
d=1       R         B
         / \       / \
d=2     R   B     R   B
       / \ / \   / \ / \
d=3   R  B R  B R  B R  B
So when the number of nodes n in the tree roughly doubles (e.g. n goes from 7 to 15, which is almost a doubling, as the depth goes from d=2 to d=3), the additional amount of processing (or time) required increases by only 1 extra computation (or iteration), because the amount of processing is related to d.
We go down only 1 additional level of depth, from d=2 to d=3, to find the node we want out of all n nodes after doubling the number of nodes, because at each level we only need to search the half of the tree that can contain the node we want.
We can write this as d = log2(n), where d tells us how much computation (how many iterations) we need to do (on average) to reach any node in the tree, when there are n nodes in the tree.
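
If you want to see that doubling behaviour concretely, here is a small sketch that simulates worst-case lookups in a perfectly balanced BST by doing binary search on a sorted array (the sizes and helper names are just for the example):

    def steps_to_find(values, target):
        """Binary search on a sorted array, counting how many nodes are examined.
        This mirrors a lookup in a perfectly balanced BST."""
        lo, hi, steps = 0, len(values) - 1, 0
        while lo <= hi:
            steps += 1
            mid = (lo + hi) // 2
            if values[mid] == target:
                return steps
            elif values[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return steps

    for d in range(1, 7):
        n = 2 ** d - 1                      # perfect-tree sizes: 1, 3, 7, 15, 31, 63
        values = list(range(n))
        worst = max(steps_to_find(values, t) for t in values)
        print(n, worst)                     # the worst case grows by 1 each time n doubles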
This can be shown mathematically very easily.
Before I present that, let me clarify something. The complexity of lookup or find in a balanced binary search tree is O(log(n)). For a binary search tree in general, it is O(n). I'll show both below.
In a balanced binary search tree, in the worst case, the value I am looking for is in a leaf of the tree. I'll basically traverse from the root to that leaf, looking at each layer of the tree only once, due to the ordered structure of BSTs. Therefore, the number of comparisons I need to do is the number of layers of the tree. Hence the problem boils down to finding a closed-form expression for the number of layers of a tree with n nodes.
This is where we'll do a simple induction. A tree with only 1 layer has only 1 node. A tree of 2 layers has 1+2 nodes, a tree of 3 layers has 1+2+4 nodes, etc. The pattern is clear: a tree with k layers has exactly
n=2^0+2^1+...+2^{k-1}
nodes. This is a geometric series, which implies
n=2^k-1,
equivalently:
k = log2(n+1)
We know that big-O is concerned with large values of n, so constants (and the base of the logarithm) are irrelevant. Hence the O(log(n)) complexity.
I'll give another -much shorter- way to show the same result. Since while looking for a value we constantly split the tree into two halves, and we have to do this k times, where k is number of layers, the following is true:
(n+1)/2^k = 1,
which implies the exact same result. You have to convince yourself about where that +1 in n+1 is coming from, but it is okay even if you don't pay attention to it, since we are talking about large values of n.
Now let's discuss the general binary search tree. In the worst case, it is perfectly unbalanced, meaning all of its nodes have only one child (so it degenerates into a linked list); see e.g. https://www.cs.auckland.ac.nz/~jmor159/PLDS210/niemann/s_fig33.gif
In this case, to find the value in the leaf, I need to iterate on all nodes, hence O(n).
A final note is that these complexities hold true for not only find, but also insert and delete operations.
(I'll edit my equations with better-looking Latex math styling when I reach 10 rep points. SO won't let me right now.)
Whenever you see a runtime that has an O(log n) factor in it, there's a very good chance that you're looking at something of the form "keep dividing the size of some object by a constant." So probably the best way to think about this question is - as you're doing lookups in a binary search tree, what exactly is it that's getting cut down by a constant factor, and what exactly is that constant?
For starters, let's imagine that you have a perfectly balanced binary tree, something that looks like this:
           *
       /       \
      *         *
     / \       / \
    *   *     *   *
   / \ / \   / \ / \
  *  * *  * *  * *  *
At each point in doing the search, you look at the current node. If it's the one you're looking for, great! You're totally done. On the other hand, if it isn't, then you either descend into the left subtree or the right subtree and then repeat this process.
If you walk into one of the two subtrees, you're essentially saying "I don't care at all about what's in that other subtree." You're throwing all the nodes in it away. And how many nodes are in there? Well, with a quick visual inspection - ideally one followed up with some nice math - you'll see that you're tossing out about half the nodes in the tree.
This means that at each step in a lookup, you either (1) find the node that you're looking for, or (2) toss out half the nodes in the tree. Since you're doing a constant amount of work at each step, you're looking at the hallmark behavior of O(log n) behavior - the work drops by a constant factor at each step, and so it can only do so logarithmically many times.
Now of course, not all trees look like this. AVL trees have the fun property that each time you descend down into a subtree, you throw away roughly a golden ratio fraction of the total nodes. This therefore guarantees you can only take logarithmically many steps before you run out of nodes - hence the O(log n) height. In a red/black tree, each step throws away (roughly) a quarter of the total nodes, and since you're shrinking by a constant factor you again get the O(log n) lookup time you'd like. The very fun scapegoat tree has a tuneable parameter that's used to determine how tightly balanced it is, but again you can show that every step you take throws away some constant factor based on this tuneable parameter, giving O(log n) lookups.
However, this analysis breaks down for imbalanced trees. If you have a purely degenerate tree - one where every node has exactly one child - then every step down the tree that you take only tosses away a single node, not a constant fraction. That means that the lookup time gets up to O(n) in the worst case, since the number of times you can subtract a constant from n is O(n).
If we have a tree of N elements, why is the time complexity of looking up the tree and checking whether a particular value exists O(log(n)), and how do we get that?
That's not true. By default, a lookup in a Binary Search Tree is not O(log(n)), where n is the number of nodes. In the worst case, it can become O(n). For instance, if we insert the values n, n - 1, ..., 1 (in that order), then the tree will look like this:
          n
         /
      n - 1
       /
    n - 2
     /
    ...
   /
  1
A lookup for a node with value 1 has O(n) time complexity.
To make lookups more efficient, the tree must be balanced so that its maximum height is proportional to log(n). In such a case, the time complexity of a lookup is O(log(n)), because finding any leaf is bounded by log(n) operations.
But again, not every Binary Search Tree is a Balanced Binary Search Tree. You must balance it to guarantee the O(log(n)) time complexity.
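
To see the contrast concretely, here is a small sketch that builds a degenerate BST (descending insertions) and a balanced BST over the same keys and counts the comparisons a lookup makes; the Node class and helper names are my own:

    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def insert(root, key):
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    def build_balanced(keys):
        """Insert the median first, recursively, to get a balanced BST from sorted keys."""
        if not keys:
            return None
        mid = len(keys) // 2
        root = Node(keys[mid])
        root.left = build_balanced(keys[:mid])
        root.right = build_balanced(keys[mid + 1:])
        return root

    def lookup_steps(root, key):
        steps, node = 0, root
        while node is not None:
            steps += 1
            if key == node.key:
                break
            node = node.left if key < node.key else node.right
        return steps

    n = 511
    degenerate = None
    for k in range(n, 0, -1):            # inserting n, n-1, ..., 1 gives a left-leaning chain
        degenerate = insert(degenerate, k)
    balanced = build_balanced(list(range(1, n + 1)))

    print(lookup_steps(degenerate, 1))   # prints: 511 -> O(n) comparisons
    print(lookup_steps(balanced, 1))     # prints: 9   -> about log2(n) comparisons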

Check if 2 tree nodes are related (ancestor/descendant) in O(1) with pre-processing

Check if 2 tree nodes are related (i.e. ancestor-descendant)
solve it in O(1) time, with O(N) space (N = # of nodes)
pre-processing is allowed
That's it. I'll describe my solution (approach) below. Please stop here if you want to think about it yourself first.
For pre-processing, I decided to do a pre-order traversal (recursively visit the root first, then the children) and give a label to each node.
Let me explain the labels in detail. Each label will consist of comma-separated natural numbers like "1,2,1,4,5" - the length of this sequence equals (the depth of the node + 1). E.g. the label of the root is "1", the root's children will have labels "1,1", "1,2", "1,3", etc. Next-level nodes will have labels like "1,1,1", "1,1,2", ..., "1,2,1", "1,2,2", ...
Assume that "the order number" of a node is the "1-based index of this node" in the children list of its parent.
Common rule: node's label consists of its parent label followed by comma and "the order number" of the node.
Thus, to answer whether two nodes are related (i.e. ancestor-descendant) in O(1), I'll check whether the label of one of them is a prefix of the other's label. Though I'm not sure such labels can be considered to occupy O(N) space.
Any criticism with fixes, or an alternative approach, is welcome.
You can do it in O(n) preprocessing time, and O(n) space, with O(1) query time, if you store the preorder number and postorder number for each vertex and use this fact:
For two given nodes x and y of a tree T, x is an ancestor of y if and only if x occurs before y in the preorder traversal of T and after y in the post-order traversal.
(From this page: http://www.cs.arizona.edu/xiss/numbering.htm)
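
As an illustration of that fact, here is a small sketch that assigns preorder and postorder numbers in one O(n) pass and answers (non-strict) ancestor queries in O(1); the tree representation (a dict of children lists plus a known root) and the function names are my own assumptions:

    def preprocess(children, root):
        """Assign a preorder and a postorder number to every node (iterative DFS, O(n))."""
        pre, post = {}, {}
        pre_counter = post_counter = 0
        stack = [(root, False)]
        while stack:
            node, processed = stack.pop()
            if processed:
                post[node] = post_counter
                post_counter += 1
            else:
                pre[node] = pre_counter
                pre_counter += 1
                stack.append((node, True))
                for child in reversed(children.get(node, [])):
                    stack.append((child, False))
        return pre, post

    def is_ancestor(pre, post, x, y):
        """x is a (non-strict) ancestor of y iff x starts before y in preorder
        and finishes after y in postorder."""
        return pre[x] <= pre[y] and post[x] >= post[y]

    #        1
    #       / \
    #      2   3
    #     / \
    #    4   5
    children = {1: [2, 3], 2: [4, 5]}
    pre, post = preprocess(children, root=1)
    print(is_ancestor(pre, post, 1, 5))  # prints: True
    print(is_ancestor(pre, post, 3, 4))  # prints: False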
What you did is Theta(d) in the worst case, where d is the depth of the higher node, so it is not O(1). The space is also not O(N).
If you consider a tree where a node has n/2 children (say), the running time of setting the labels can be as high as O(n*n). So this labeling scheme won't work.
There are linear-time lowest common ancestor algorithms (at least offline). For instance, have a look here. You can also have a look at Tarjan's offline LCA algorithm. Please note that these articles require that you know in advance the pairs for which you will be performing the LCA. I think there are also online algorithms with linear precomputation time, but they are very complex. For instance, there is a linear-precomputation-time algorithm for the range minimum query problem. As far as I remember, that solution passes through the LCA problem twice. The problem with the algorithm is that it has such a large constant that it requires an enormous input to actually be faster than the O(n*log(n)) algorithm.
There is a much simpler approach that requires O(n*log(n)) additional memory and again answers in constant time.
Hope this helps.
