Data structure used in Log server implementation for Certificate Transparency

I attended a security seminar recently which talked about creating log servers to catch any inconsistencies in certificates distributed by the various CAs. Here: http://www.certificate-transparency.org/log-proofs-work
The logs will be append-only, hash-based Merkle trees. Whenever a CA issues a new certificate, it will pass the information to the log server, which will append the certificate to the existing tree. As described in the link, new certificates will be added on a daily or better weekly basis. Whenever a group of new certificates has to be added, it will be added as the right node of the root. The left node of the root will be the original Merkle tree. The root will be a hash of the left and right subtrees. Given this approach, the tree will become highly unbalanced to the left over time. My question is: is there a better data structure for implementing this functionality, so that the tree remains balanced as it grows?

At any time all left nodes in the tree are balanced. See how figure 2 changes to form figure 3 as nodes d6 and d7 are added. m and k are no longer peers as node n is added above k and then m and k are hashed together to form the root.
So whenever the number of leaf nodes is 2^n, the entire tree is balanced. At other times, the left subtree of every node is balanced, but the right subtree may not be yet.
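To make that shape concrete, here is a minimal sketch of the tree hash as I understand it from RFC 6962, the spec that Certificate Transparency logs follow: a list of n > 1 leaves is split at the largest power of two strictly smaller than n, so the left subtree is always a perfect tree. The function names (split_point, merkle_root, depth) are my own, not from the spec.

```python
import hashlib

def split_point(n):
    """Largest power of two strictly less than n (requires n >= 2)."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k

def merkle_root(leaves):
    """Merkle tree hash over a list of byte strings, RFC 6962 style."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    if len(leaves) == 1:
        # Leaf hashes are prefixed with 0x00 ...
        return hashlib.sha256(b"\x00" + leaves[0]).digest()
    k = split_point(len(leaves))
    left = merkle_root(leaves[:k])    # always a perfect subtree of k leaves
    right = merkle_root(leaves[k:])   # the remainder, split the same way
    # ... and interior hashes with 0x01.
    return hashlib.sha256(b"\x01" + left + right).digest()

def depth(n):
    """Edges from the root to the deepest leaf in a tree of n leaves."""
    if n <= 1:
        return 0
    k = split_point(n)
    return 1 + max(depth(k), depth(n - k))
```

With this split rule, depth(1000) is 10, so the tree stays logarithmic in height rather than degenerating into a long chain.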
Hope that helps.

Related

Splay Tree Deletion

I'm having trouble conceptualising the process of deletion from a splay tree. Given this initial tree, I want to delete the node 78.
Based on the information from my course (derived from Goodrich, Tamassia and Goldwasser), the deleted node in a BST should be replaced by the next node reached by performing an in-order traversal from the node which should be 91. This node should then be splayed to the top of the tree. However, this is not the case as shown on this visualiser here. https://www.cs.usfca.edu/~galles/visualization/SplayTree.html

The visualizer replaced 78 by its in order predecessor (70) instead and splayed that node. (The in order successor, i.e., the next key in sorted order is 83, not 91.) In general, splay trees are wonderfully malleable and as long as you approximately halve the length of the path you just descended while making every other path at most a little bit longer, you’re doing it right from an asymptotic performance standpoint (your professor may have different ideas, however).

Your textbook description:
the deleted node in a BST should be replaced by the next node reached by performing an in-order traversal from the node which should be 91
That description applies to unbalanced BST (binary search trees) but does not apply to most of the various kinds of balanced binary trees, and also does not apply to Splay Trees. To delete a node in a splay tree do the following:
Splay the node to be deleted to the root and dispose of it. This leaves two trees, call the left tree A and the right tree B.
The new root of the recombined tree will come from A. Splay the largest (rightmost) node of A tree to its root.
Because A's new root has the greatest key in A, it has no right child. Set the right child of A's new root to B.
A is the new combined tree.
This is what the visualization at https://www.cs.usfca.edu/%7Egalles/visualization/SplayTree.html did.
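The steps above can be sketched roughly as follows. This is a compact recursive splay, not the code behind the visualizer; the Node class and function names are made up for illustration, and keys are assumed to be distinct.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(x):
    y = x.left
    x.left, y.right = y.right, x
    return y

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y

def splay(root, key):
    """Bring key (or the last node on its search path) to the root."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:                      # zig-zig
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:                    # zig-zag
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)
    if root.right is None:
        return root
    if key > root.right.key:                         # zig-zig
        root.right.right = splay(root.right.right, key)
        root = rotate_left(root)
    elif key < root.right.key:                       # zig-zag
        root.right.left = splay(root.right.left, key)
        if root.right.left is not None:
            root.right = rotate_right(root.right)
    return root if root.right is None else rotate_left(root)

def delete(root, key):
    root = splay(root, key)             # step 1: splay the doomed node up
    if root is None or root.key != key:
        return root                     # key was not in the tree
    a, b = root.left, root.right        # dispose of the root: trees A and B
    if a is None:
        return b
    a = splay(a, key)                   # key > everything in A, so A's max rises
    a.right = b                         # A's new root has no right child
    return a

def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)
```

Deleting 78 from a tree where 70 is the largest key below it leaves 70 at the root, which matches the predecessor behaviour described above.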
You said in comments to the other answer:
So in practice, the node that you choose to replace the deleted node doesn't really matter, i.e. affect performance etc.
In the typical splay tree deletion algorithm the node to replace will be the predecessor or successor node, in key order.
The rule of thumb is to always splay whenever a specific node is accessed. Find the node to delete, then splay it to the root. Find its predecessor, then splay it to the root. There are variations where you can splay less aggressively, too.

What is a zip tree, and how does it work?

I've heard of a new balanced BST data structure called a zip tree. What is the zip tree? How does it work?

At a high level, a zip tree is a
randomized balanced binary search tree,
that is a way of encoding a skiplist as a BST, and
that uses a pair of operations called zipping and unzipping rather than tree rotations.
The first bullet point - that zip trees are randomized, balanced BSTs - gives a feel for what a zip tree achieves at a high level. It's a type of balanced binary search tree that, like treaps and unlike red/black trees, uses randomization to balance the tree. In that sense, a zip tree isn't guaranteed to be a balanced tree, but rather has a very high probability of being balanced.
The second bullet point - that zip trees are encodings of skiplists - shows where zip trees come from and why, intuitively, they're balanced. You can think of a zip tree as a way of taking the randomized skiplist data structure, which supports all major operations in expected time O(log n), and representing it as a binary search tree. This provides the intuition for where zip trees come from and why we'd expect them to be so fast.
The third bullet point - zip trees use zipping and unzipping rather than tree rotations - accounts for the name of the zip tree and what it feels like to code one up. Zip trees differ from other types of balanced trees (say, red/black trees or AVL trees) in that nodes are moved around the tree not through rotations, but through a pair of operations that convert a larger chain of nodes into two smaller chains or vice-versa.
The rest of this answer dives deeper into where zip trees come from, how they work, and how they're structured.
Review: Skip Lists
To understand where zip trees come from, let's begin with a review of another data structure, the skiplist. A skiplist is a data structure that, like a binary search tree, stores a collection of elements in sorted order. Skiplists, however, aren't tree structures. Rather, a skiplist works by storing elements in sorted order through several layers of linked lists. A sample skiplist is shown here:
As you can see, the elements are represented in sorted order. Each element has an associated height, and is part of a number of linked lists equal to its height. All of the elements of the skiplist participate in the bottom layer. Ideally, roughly half of the nodes will be in the layer above that, roughly a quarter of the nodes will be in the layer above that, roughly an eighth of the nodes will be in the layer above that, etc. (More on how this works later on.)
To do a lookup in a skiplist, we begin in the topmost layer. We walk forward in the skiplist until either (1) we find the element we're looking for, (2) we find an element bigger than the one we're looking for, or (3) we hit the end of the list. In the first case, we uncork the champagne and celebrate because we discovered the item we were searching for and there's nothing more to do. In the second or third case, we've "overshot" the element that we're looking for. But that's nothing to worry about - in fact, that's helpful because it means that what we're looking for must be between the node we hit that "overshot" and the node that comes before it. So we'll go to the previous node, drop down one layer, and pick up our search from there.
For example, here's how we'd do a search for 47:
Here, the blue edges indicate the links followed where we moved forward, and the red edges indicate where we overshot and decided to descend down a layer.
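If it helps to see that search as code, here's a minimal sketch. It builds a skiplist from (key, height) pairs that are already sorted, purely so there's something to search; SkipNode, build_skiplist and search are names I made up, and the input is assumed to be non-empty.

```python
class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height     # next[i] is the successor in layer i

def build_skiplist(items):
    """items: non-empty list of (key, height) pairs, sorted by key."""
    max_height = max(h for _, h in items)
    head = SkipNode(None, max_height)   # sentinel that sits in every layer
    last = [head] * max_height          # most recent node seen in each layer
    for key, height in items:
        node = SkipNode(key, height)
        for level in range(height):
            last[level].next[level] = node
            last[level] = node
    return head

def search(head, key):
    node = head
    # Start in the topmost layer and work downward.
    for level in reversed(range(len(head.next))):
        # Walk forward while the next key is still too small.
        while node.next[level] is not None and node.next[level].key < key:
            node = node.next[level]
        nxt = node.next[level]
        if nxt is not None and nxt.key == key:
            return True                 # case (1): found it
        # Cases (2) and (3): we overshot or hit the end, so drop a layer.
    return False
```

Note that the "go back to the previous node" step is implicit here: the loop peeks at the next node before stepping onto it, so it never actually moves past the target.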
A powerful intuition for how skiplists work - which we'll need later on as we transition to zip trees - is that the topmost layer of the skiplist partitions the remaining elements of the skiplist into different ranges. You can see this here:
Intuitively, a skiplist search will be "fast" if we're able to skip looking at most of the elements. Imagine, for example, that the second-to-last layer of the skiplist only stores every other element of the skiplist. In that case, traversing the second-to-last layer is twice as fast as traversing the bottom layer, so we'd expect a lookup starting in the second-to-last layer to take half as much time as a lookup starting in the bottom layer. Similarly, imagine that the layer above that one only stores every other element from the layer below it. Then searching in that layer will take roughly half as much time as searching the layer below it. More generally, if each layer only stores roughly half the elements of the layer below it, then we could skip past huge amounts of the elements in the skiplist during a search, giving us good performance.
The skiplist accomplishes this by using the following rule: whenever we insert an element into the skiplist, we flip a coin until we get heads. We then set the height of the newly-inserted node to be the number of coins that we ended up tossing. This means it has a 50% chance to stay in its current layer and a 50% chance to move to the layer above it, which means, in aggregate, that roughly half the nodes will only be in the bottom layer, roughly half of what's left will be one layer above that, roughly half of what's left will be one layer above that, etc.
(For those of you with a math background, you could also say that the height of each node in the skiplist is a Geom(1/2) random variable.)
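As a quick sketch, the coin-flipping rule fits in a few lines. The function name random_height is my own, and I'm treating random() < 0.5 as "tails, flip again".

```python
import random

def random_height(rng=random):
    """Flip coins until heads; return the total number of flips."""
    height = 1                      # the flip that finally lands heads
    while rng.random() < 0.5:       # tails: flip again
        height += 1
    return height
```

Sampling this many times gives heights that average out to about 2, with roughly half of the samples equal to 1, which is the "half the nodes only in the bottom layer" behavior described above.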
Here's an example of inserting 42 into the skiplist shown above, using a height of 1:
Deletion from a skiplist is also a fairly simple operation: we simply splice it out of whatever linked lists it happens to be in. That means that if we were to delete the 42 we just inserted from the above list, we'd end up with the same skiplist that we started with.
It can be shown that the expected cost of an insertion, deletion, or lookup in a skiplist is O(log n), based on the fact that the number of items in each list is roughly half the number of items in the one below it. (That means we'd expect to see O(log n) layers, and only take a constant number of steps in each layer.)
From Skiplists to Zip Trees
Now that we've reviewed skiplists, let's talk about where the zip tree comes from.
Let's imagine that you're looking at the skiplist data structure. You really like the expected O(log n) performance of each operation, and you like how conceptually simple it is. There's just one problem - you really don't like linked lists, and the idea of building something with layers upon layers of linked lists doesn't excite you. On the other hand, you really love binary search trees. They've got a really simple structure - each node has just two pointers leaving it, and there's a simple rule about where everything gets placed. This question then naturally arises: could you get all the benefits of a skiplist, except in BST form?
It turns out that there's a really nice way to do this. Let's imagine that you have the skiplist shown here:
Now, imagine you perform a lookup in this skiplist. How would that search work? Well, you'd always begin by scanning across the top layer of the skiplist, moving forward until you found a key that was bigger than the one you were looking for, or until you hit the end of the list and found that there were no more nodes at the top level. From there, you'd then "descend" one level into a sub-skiplist containing only the keys between the last node you visited and the one that overshot.
It's possible to model this exact same search as a BST traversal. Specifically, here's how we might represent the top layer of that skiplist as a BST:
Notice that all these nodes chain to the right, with the idea being that "scanning forward in the skiplist" corresponds to "visiting larger and larger keys." In a BST, moving from one node to a larger node corresponds to moving right, hence the chain of nodes to the right.
Now, each node in a BST can have up to two children, and in the picture shown above each node has either zero children or one child. If we fill in the missing children by marking what ranges they correspond to, we get this.
And hey, wait a minute! It sure looks like the BST is partitioning the space of keys the same way that the skiplist is. That's promising, since it suggests that we're on to something here. Plus, it gives us a way to fill in the rest of the tree: we can recursively convert the subranges of the skiplist into their own BSTs and glue the whole thing together. If we do that, we get this tree encoding the skiplist:
We now have a way of representing a skiplist as a binary search tree. Very cool!
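Here's a minimal sketch of that recursive conversion, assuming the skiplist is handed over as a list of (key, height) pairs sorted by key. The first element with the maximum height becomes the root, everything before it forms the left subtree, and everything after it forms the right subtree. The Node class, its rank field (which just remembers the skiplist height) and the function name are my own choices.

```python
class Node:
    def __init__(self, key, rank):
        self.key, self.rank = key, rank
        self.left = self.right = None

def skiplist_to_bst(items):
    """items: list of (key, height) pairs sorted by key."""
    if not items:
        return None
    top = max(height for _, height in items)
    # The first node in the topmost layer starts the right-leaning chain.
    i = next(idx for idx, (_, height) in enumerate(items) if height == top)
    key, height = items[i]
    node = Node(key, height)
    node.left = skiplist_to_bst(items[:i])        # the range before it
    node.right = skiplist_to_bst(items[i + 1:])   # rest of the chain and its ranges
    return node
```

The slicing makes this quadratic in the worst case, which is fine for illustration but not how you'd want to build a large tree.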
Now, could we go the other way around? That is, could we go from a BST to a skiplist? In general, there's no one unique way to do this. After all, when we converted the skiplist to a BST, we did lose some information. Specifically, each node in the skiplist has an associated height, and while each node in our BST has a height as well it's not closely connected to the skiplist node heights. To address this, let's tag each BST node with the height of the skiplist node that it came from. This is shown here:
Now, some nice patterns emerge. For starters, notice that each node's associated number is bigger than its left child's number. That makes sense, since each step to the left corresponds to descending into a subrange of the skiplist, where nodes will have lower heights. Similarly, each node's associated number is greater than or equal to the number of its right child. And that again makes sense - moving to the right either means
continuing forward at the same level that we were already on, in which case the height remains the same, or
hitting the end of a range and descending into a subrange, in which case the height decreases.
Can we say more about the shape of the tree? Sure we can! For example, in a skiplist, each node's height is picked by flipping coins until we get heads, then counting how many total coins we flipped. (Or, as before, it's geometrically distributed with probability 1/2). So if we were to imagine building a BST that corresponded to a skiplist, we'd want the numbers assigned to the nodes to work out the same way.
Putting these three rules together, we get the following, which defines the shape of our tree, the zip tree!
A zip tree is a binary search tree where
Each node has an associated number called its rank. Ranks are assigned randomly to each node by flipping coins until heads is flipped, then counting how many total coins were tossed.
Each node's rank is strictly greater than its left child's rank.
Each node's rank is greater than or equal to its right child's rank.
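One way to internalize these rules is to write a checker for them. This is just a sketch with a made-up Node class and function name; it checks the BST ordering and the two rank rules, but not that the ranks were actually drawn by coin flips.

```python
class Node:
    def __init__(self, key, rank, left=None, right=None):
        self.key, self.rank = key, rank
        self.left, self.right = left, right

def is_zip_tree(node, lo=None, hi=None):
    """True if the subtree is a BST within (lo, hi) and obeys the rank rules."""
    if node is None:
        return True
    if lo is not None and node.key <= lo:
        return False
    if hi is not None and node.key >= hi:
        return False
    # Rank must be strictly greater than the left child's rank ...
    if node.left is not None and node.left.rank >= node.rank:
        return False
    # ... and greater than or equal to the right child's rank.
    if node.right is not None and node.right.rank > node.rank:
        return False
    return (is_zip_tree(node.left, lo, node.key)
            and is_zip_tree(node.right, node.key, hi))
```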
It's amazing how something like a skiplist can be represented as a BST by writing out such simple rules!
Inserting Elements: Unzipping
Let's suppose you have a zip tree. How would you insert a new element into it?
We could in principle answer this question by looking purely at the rules given above, but I think it's a lot easier to figure this out by remembering that zip trees are skiplists in disguise. For example, here's the above zip tree, with its associated skiplist:
Now, suppose we want to insert 18 into this zip tree. To see how this might play out, imagine that we decide to give 18 a rank of 2. Rather than looking at the zip tree, let's look at what would happen if we did the insertion into the skiplist. That would give rise to this skiplist:
If we were to take this skiplist and encode it as a zip tree, we'd get the following result:
What's interesting about this is that we can see what the tree needs to look like after the insertion, even if we don't know how to perform the insertion. We can then try to figure out what the insertion logic needs to look like by reverse-engineering it from these "before" and "after" pictures.
Let's think about what change this insertion made to our zip tree. To begin with, let's think back to our intuition for how we encode skiplists as zip trees. Specifically, chains of nodes at the same level in a skiplist with no intervening "higher" elements map to chains of nodes in the zip tree that lean to the right. Inserting an element into the skiplist corresponds to adding some new element into one of the levels, which has the effect of (1) adding in something new into some level of the skiplist, and (2) taking chains of elements in the skiplist that previously were adjacent at some level, then breaking those connections.
For example, when we inserted 18 into the skiplist shown here, we added something new into the blue chain highlighted here, and we broke all of the red chains shown here:
What is that going to translate into in our zip tree? Well, we can highlight the blue link where our item was inserted here, as well as the red links that were cut:
Let's see if we can work out what's going on here. The blue link here is, fortunately, pretty easy to find. Imagine we do a regular BST insertion to add 18 into our tree. As we're doing so, we'll pause when we reach this point:
Notice that we've hit a key with the same rank as us. That means that, if we were to keep moving to the right, we'd trace out this region of the skiplist:
To find the blue edge - the place where we go - we just need to walk down through this chain of nodes until we find one bigger than us. The blue edge - our insertion point - is then given by the edge between that node and the one above it.
We can identify this location in a different way: we've found the blue edge - our insertion point - when we've reached a point where the node to insert (1) has a bigger rank than the node to the left, (2) has a rank that's greater than or equal to the node on the right, and (3) if the node to the right has the same rank, our new item to insert is less than the item to the right. The first two rules ensure that we're inserting into the right level of the skiplist, and the last rule ensures that we insert into the right place in that level of the skiplist.
Now, where are our red edges? Intuitively, these are the edges that were "cut" because 18 has been added into the skiplist. Those would be items that previously were between the two nodes on opposite ends of the blue edge, but which now need to get partitioned into the new ranges defined by the split version of that blue edge.
Fortunately, those edges appear in really nice places. Here's where they map to:
(In this picture, I've placed the new node 18 in the middle of the blue edge that we identified in the skiplist. This causes the result not to remain a BST, but we'll fix that in a minute.)
Notice that these are the exact same edges that we'd encounter if we were to finish doing our regular BST insertion - it's the path traced out by looking for 18! And something really nice happens here. Notice that
each time we move to the right, the node, when cut, goes to the right of 18, and
each time we move to the left, the node, when cut, goes to the left of 18.
In other words, once we find the blue edge where we get inserted, we keep walking as though we were doing our insertion as usual, keeping track of the nodes where we went left and the nodes where we went right. We can then chain together all the nodes where we went left and chain together all the nodes where we went right, gluing the results together under our new node. That's shown here:
This operation is called unzipping, and it's where we get the name "zip tree" from. The name kinda makes sense - we're taking two interleaved structures (the left and right chains) and splitting them apart into two simpler linear chains.
To summarize:
Inserting x into a zip tree works as follows:
Assign a random rank to x by flipping coins and counting how many flips were needed to get heads.
Do a search for x. Stop the search once you reach a node where
the node's left child has a lower rank than x,
the node's right child has a rank less than or equal to x, and
the node's right child, if it has the same rank as x, has a larger key than x.
Perform an unzip. Specifically:
Continue the search for x as before, recording when we move left and when we move right.
Chain all the nodes together where we went left by making each the left child of the previously-visited left-moving node.
Chain all the nodes together where we went right by making each the right child of the previously-visited right-moving node.
Make those two chains the children of the node x.
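Here's a minimal recursive sketch of that procedure, assuming distinct keys and a rank already chosen for the new node. I've phrased the stopping rule as a test on the node that x displaces (x outranks it, or ties with it and has the smaller key), which is how I read the three conditions above. The names Node, unzip and insert are my own.

```python
class Node:
    def __init__(self, key, rank):
        self.key, self.rank = key, rank
        self.left = self.right = None

def unzip(node, key):
    """Split a subtree into (everything < key, everything > key)."""
    if node is None:
        return None, None
    if node.key < key:
        # We would move right here, so this node joins the "smaller" chain.
        smaller, larger = unzip(node.right, key)
        node.right = smaller
        return node, larger
    # We would move left here, so this node joins the "larger" chain.
    smaller, larger = unzip(node.left, key)
    node.left = larger
    return smaller, node

def insert(root, x):
    """Insert node x (rank already assigned) and return the new root."""
    if root is None:
        return x
    if x.rank > root.rank or (x.rank == root.rank and x.key < root.key):
        # x belongs at this spot: unzip what was here and hang it under x.
        x.left, x.right = unzip(root, x.key)
        return x
    if x.key < root.key:
        root.left = insert(root.left, x)
    else:
        root.right = insert(root.right, x)
    return root

def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)
```

In real use you'd pick x's rank with the coin-flip rule; fixing the ranks by hand just makes the result easy to check.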
You might notice that this "unzipping" procedure is equivalent to what you'd get if you performed a different operation. You could achieve the same result by inserting x as usual, then using tree rotations to pull x higher and higher in the tree until it came to rest in the right place. This is a perfectly valid alternative strategy for doing insertions, though it's a bit slower because two passes over the tree are required (a top-down pass to insert at a leaf, then a bottom-up pass to do the rotations).
Removing Elements: Zipping
Now that we've seen how to insert elements, how do we remove them?
Let's begin with a helpful observation: if we insert an item into a zip tree and then remove it, we should end up with the exact same tree that we started with. To see why this is, we can point back to a skiplist. If you add and then remove something from a skiplist, then you end up with the same skiplist that you would have had before. So that means that the zip tree needs to end up looking identical to how it started after we add and then remove an element.
To see how to do this, we'd need to perform two steps:
Undo the unzip operation, converting the two chains of nodes formed back into a linear chain of nodes.
Undo the break of the blue edge, restoring the insertion point of x.
Let's begin with how to undo an unzip operation. This, fortunately, isn't too bad. We can identify the chains of nodes that we made with the unzip operation when we inserted x into the zip tree fairly easily - we simply look at the left and right children of x, then move, respectively, purely to the left and purely to the right.
Now, we know that these nodes used to be linked together in a chain. What order do we reassemble them into? As an example, take a look at this part of a zip tree, where we want to remove 53. The chains to the left and right of 53 are highlighted:
If we look at the nodes making up the left and right chains, we can see that there's only one way to reassemble them. The topmost node of the reassembled chain must be 67, since it has rank 3 and will outrank all other items. After that, the next node must be 41, because it's the smaller of the rank-2 elements and elements with the same rank have smaller items on top. By repeating this process, we can reconstruct the chain of nodes, as shown here, simply by using the rules for how zip trees have to be structured:
This operation, which interleaves two chains together into one, is called zipping.
To summarize, here's how a deletion works:
Deleting a node x from a zip tree works as follows:
Find the node x in the tree.
Perform a zip of its left and right subtrees. Specifically:
Maintain "lhs" and "rhs" pointers, initially to the left and right subtrees.
While both those pointers aren't null:
If lhs's rank is greater than or equal to rhs's rank, make lhs's right child rhs, then advance lhs to what used to be lhs's right child. (On a tie lhs goes on top, since it has the smaller key.)
Otherwise, make rhs's left child lhs, then advance rhs to point to what used to be rhs's left child.
Rewire x's parent to point to the result of the zip operation rather than x.
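A recursive sketch of the same idea, again with made-up names and assuming distinct keys. Ties go to the left chain so that the smaller key ends up on top, since equal ranks are only allowed on right children.

```python
class Node:
    def __init__(self, key, rank, left=None, right=None):
        self.key, self.rank = key, rank
        self.left, self.right = left, right

def zip_trees(lhs, rhs):
    """Merge two subtrees where every key in lhs is below every key in rhs."""
    if lhs is None:
        return rhs
    if rhs is None:
        return lhs
    if lhs.rank >= rhs.rank:
        # lhs stays on top; the rest gets zipped into its right spine.
        lhs.right = zip_trees(lhs.right, rhs)
        return lhs
    # rhs outranks lhs; the rest gets zipped into its left spine.
    rhs.left = zip_trees(lhs, rhs.left)
    return rhs

def delete(root, key):
    """Remove key if present and return the new root."""
    if root is None:
        return None
    if key == root.key:
        return zip_trees(root.left, root.right)
    if key < root.key:
        root.left = delete(root.left, key)
    else:
        root.right = delete(root.right, key)
    return root
```

Returning the new subtree root from each call takes care of the "rewire x's parent" step without needing parent pointers.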
More to Explore
To recap our main points: we saw how to represent a skiplist as a BST by using the idea of ranks. That gave rise to the zip tree, which uses ranking rules to determine parent/child relationships. Those rules are maintained using the zip and unzip operations, hence the name.
Doing a full analysis of a zip tree is basically done by reasoning by analogy to a skiplist. We can show, for example, that the expected runtime of an insertion or deletion is O(log n) by pointing at the equivalent skiplist and noting that the time complexity of the equivalent operations there is O(log n). And we can similarly show that these aren't just expected time bounds, but bounds that hold with high probability.
There's a question of how to actually store the information needed to maintain a zip tree. One option would be to simply write the rank of each item down in the nodes themselves. That works, though since ranks are very unlikely to exceed O(log n) due to the nature of geometric random variables, that would waste a lot of space. Another alternative would be to use a hash function on node addresses to generate a random, uniformly-distributed integer in some range, then find the position of the least-significant 1 bit to simulate our coin tosses. That increases the costs of insertions and deletions due to the overhead of computing the hash codes, but also decreases the space usage.
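A small sketch of the bit trick, under a couple of assumptions of mine: Python doesn't give you stable node addresses, so I hash the key instead, and I use the first 64 bits of a SHA-256 digest as the "random" integer. The function names are made up.

```python
import hashlib

def rank_from_bits(bits, width=64):
    """1-based position of the least-significant 1 bit (one coin toss per bit)."""
    if bits == 0:
        return width + 1            # every one of the width tosses came up tails
    return (bits & -bits).bit_length()

def rank_of_key(key):
    """Derive a rank from a hash of the key instead of storing it in the node."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    bits = int.from_bytes(digest[:8], "big")    # take 64 bits of the hash
    return rank_from_bits(bits)
```

Because the rank is recomputed from the key each time, the same key always gets the same rank, which is what lets you avoid storing it.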
Zip trees aren't the first data structure to map skiplists and BSTs together. Dean and Jones developed an alternative presentation of this idea in 2007. There's also another way to exploit this connection. Here, we started with a randomized skiplist, and used it to derive a randomized BST. But we can run this in reverse as well - we can start with a deterministic balanced BST and use that to derive a deterministic skiplist. Munro, Papadakis, and Sedgewick found a way to do this by connecting 2-3-4 trees and skiplists.
And zip trees aren't the only randomized balanced BST. The treap was the first structure to do this, and with a little math you can show that treaps tend to have slightly lower expected heights than zip trees. The tradeoff, though, is that you need more random bits per node than in a zip tree.
Hope this helps!

Number of subtrees of root node in a B tree

The definitions of a B tree I have read in various books all contain the following
Every node except the root node has to be at least half full
If the root node is an index node, it must have at least two children.
I presume that the second special case is to allow a B tree to have, say, only one key and still be valid. However, if the B tree has many nodes, is it still allowed for the root node to have only two subtrees? Won't this break the guarantees of a B tree, like easy splitting and joining operations?

However, if the B tree has many nodes, is it still allowed for the root node to have only two subtrees?
Yes, the root is special-cased because every other internal node has siblings that it can merge with.
Suppose that we delete a key and that, as a result, some internal node has too few children. We have two options in the usual B-tree algorithms: have this node take some children from its siblings or just merge siblings outright (possibly propagating the deficiency toward the root). Neither is an option for the root, so we just exempt it from the minimum children requirement. This increases the max height for a given number of keys by at most one, so the asymptotic running time of operations is unaffected.

Why storing data only in the leaf nodes of a balanced binary-search tree?

I have bought a nice little book about computational geometry. While reading it here and there, I often stumbled over the use of this special kind of binary search tree. These trees are balanced and should store the data only in the leaf nodes, whereas inner nodes should only store values to guide the search down to the leaves.
The following image shows an example of such a tree (where the leaves are rectangles and the inner nodes are circles).
I have two questions:
What is the advantage of not storing data in the inner nodes?
For the purpose of learning, I would like to implement such a tree. Therefore, I thought it might be a good idea to use an AVL tree as the basis, but is it a good idea?
Any kind of helpful resource is very welcome.

What is the advantage of not storing data in the inner nodes?
There are some tree data structures that, by design, require that no data is stored in the inner nodes, such as Huffman code trees and B+ trees. In the case of Huffman trees, the requirement is that no codeword is a prefix of another (e.g. if the path to node 'A' is 101, then the path to node 'B' cannot be 10). In the case of B+ trees, it comes from the fact that it is optimized for block-search (this also means that every internal node has a lot of children, and that the tree is usually only a few levels deep).
For the purpose of learning, I would like to implement such a tree. Therefore, I thought it might be a good idea to use an AVL tree as the basis, but is it a good idea?
Sure! An AVL tree is not extremely complicated, so it's a good candidate for learning.

It is common to have other kinds of binary trees with data at the leaves instead of the interior nodes, but fairly uncommon for binary SEARCH trees.
One reason you might WANT to do this is educational -- it's often EASIER to implement a binary search tree this way than the traditional way. Why? Almost entirely because of deletions. Deleting a leaf is usually very easy, whereas deleting an interior node is harder/messier. If your data is only at the leaves, then you are always in the easy case!
It's worth thinking about where the keys on interior nodes come from. Often they are duplicates of keys that are also at the leaves (with data). Later, if the key at the leaf is deleted, the key at the interior nodes might still hang around.
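To illustrate where those interior keys can come from, here's a minimal sketch of a leaf-oriented search tree built from sorted data. Each inner node's router key is just a copy of the largest key in its left subtree. It's a static build with no rebalancing, and the class and function names are made up.

```python
class Leaf:
    def __init__(self, key, value):
        self.key, self.value = key, value

class Inner:
    def __init__(self, router, left, right):
        # Keys <= router live in the left subtree, larger keys in the right.
        self.router, self.left, self.right = router, left, right

def build(pairs):
    """pairs: non-empty list of (key, value) tuples sorted by key."""
    if len(pairs) == 1:
        return Leaf(*pairs[0])
    mid = len(pairs) // 2
    router = pairs[mid - 1][0]      # duplicate of the largest key on the left
    return Inner(router, build(pairs[:mid]), build(pairs[mid:]))

def lookup(node, key):
    # Inner nodes only steer the search; the answer is always at a leaf.
    while isinstance(node, Inner):
        node = node.left if key <= node.router else node.right
    return node.value if node.key == key else None
```

If a leaf is later removed, the router that was copied from it can stay where it is, since it only has to separate the two subtrees, not name a key that still exists.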

What is the advantage of not storing data in the inner nodes?
In general, there is no advantage in not storing data in the inner nodes. For example, a red-black tree is a balanced tree and it stores its data in both the inner and the leaf nodes.
For the purpose of learning, I would like to implement such a tree. Therefore, I thought it might be a good idea to use an AVL tree as the basis, but is it a good idea?
In my opinion, it is.

One benefit to only keeping the data in leaf nodes (e.g., B+ tree) is that scanning/reading the data is exceedingly simple. The leaf nodes are linked together. So to read the next item when you are at the "end" (right or left) of the data within a given leaf node, you just read the link/pointer to the next (or previous) node and jump to the next leaf page.
With a B tree where data is in every node, you have to traverse the tree to read the data in order. That is certainly a well-defined process but is arguably more complex and typically requires more state information.

I am reading the same book and they say it could be done either way, data storage at external or at internal nodes.
The trees they use are Red-Black.
In any case, here is an article that stores data at internal nodes of a Red Black Tree and then links these data nodes together as a list.
Balanced binary search tree with a doubly linked list in C++
by Arjan van den Boogaard
http://archive.gamedev.net/archive/reference/programming/features/TStorage/default.html

Split 2-3 tree into less-than and greater-than given value X

I need to write a function that receives a key x and splits a 2-3 tree into two 2-3 trees. The first tree contains all nodes that are bigger than x, and the second all nodes that are smaller. It has to run in O(log n). Thanks in advance for any ideas.
edited
I thought about finding key x in the tree, then splitting its two subtrees (bigger and smaller, if they exist) into two trees, and then going up, each time checking the subtrees I have not checked yet and joining them to one of the two trees. My problem is that all leaves must be at the same level.

Walk from the root toward your key and split each node on the way in two, so that one part points at the children larger than the key and the other at the rest. Make each larger part a part of your larger tree, say by having the leftmost node one level higher point at it (don't fix the tree yet, do that at the end). Once you reach the key, you will have your two trees. Then you just need to fix both trees along the path you used (note that the same path exists in both trees).

Assuming you have covered 2-3-4 trees in the lecture already, here is a hint: see whether you can apply the same insertion algorithm for 2-3 trees also. In particular, make insertions always start in the leaf, and then restructure the tree appropriately. When done, determine the complexity of the algorithm you got.
