Implementing rooted trees with an arbitrary number of leaves - algorithm

Let's say the number is given at run-time, e.g. 20. Trees are also not necessarily full. Unfortunately, reducing the number of leaves doesn't seem to be an option either, as the structure of the tree preserves some physical meaning.
Memory efficiency seems to be a big issue for nodes with more than 1 child. If the memory for the child pointer array has to be reserved/allocated up front, then a lot of unused memory might be reserved; if a dynamic array/vector is used to store the pointers, it's slow when reallocation happens.
So my question is: is there a data structure that preserves the relative parent-child relation without using a tree with a high number of leaves?

Related

Most performant way to find all the leaf nodes in a tree data structure

I have a tree data structure where each node can have any number of children, and the tree can be of any height. What is the optimal way to get all the leaf nodes in the tree? Is it possible to do better than just traversing every path in the tree until I hit the leaf nodes?
In practice the tree will usually have a max depth of 5 or so, and each node in the tree will have around 10 children.
I'm open to other types of data structures or special trees that would make getting the leaf nodes especially optimal.
I'm using JavaScript, but I'm really just looking for general recommendations in any language.
Thanks!
Memory layout is essential to optimal retrieval, so the child lists should be contiguous rather than linked lists, and the nodes should be placed one after another in retrieval order.
The more static your tree is, the better layout can be done.
All in one layout
All in one array totally ordered
Pro
memory can be streamed for maximal throughput (hardware pre-fetch)
no unneeded page lookups
normal lookups can be made
no extra memory to make linked lists.
internal nodes use offset to find the child relative to itself
Con
inserting / deleting can be cumbersome
insert / delete O(N)
insert might lead to resize of the array leading to a costly copy
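As a rough illustration of the all-in-one layout, here is a minimal Python sketch. It assumes the tree starts out as nested `(value, children)` tuples and that nodes are stored in breadth-first order; the function names (`flatten`, `children_of`, `leaves_of`) and the `(value, offset, count)` entry format are made up for this example and are not taken from the answer.

```python
from collections import deque

def flatten(root):
    """Flatten a nested (value, [children]) tree into one list, breadth-first.

    Each entry is (value, first_child_offset, child_count). The offset is
    relative to the entry's own index, and a node's children are contiguous.
    """
    flat = []
    queue = deque([root])
    next_free = 1  # index where the next group of children will land
    while queue:
        value, children = queue.popleft()
        index = len(flat)
        offset = next_free - index if children else 0
        flat.append((value, offset, len(children)))
        next_free += len(children)
        queue.extend(children)
    return flat

def children_of(flat, index):
    """Values of the children of the node stored at `index`."""
    _, offset, count = flat[index]
    start = index + offset
    return [value for value, _, _ in flat[start:start + count]]

def leaves_of(flat):
    """One linear pass over the array; no pointer chasing."""
    return [value for value, _, count in flat if count == 0]
```

With this layout, collecting the leaves is a single sequential scan, which is the streaming-friendly access pattern the list above describes. Inserting a node would still require shifting entries and fixing offsets, matching the O(N) cost listed under Con.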
Two array layout
One array for internal nodes
One array for leaves
Internal nodes point to the leaves
Pro
leaf nodes can be streamed at maximum throughput (maybe the best layout if you're mostly interested only in the leaves).
no unneeded page lookups
indirect lookups can be made
Con
if all leaves are ordered, insert / delete can be cumbersome
if leaves are unordered, insertion is easy: just add at the end.
deleting unordered leaves is also a problem if no tombstones are allowed, as the last leaf would have to be moved into the gap and the internal nodes would need fixing up. (This can also be solved with a further level of indirection; see slot-map.)
resizing either array might lead to a large copy, though a smaller one than in the all-in-one layout, as the two arrays can be resized independently.
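The two-array idea could look roughly like the Python sketch below. The class name `TwoArrayTree`, its methods, and the dictionary-based node entries are assumptions made for illustration, and the sketch covers only the unordered-leaves case, where new leaves are appended at the end.

```python
class TwoArrayTree:
    """Internal nodes and leaves live in two separate lists (illustrative only)."""

    def __init__(self):
        # Index 0 of the internal array is the root.
        self.internal = [{"children": [], "leaves": []}]
        self.leaves = []  # leaf payloads, in insertion order

    def add_internal(self, parent):
        """Append a new internal node under `parent`; return its index."""
        self.internal.append({"children": [], "leaves": []})
        index = len(self.internal) - 1
        self.internal[parent]["children"].append(index)
        return index

    def add_leaf(self, parent, payload):
        """Unordered leaves: just append at the end of the leaf array."""
        self.leaves.append(payload)
        index = len(self.leaves) - 1
        self.internal[parent]["leaves"].append(index)
        return index

    def all_leaves(self):
        """Every leaf, without touching the internal nodes at all."""
        return list(self.leaves)

    def leaves_under(self, node):
        """Leaves below one internal node, via indirect lookups."""
        found, stack = [], [node]
        while stack:
            entry = self.internal[stack.pop()]
            found.extend(self.leaves[i] for i in entry["leaves"])
            stack.extend(entry["children"])
        return found
```

Here "give me all leaves" never walks the tree, which is why this layout suits workloads that mostly care about the leaves. Deletion is deliberately left out, since it needs the fix-up or slot-map indirection mentioned above.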
Array of arrays (dynamic sized, C++ vector of vectors)
using contiguous arrays for referencing the children of each node
Pro
running through each child list is fast
each child array may be resized independently
Con
while this removes much of the extra work of linked-list children, the individual lists are dispersed among all the other data, so lookups take extra time.
insert might cause resize and copy of an array.
Finding the leaves of a tree is O(n), which is optimal for a tree, because you have to look at O(n) places to retrieve all n things, plus the branch nodes along the way. The constant overhead is the branch nodes.
If we increase the branching factor, e.g. letting each branch have 32 children instead of 2, we significantly decrease the number of overhead nodes, which might make the traversal faster.
If we skip a branch, we're not including the values in that branch, so we have to look at all branches.
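To make the overhead argument concrete, here is a small Python sketch that collects the leaves and counts how many branch nodes it had to visit on the way. It assumes a toy representation where an internal node is a Python list and anything else is a leaf; `build_full_tree` and `collect_leaves` are hypothetical helper names.

```python
def build_full_tree(branching, depth):
    """Full tree as nested lists; a leaf is the string "leaf"."""
    if depth == 0:
        return "leaf"
    return [build_full_tree(branching, depth - 1) for _ in range(branching)]

def collect_leaves(root):
    """Iterative depth-first traversal.

    Returns (leaves in left-to-right order, number of internal nodes visited).
    """
    leaves, internal_visited = [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, list):
            internal_visited += 1
            stack.extend(reversed(node))  # reversed keeps left-to-right order
        else:
            leaves.append(node)
    return leaves, internal_visited
```

For 1024 leaves, a full binary tree (branching 2, depth 10) makes the traversal visit 1023 branch nodes, while a full tree with branching 32 and depth 2 needs only 33. Every leaf is still visited in both cases, so the traversal remains O(n).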

Why B-Tree for file systems?

I know this is a common question and I saw a few threads in Stack Overflow but still couldn't get it.
Here is an accepted answer from Stack Overflow:
" Disk seeks are expensive. B-Tree structure is designed specifically to
avoid disk seeks as much as possible. Therefore B-Tree packs much more
keys/pointers into a single node than a binary tree. This property
makes the tree very flat. Usually most B-Trees are only 3 or 4 levels
deep and the root node can be easily cached. This requires only 2-3
seeks to find anything in the tree. Leaves are also "packed" this way,
so iterating a tree (e.g. full scan or range scan) is very efficient,
because you read hundreds/thousands data-rows per single block (seek).
In binary tree of the same capacity, you'd have several tens of levels
and sequential visiting every single value would require at least one
seek. "
I understand that a B-tree has more nodes (order) than a BST, so it's definitely flatter and shallower than a BST.
But these nodes are again stored as linked lists right?
I don't understand what they mean when they say that the keys are read as a block, thereby minimising the number of I/Os.
Doesn't the same argument hold for BSTs too? Except that the links will be downwards?
Could someone please explain it to me?
I understand that a B-tree has more nodes (order) than a BST, so it's definitely flatter and shallower than a BST. I don't understand what they mean when they say that the keys are read as a block, thereby minimising the number of I/Os.
Doesn't the same argument hold for BSTs too? Except that the links will be downwards?
Basically, the idea behind using a B+tree in file systems is to reduce the number of disk reads. Imagine that all the blocks in a drive are stored as a sequentially allocated array. In order to search for a specific block you would have to do a linear scan and it would take O(n) every time to find a block. Right?
Now, imagine that you got smart and decided to use a BST, great! You would store all your blocks in a BST, and it would take roughly O(log(n)) to find a block. Remember that every branch is a disk access, which is highly expensive!
But, we can do better! The problem now is that a BST is really "tall". Because every node only has a fanout factor (number of children) of 2, if we had to store N objects, our tree would be on the order of log(N) tall. So we would have to perform up to log(N) accesses to find our leaves.
The idea behind the B+tree structure is to increase the fanout factor (number of children), reducing the height of the tree and, thus, reducing the number of disk accesses that we have to make in order to find a leaf. Remember that every branch is a disk access. For instance, if you pack X keys in a node of a B+tree, every node will point to at most X+1 children.
Also, remember that a B+tree is structured in a way that only the leaves store the actual data. That way, you can pack more keys into the internal nodes in order to fill up one disk block, which, for instance, stores one node of a B+tree. The more keys you pack in a node, the more children it will point to and the shorter your tree will be, thus reducing the number of disk accesses needed to find one leaf.
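The effect of fanout on height can be checked with a few lines of arithmetic. The Python sketch below assumes an idealised tree in which every node is completely full; `levels_needed` is a hypothetical helper, and it uses integer multiplication rather than floating-point logarithms to avoid rounding surprises.

```python
def levels_needed(num_items, fanout):
    """Smallest number of levels h such that fanout**h >= num_items."""
    levels, capacity = 0, 1
    while capacity < num_items:
        capacity *= fanout
        levels += 1
    return levels
```

Under these assumptions, reaching one of a million items takes 20 branches at fanout 2, but only 3 at fanout 100 and 2 at fanout 1000. If every branch is a disk access, that is the difference the answer is describing.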
But these nodes are again stored as linked lists right?
Also, in a B+tree structure, sometimes the leaves are stored in a linked list fashion. Remember that only the leaves store the actual data. That way, with the linked list idea, when you have to perform a sequential access after finding one block you would do it faster than having to traverse the tree again in order to find the next block, right? The problem is that you still have to find the first block! And for that, the B+tree is way better than the linked list.
Imagine that if all the accesses were sequential and started in the first block of the disk, an array would be better than the linked list, because in a linked list you still have to deal with the pointers.
But, the majority of disk accesses, according to Tanenbaum, are not sequential and are accesses to files of small sizes (like 4KB or less). Imagine the time it would take if you had to traverse a linked list every time to access one block of 4KB...
This article explains it way better than me and uses pictures as well:
https://loveforprogramming.quora.com/Memory-locality-the-magic-of-B-Trees
A B-tree node is essentially an array, of pairs {key, link}, of a fixed size which is read in one chunk, typically some number of disk blocks. The links are all downwards. At the bottom layer the links point to the associated records (assuming a B+-tree, as in any practical implementation).
I don't know where you got the linked list idea from.
Each node in a B-tree implemented in disk storage consists of a disk block (normally a handful of kilobytes) full of keys and "pointers" that are accessed as an array and not - as you said - a linked list. The block size is normally file-system dependent and chosen to use the file system's read and write operations efficiently. The pointers are not normal memory pointers, but rather disk addresses, again chosen to be easily used by the supporting file system.
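A toy Python model may help show what "node as an array in one block" means. Everything here is an assumption for illustration: the `blocks` dictionary stands in for the disk, block numbers stand in for disk addresses, and the B+-tree-style convention that keys equal to a separator go to the right child is just one possible choice.

```python
import bisect

# Hypothetical "disk": block number -> node. Reading blocks[n] models one disk read.
blocks = {
    0: {"keys": [10, 20], "children": [1, 2, 3]},  # internal node
    1: {"keys": [1, 5], "children": None},         # leaves
    2: {"keys": [10, 15], "children": None},
    3: {"keys": [20, 30], "children": None},
}

def search(blocks, root_block, key):
    """Return (found, number_of_block_reads)."""
    reads = 0
    block = root_block
    while True:
        node = blocks[block]  # one read fetches the whole key array
        reads += 1
        keys = node["keys"]
        if node["children"] is None:
            i = bisect.bisect_left(keys, key)
            return (i < len(keys) and keys[i] == key), reads
        # Binary search inside the node is done in memory; no extra disk reads.
        block = node["children"][bisect.bisect_right(keys, key)]
```

Each loop iteration costs one block read no matter how many keys the node holds, so the number of reads equals the height of the tree. A pointer-linked BST would instead pay one read per key comparison.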
The main reason for using a B-tree is how it behaves on changes. If you have a permanent structure, a BST is OK, but in that case a hash table is even better. In the case of file systems, you want a structure that changes as little as possible as a whole on inserts or deletes, and where you can perform a find operation with as few reads as possible. B-trees have these properties.

heap and tree data structure implementation difference

So I see that trees are usually implemented as a list where each node is dynamically allocated and each node contains pointers to two of its children.
But a heap is almost always implemented (or so textbooks recommend) using an array. Why is that? Is there some underlying assumption about the uses of these two data structures? E.g. if you are implementing a priority queue using a min heap, then the number of nodes in the queue is constant, and so it can be implemented using an array of fixed size. But when you are talking/teaching about a heap in general, why recommend implementing it using an array? Or, to flip the question a bit, why not recommend learning about trees with an implementation using arrays?
(I assume by heap you mean binary heap; other heaps are almost always linked nodes.)
A binary heap is always a complete tree, and no operation on it moves whole subtrees around or otherwise alters the topology of the tree in any nontrivial way. This is not an assumption, the first is part of the definition of a heap and the second is immediately obvious from the definition of the operations.
First, since the Ahnentafel layout (the implicit array layout, in which the children of the node at index i sit at indices 2i and 2i+1) requires reserving space for every internal node (and all leaf nodes except the rightmost ones), an incomplete tree implemented this way would waste space for nodes that don't exist. Conversely, for a complete tree it's the most efficient layout possible, since all space is actually used for node data, and no space is needed for pointers.
Second, moving a subtree in the array would require copying all child elements to their new positions (since the left child's index is always twice the parent's index, the former changes when the latter changes, recursively down to the leaves). When you have nodes linked via pointers, you only need to move a few pointers around, regardless of how large the trees below those pointers are. Moving subtrees is a core component of many algorithms on trees, including all kinds of binary search trees. It needs to be lightning fast for those algorithms to be efficient. Binary heap operations, however, never need to do this, so it's a non-issue.
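The index arithmetic can be sketched in Python as follows. This is a minimal min-heap using 1-based indices, so that the left child of index i is at 2i as in the text; slot 0 of the list is an unused placeholder, and the function names are made up for this example.

```python
def heap_push(heap, item):
    """heap[0] is an unused placeholder; children of i sit at 2*i and 2*i + 1."""
    heap.append(item)
    i = len(heap) - 1
    # Sift up: swap with the parent (i // 2) while the parent is larger.
    while i > 1 and heap[i // 2] > heap[i]:
        heap[i // 2], heap[i] = heap[i], heap[i // 2]
        i //= 2

def heap_pop(heap):
    """Remove and return the smallest item."""
    top = heap[1]
    last = heap.pop()
    if len(heap) > 1:
        heap[1] = last
        i, n = 1, len(heap)
        # Sift down: swap with the smaller child until heap order holds.
        while True:
            smallest = i
            for child in (2 * i, 2 * i + 1):
                if child < n and heap[child] < heap[smallest]:
                    smallest = child
            if smallest == i:
                break
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest
    return top
```

Both operations only swap single elements along one root-to-leaf path and never relocate a subtree, and the list stays densely packed because the tree is always complete. That is why the array layout costs nothing here, while it would be awkward for a rotating search tree.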

BTree- predetermined size?

I read this on wikipedia:
In B-trees, internal (non-leaf) nodes can have a variable number of
child nodes within some pre-defined range. When data is inserted or
removed from a node, its number of child nodes changes. In order to
maintain the pre-defined range, internal nodes may be joined or split.
Because a range of child nodes is permitted, B-trees do not need
re-balancing as frequently as other self-balancing search trees, but
may waste some space, since nodes are not entirely full.
We have to specify this range for B-trees. Even when I looked up CLRS (Intro to Algorithms), it seemed to make use of arrays for keys and children. My question is: is there any way to reduce this waste of space by defining the keys and children as lists instead of predetermined arrays? Is this too much of a hassle?
Also, for the life of me I'm not able to find decent pseudocode for btreeDeleteNode. Any help here is appreciated too.
When you say "lists", do you mean linked lists?
An array of some kind of element takes up one element's worth of memory per slot, whether that slot is filled or not. A linked list only takes up memory for elements it actually contains, but for each one, it takes up one element's worth of memory, plus the size of one pointer (two if it's a doubly-linked list, unless you can use the xor trick to overlap them).
If you are storing pointers, and using a singly-linked list, then each list link is twice the size of each array slot. That means that unless the list is less than half full, a linked list will use more memory, not less.
If you're using a language whose runtime has per-object overhead (like Java, and like C unless you are handling memory allocation yourself), then you will also have to pay for that overhead on each list link, but only once on an array, and the ratio is even worse.
I would suggest that your balancing algorithm should keep tree nodes at least half full. If you split a node when it is full, you will create two half-full nodes. You then need to merge adjacent nodes when they are less than half full. You can then use an array, safe in the knowledge that it is more efficient than a linked list.
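The break-even argument can be written out as a small calculation. The Python sketch below counts memory in pointer-sized words and assumes the simplest case from the answer: the stored elements are themselves pointers, the list is singly linked, and there is no per-object allocator overhead. The function names are hypothetical.

```python
def array_words(capacity):
    """A preallocated array pays one word per slot, used or not."""
    return capacity

def singly_linked_words(count):
    """Each list link holds one element pointer plus one next pointer."""
    return 2 * count

def cheaper_layout(capacity, count):
    """Which layout uses fewer words for `count` elements?"""
    array_cost = array_words(capacity)
    list_cost = singly_linked_words(count)
    if list_cost < array_cost:
        return "linked list"
    if list_cost > array_cost:
        return "array"
    return "tie"
```

For a node with 8 slots, the linked list only wins while the node holds fewer than 4 elements. A balancing rule that keeps nodes at least half full therefore means the array is never the worse choice under these assumptions, and any per-object overhead tilts the result further towards the array.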
No idea about the details of deletion, sorry!
A B-tree node has an important characteristic: all keys in the node are sorted. When looking for a specific key, binary search is used to find the right position. Using binary search keeps the complexity of the B-tree search algorithm at O(log n).
If you replace the preallocated array with some kind of linked list, you lose random access, and with it binary search, unless you use a more complex data structure such as a skip list to keep the search at O(log n). But that is unnecessary; at that point a skip list on its own would be the better choice.

What are the advantages of T-trees over B+/-trees?

I have explored the definitions of T-trees and B-/B+ trees. From papers on the web I understand that B-trees perform better in hierarchical memory, such as disk drives and cached memory.
What I can not understand is why T-trees were/are used even for flat memory?
They are advertised as space efficient alternative to AVL trees.
In the worst case, all leaf nodes of a T-tree contain just one element and all internal nodes contain the minimum amount allowed, which is close to full. This means that on average only half of the allocated space is utilized. Unless I am mistaken, this is the same utilization as the worst case of B-trees, when the nodes of a B-tree are half full.
Assuming that both trees store the keys locally in the nodes, but use pointers to refer to the records, the only difference is that B-trees have to store pointers for each of the branches. This would generally cause up to 50% overhead or less (over T-trees), depending on the size of the keys. In fact, this is close to the overhead expected in AVL trees, assuming no parent pointer, records embedded in the nodes, keys embedded in the records. Is this the expected efficiency gain that prevents us from using B-trees instead?
T-trees are usually implemented on top of AVL trees. AVL trees are more balanced than B-trees. Can this be connected with the application of T-trees?
I can give you a personal story that covers half of the answer, that is, why I wrote some Pascal code to program B+ trees some 18 years ago.
My target system was a PC with two disk drives. I had to store an index on non-volatile memory, and I wanted to understand better what I was learning at university. I was very dissatisfied with the performance of a commercial package, probably DBase III or some Fox product, I can't remember.
Anyhow, I needed these operations:
lookup
insertion
deletion
next item
previous item
the maximum size of the index was not known
so the data had to reside on disk
each access to the storage medium had a high cost
reading a whole block cost the same as reading one byte
B+-trees made that small slow PC really fly through the data!
The leaves had two extra pointers so that they formed a doubly linked list, for sequential searches.
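The "next item" and "previous item" operations on linked leaves might look like the Python sketch below. The `Leaf` class and the helper functions are assumptions for illustration, not a reconstruction of the Pascal code described above, and the sketch assumes the starting leaf has already been found by a normal tree lookup.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys  # sorted keys held by this leaf
        self.prev = None  # the two extra pointers
        self.next = None

def link_leaves(leaves):
    """Chain the leaves, given in key order, into a doubly linked list."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
        right.prev = left

def scan_forward(leaf, start_key):
    """All keys >= start_key, ascending, starting from the given leaf."""
    result = []
    while leaf is not None:
        result.extend(k for k in leaf.keys if k >= start_key)
        leaf = leaf.next
    return result

def scan_backward(leaf, start_key):
    """All keys <= start_key, descending, starting from the given leaf."""
    result = []
    while leaf is not None:
        result.extend(k for k in reversed(leaf.keys) if k <= start_key)
        leaf = leaf.prev
    return result
```

Once the first leaf is located, a sequential scan follows the leaf links and never goes back up through the internal nodes, so each further block read yields a whole leaf's worth of keys.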
In reality the difference lies in the system you use. As my tutor at university put it: whether your problem is a shortage of memory or a shortage of HDD space will determine which tree, and which implementation, you use. Most probably it will be a B+ tree.
Because there are hundreds of implementations (for instance with two-directional queues, or with one-directional queues where you need to loop through the elements), and multiple ways to store and retrieve the index, the real pros and cons depend on the particular implementation.

Resources