So I see that trees are usually implemented as linked structures, where each node is dynamically allocated and contains pointers to its two children.
But a heap is almost always implemented (or so textbooks recommend) using an array. Why is that? Is there some underlying assumption about the uses of these two data structures? For example, if you are implementing a priority queue using a min-heap, then the number of nodes in the queue is constant, and so it can be implemented using an array of fixed size. But when you are talking/teaching about a heap in general, why recommend implementing it using an array? Or, to flip the question a bit, why not recommend learning about trees with an array-based implementation?
(I assume by heap you mean binary heap; other heaps are almost always linked nodes.)
A binary heap is always a complete tree, and no operation on it moves whole subtrees around or otherwise alters the topology of the tree in any nontrivial way. This is not an assumption: the first property is part of the definition of a heap, and the second is immediately obvious from the definition of the operations.
First, since the Ahnentafel layout requires reserving space for every internal node (and all leaf nodes except the rightmost ones), an incomplete tree implemented this way would waste space for nodes that don't exist. Conversely, for a complete tree it's the most efficient layout possible, since all space is actually used for node data, and no space is needed for pointers.
Second, moving a subtree in the array would require copying all child elements to their new positions (since the left child's index is always twice the parent's index, the former changes whenever the latter changes, recursively down to the leaves). When nodes are linked via pointers, you only need to move a few pointers around, regardless of how large the trees below those pointers are. Moving subtrees is a core component of many algorithms on trees, including all kinds of binary search trees, and it needs to be lightning fast for those algorithms to be efficient. Binary heap operations, however, never need to do this, so it's a non-issue.
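To make the index arithmetic concrete, here is a minimal Python sketch of the 1-based Ahnentafel scheme described above; slot 0 is left unused so that the left child sits at exactly twice the parent's index:

```python
# 1-based Ahnentafel indexing; slot 0 is unused so that left(i) == 2 * i
def parent(i): return i // 2
def left(i):   return 2 * i
def right(i):  return 2 * i + 1

heap = [None, 1, 3, 2, 7, 4]  # a valid min-heap stored in an array
for i in range(2, len(heap)):
    assert heap[parent(i)] <= heap[i]  # the heap property holds at every node
```

No pointers are stored anywhere; the tree structure is entirely implicit in the indices.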
Let's say the number is given at run-time, e.g. 20. Trees are also not necessarily full. Unfortunately, reducing the number of leaves doesn't seem to be an option either, as the structure of the tree preserves some physical meaning.
Memory efficiency seems to be a big issue for nodes with more than one child. If the memory for the child-pointer array has to be reserved/allocated up front, then a lot of unused memory might be reserved; if a dynamic array/vector is used to store the pointers, it's slow when reallocation happens.
So my question is: is there a data structure that preserves the relative parent-child relation without using a tree with a high number of leaves?
Is there a balanced BST structure that also keeps track of subtree size in each node?
In Java, TreeMap is a red-black tree, but doesn't provide subtree size in each node.
Previously, I did write a BST that could keep track of the subtree size of each node, but it's not balanced.
The questions are:
Is it possible to implement such a tree while keeping the basic operations at O(log n)?
If yes, are there any third-party libraries providing such an implementation?
A Java implementation is great, but other languages (e.g. C, Go) would also be helpful.
BTW:
The subtree size should be tracked in each node, so that the size can be obtained without traversing the subtree.
Possible application:
Keeping track of the rank of items whose value (which the rank depends on) might change on the fly.
The Weight Balanced Tree (also called the Adams Tree, or Bounded Balance tree) keeps the subtree size in each node.
This also makes it possible to find the Nth element, from the start or end, in log(n) time.
My implementation in Nim is on github. It has properties:
Generic (parameterized) key,value map
Insert (add), lookup (get), and delete (del) in O(log(N)) time
Key-ordered iterators (inorder and revorder)
Lookup by relative position from beginning or end (getNth) in O(log(N)) time
Get the position (rank) by key in O(log(N)) time
Efficient set operations using tree keys
Map extensions to set operations with optional value merge control for duplicates
There are also implementations in Scheme and Haskell available.
That's called an "order statistic tree": https://en.wikipedia.org/wiki/Order_statistic_tree
It's pretty easy to add the size to any kind of balanced binary tree (red-black, AVL, B-tree, etc.), or you can use a balancing algorithm that works with the size directly, like weight-balanced trees (see #DougCurrie's answer) or (better) size-balanced trees: https://cs.wmich.edu/gupta/teaching/cs4310/lectureNotes_cs4310/Size%20Balanced%20Tree%20-%20PEGWiki%20sourceMayNotBeFullyAuthentic%20but%20description%20ok.pdf
Unfortunately, I don't think there are any standard-library implementations, but you can find open source if you look for it. You may want to roll your own.
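As a rough illustration of the augmentation, here is a minimal Python sketch of a size-augmented BST with select (k-th smallest) and rank. Balancing is deliberately omitted for brevity; a real implementation would also update the size fields during rotations:

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def size(n):
    return n.size if n else 0

def insert(n, key):
    # Plain BST insert; a balanced variant would rebalance here and
    # recompute sizes on every rotation.
    if n is None:
        return Node(key)
    if key < n.key:
        n.left = insert(n.left, key)
    else:
        n.right = insert(n.right, key)
    n.size = 1 + size(n.left) + size(n.right)
    return n

def select(n, k):
    # k-th smallest key, 0-based; O(height) thanks to the size fields
    r = size(n.left)
    if k == r:
        return n.key
    return select(n.left, k) if k < r else select(n.right, k - r - 1)

def rank(n, key):
    # number of keys strictly smaller than key (assumes distinct keys)
    if n is None:
        return 0
    if key <= n.key:
        return rank(n.left, key)
    return size(n.left) + 1 + rank(n.right, key)
```

With keys 5, 2, 8, 1 inserted, `select(root, 0)` returns 1 and `rank(root, 8)` returns 3, all without traversing whole subtrees.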
I have a tree data structure where each node can have any number of children, and the tree can be of any height. What is the optimal way to get all the leaf nodes in the tree? Is it possible to do better than just traversing every path in the tree until I hit the leaf nodes?
In practice the tree will usually have a max depth of 5 or so, and each node in the tree will have around 10 children.
I'm open to other types of data structures or special trees that would make getting the leaf nodes especially optimal.
I'm using javascript but really just looking for general recommendations, any language etc.
Thanks!
Memory layout is essential to optimal retrieval, so the child lists should be contiguous arrays rather than linked lists, and the nodes should be placed one after another in retrieval order.
The more static your tree is, the better layout can be done.
All in one layout
All in one array totally ordered
Pro
memory can be streamed for maximal throughput (hardware pre-fetch)
no unneeded page lookups
normal lookups can be made
no extra memory to make linked lists.
internal nodes use an offset to find each child relative to themselves
Con
inserting / deleting can be cumbersome
insert / delete O(N)
insert might lead to resize of the array leading to a costly copy
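A minimal Python sketch of the all-in-one layout, with hypothetical node values, where each internal node stores child offsets relative to its own index:

```python
# Flat array in retrieval (preorder) order; each entry is (value, child_offsets),
# with offsets relative to the node's own index. Values are made up.
nodes = [
    ("A", [1, 3]),  # children of A at indices 0 + 1 and 0 + 3
    ("B", [1]),     # child of B at index 1 + 1
    ("D", []),      # leaf
    ("C", []),      # leaf
]

def children(i):
    _, offsets = nodes[i]
    return [i + off for off in offsets]

assert [nodes[j][0] for j in children(0)] == ["B", "C"]
```

Because offsets are relative, a block of nodes can be memcpy'd or streamed without any pointer fixups.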
Two array layout
One array for internal nodes
One array for leafs
Internal nodes points to the leafs
Pro
leaf nodes can be streamed at maximum throughput (maybe the best layout if you're mostly interested in the leaves).
no unneeded page lookups
indirect lookups can be made
Con
if all leaves are ordered, insert / delete can be cumbersome
if leaves are unordered, insertion is easy: just add at the end.
deleting unordered leaves is also a problem if no tombstones are allowed, as the last leaf would have to be moved back and the internal nodes would need fixing up (via a further indirection this can also be fixed; see slot-map)
resizing either array might lead to a large copy, though less so than the all-in-one layout, as the two arrays can be resized independently.
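A sketch of the two-array layout with made-up data: the leaves sit contiguously in one array, and internal nodes hold (start, end) ranges into it, so any subtree's leaves can be streamed as a single slice:

```python
leaves = [10, 11, 12, 13, 14, 15]   # all leaf payloads, contiguous in memory

internal = {                         # hypothetical internal nodes holding
    "root":  (0, 6),                 # half-open [start, end) ranges into
    "left":  (0, 3),                 # the leaves array
    "right": (3, 6),
}

def subtree_leaves(name):
    start, end = internal[name]
    return leaves[start:end]         # one contiguous slice, no pointer chasing

assert subtree_leaves("left") == [10, 11, 12]
```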
Array of arrays (dynamic sized, C++ vector of vectors)
using contiguous arrays for referencing the children of each node
Pro
running through each child list is fast
each child array may be resized independently
Con
while removing much of the extra work of linked-list children, the individual lists are dispersed among all the other data, making lookups take extra time.
insert might cause a resize and copy of an array.
Finding the leaves of a tree is O(n), which is optimal for a tree, because you have to visit O(n) places to retrieve all n things; the branch nodes along the way are only constant-factor overhead.
If we increase the branching factor, e.g. letting each branch have 32 children instead of 2, we significantly decrease the number of overhead nodes, which might make the traversal faster.
If we skip a branch, we're not including the values in that branch, so we have to look at all branches.
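The O(n) traversal itself is straightforward. A sketch in Python (the question is language-agnostic, so the dict-based node shape here is a hypothetical one):

```python
def collect_leaves(root):
    """Iterative depth-first traversal; visits every node exactly once, O(n)."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        if node["children"]:
            stack.extend(reversed(node["children"]))  # keep left-to-right order
        else:
            out.append(node["value"])
    return out

tree = {"value": "r", "children": [
    {"value": "a", "children": []},
    {"value": "b", "children": [{"value": "c", "children": []}]},
]}
assert collect_leaves(tree) == ["a", "c"]
```

An iterative stack is used instead of recursion so a depth-5, fan-out-10 tree (or a much deeper one) never risks recursion limits.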
Which one is better memory-wise (RAM): a linked list or a tree?
A linked list is a linear structure, whereas a tree is a leveled structure (with child nodes).
Which one is better memory-wise, not search-wise?
Besides Damien's witty comment: what sort of tree? Binary? Red/black? Ternary? With a linked list of children for each node? Nodes referencing their parent or not?
Once you choose your data structure, you just look at the overhead for each node. For instance, a singly linked list node's overhead is one pointer to the next element. A simple binary tree node's overhead will typically be two pointers: one to each child. So there you go, simple as that: that particular list would have half the overhead of that particular tree, considering only the data structure itself.
When comparing a linked list and a tree, memory is rarely the deciding factor, because the purposes of these two data structures are completely different. In terms of memory, a linked list can be compared to a vector (an array): because a vector stores items in adjacent memory, it does not need a pointer alongside each item, so a vector/array consumes less memory. A tree needs a vector of children in each node, where each item in this vector is a pointer to a child node. So a tree consumes at least as much memory as a linked list, because for each node except the root, a pointer to that node is stored in its parent.
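The back-of-envelope comparison can be made explicit. Assuming 64-bit pointers and counting only structural overhead (not payload, and not allocator or object headers):

```python
PTR = 8  # assumed 64-bit pointer size, in bytes

def singly_linked_list_overhead(n):
    return n * PTR       # one "next" pointer per node

def binary_tree_overhead(n):
    return n * 2 * PTR   # two child pointers per node

# the binary tree carries exactly twice the pointer overhead of the list
assert binary_tree_overhead(1000) == 2 * singly_linked_list_overhead(1000)
```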
I read this on wikipedia:
In B-trees, internal (non-leaf) nodes can have a variable number of
child nodes within some pre-defined range. When data is inserted or
removed from a node, its number of child nodes changes. In order to
maintain the pre-defined range, internal nodes may be joined or split.
Because a range of child nodes is permitted, B-trees do not need
re-balancing as frequently as other self-balancing search trees, but
may waste some space, since nodes are not entirely full.
We have to specify this range for B-trees. Even when I looked up CLRS (Intro to Algorithms), it seemed to make use of arrays for keys and children. My question is: is there any way to reduce this wasted space by defining the keys and children as lists instead of fixed-size arrays? Is this too much of a hassle?
Also, for the life of me I'm not able to find decent pseudocode for btreeDeleteNode. Any help here is appreciated too.
When you say "lists", do you mean linked lists?
An array of some kind of element takes up one element's worth of memory per slot, whether that slot is filled or not. A linked list only takes up memory for elements it actually contains, but for each one, it takes up one element's worth of memory, plus the size of one pointer (two if it's a doubly-linked list, unless you can use the xor trick to overlap them).
If you are storing pointers, and using a singly-linked list, then each list link is twice the size of each array slot. That means that unless the list is less than half full, a linked list will use more memory, not less.
If you're using a language whose runtime has per-object overhead (like Java, and like C unless you are handling memory allocation yourself), then you will also have to pay for that overhead on each list link, but only once on an array, and the ratio is even worse.
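For pointer-sized elements, the "half full" break-even described above works out like this (allocator and per-object overhead ignored, pointer size assumed to be 64-bit):

```python
PTR = 8  # assumed pointer size in bytes

def array_bytes(capacity):
    return capacity * PTR    # one slot per element, used or not

def slist_bytes(count):
    return count * 2 * PTR   # element pointer + next pointer per link

capacity = 8
assert slist_bytes(capacity // 2) == array_bytes(capacity)  # break-even at half full
assert slist_bytes(5) > array_bytes(capacity)               # past half full, the list loses
```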
I would suggest that your balancing algorithm should keep tree nodes at least half full. If you split a node when it is full, you will create two half-full nodes. You then need to merge adjacent nodes when they are less than half full. You can then use an array, safe in the knowledge that it is more efficient than a linked list.
No idea about the details of deletion, sorry!
A B-tree node has an important characteristic: all keys in the node are sorted. When finding a specific key, binary search is used to find the right position. Using binary search keeps the complexity of the search algorithm in a B-tree at O(log n).
If you replace the preallocated array with some kind of linked list, you lose constant-time indexing, so binary search within a node no longer works. You could use a more complex data structure, like a skip list, to keep the search algorithm at O(log n), but that's totally unnecessary: at that point a skip list by itself is better.
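The in-node binary search is a few lines with Python's bisect module; `keys` here is a hypothetical sorted key array of a single B-tree node:

```python
import bisect

keys = [7, 15, 23, 42]            # sorted keys of one (hypothetical) B-tree node

i = bisect.bisect_left(keys, 23)  # O(log k) search within the node
assert i == 2 and keys[i] == 23   # key found at slot 2

j = bisect.bisect_left(keys, 30)  # key absent: j names the child to descend into
assert j == 3                     # descend into the child between keys 23 and 42
```

This constant-time indexing into a contiguous array is exactly what a linked-list node representation would give up.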