Maximum number of nodes in a Huffman tree

For the purpose of memory performance optimizations while building a Huffman tree I would like to pre-allocate the necessary memory for its nodes.
Is there a way to calculate the maximum number of nodes (internal nodes plus leaves)?
Input for the calculation should be the table of symbols and their probabilities/frequencies. I would like to avoid a simulated tree-building run. Instead it should be a plain calculation that need not give the actual/optimal number of nodes, but a reliable maximum.

If there are n symbols, then there are n-1 internal nodes, and 2n-1 nodes in total (internal nodes plus leaves), which is what you're calling nodes. It's always exactly that, not merely a maximum, so you can pre-allocate exactly 2n-1 nodes.
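As an illustration, a small sketch (Python; the `huffman_node_count` helper is mine, not from the question) that builds the tree with a heap and confirms the 2n-1 count:

```python
import heapq
import itertools

def huffman_node_count(freqs):
    """Build a Huffman tree over the given symbol frequencies and
    return the total number of nodes (leaves + internal)."""
    counter = itertools.count()  # tie-breaker so tuples always compare
    heap = [(f, next(counter), None) for f in freqs]
    heapq.heapify(heap)
    nodes = len(heap)            # one leaf per symbol
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, (a[0] + b[0], next(counter), (a, b)))
        nodes += 1               # each merge creates one internal node
    return nodes

# n symbols always yield exactly 2n - 1 nodes:
for n in (1, 2, 5, 20):
    assert huffman_node_count(range(1, n + 1)) == 2 * n - 1
```

So the "plain calculation" the question asks for is simply 2n-1, independent of the frequencies.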

Related

What are the maximum number of duplicates for a given key in a B+-tree

Given a B+-tree with a branching factor of g and s being the number of keys contained, what would be the maximum number of duplicates allowed for a single key (let's call it si)? And how do we calculate that number?
My first idea was that each level can have one instance of si, so my idea would be to maximise the depth, which would be our answer; however, I'm not sure about this.
I have searched online but it seems no one has asked this question before. This is my first time asking a question here, so any feedback is welcome.
Many thanks.
A proper B-tree cannot contain duplicate 'keys' because the values in question would not uniquely identify their associated data and hence they would not be keys.
The reason is that B-trees do not distinguish between nodes for routing searches (internal nodes) and nodes for storing data (leaf nodes) as some other data structures do. All nodes in a classic B-tree are data-bearing, and the internal nodes just happen to also route search traffic.
B+ trees, on the other hand, store all data in leaf nodes and use the nodes above the leaf level (a.k.a. 'index layer') only for routing. The values in internal nodes - a.k.a. 'separator keys' - only have to guide searches but they do not have to uniquely identify any data records. They do not even have to correspond to any actually existing key values (i.e. ones that have associated data). That often makes it possible to shorten the separator keys drastically, as long as they keep separating the same things. For example, "F" is just as effective at separating "Arthur Dent" from "Zaphod Beeblebrox" as "Ford Prefect" is.
One consequence is that one and the same value could potentially occur at each and every level of the B+ tree without any ill effect, since only the one and only occurrence at the leaf level actually works as a data key; the ones in internal nodes serve only to guide searches.
Another consequence is that internal nodes in B+ trees can usually hold orders of magnitude more keys than internal nodes in B-trees (which are diluted by record data and do not allow shortening of key values). This increases I/O and cache efficiency, and it usually makes B+ trees more shallow than B-trees containing the same data.
As for calculating the height: outside of homework assignments you actually need two node capacity values to describe a B+ tree, one for the maximum number of separator keys in an internal node and one for the maximum number of keys (data records) in a leaf node. Basically, you divide the number of records by the leaf node capacity to determine the number of 'outputs' that the index layer needs to have, and then feed that into the usual B-tree height formula in order to determine the height of the index layer.
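The calculation described above can be sketched like this (the function name and the assumption of maximally packed nodes are mine, not from the answer):

```python
import math

def bplus_height(num_records, leaf_cap, fanout):
    """Height of the index layer of a B+ tree (0 means the root is a leaf),
    assuming maximally packed nodes. leaf_cap is the max records per leaf;
    fanout is the max children per internal node (separator keys + 1)."""
    # Number of 'outputs' the index layer needs to have:
    leaves = math.ceil(num_records / leaf_cap)
    # Usual B-tree height recurrence over the index layer:
    height, nodes = 0, leaves
    while nodes > 1:
        nodes = math.ceil(nodes / fanout)
        height += 1
    return height

# 1M records, 100 records per leaf, fanout 100 -> 10,000 leaves,
# routed by a 2-level index layer:
assert bplus_height(1000000, 100, 100) == 2
```

Real trees are rarely fully packed, so this gives a lower bound on height; replacing the capacities with minimum fill factors gives the corresponding upper bound.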

Implementing rooted trees with arbitrary number of leaves

Let's say the number is given at run-time, e.g. 20. Trees are also not necessarily full. Unfortunately, reducing the number of leaves doesn't seem to be an option either, as the structure of the tree preserves some physical meaning.
Memory efficiency seems to be a big issue for nodes with more than 1 child. If the memory for the child pointer array has to be reserved/allocated up front, then a lot of unused memory might be reserved; if using a dynamic array/vector to hold the pointers, it's slow when reallocation happens.
So my question is: is there a data structure that preserves the relative parent-child relation while not using a tree with a high number of leaves?

Most performant way to find all the leaf nodes in a tree data structure

I have a tree data structure where each node can have any number of children, and the tree can be of any height. What is the optimal way to get all the leaf nodes in the tree? Is it possible to do better than just traversing every path in the tree until I hit the leaf nodes?
In practice the tree will usually have a max depth of 5 or so, and each node in the tree will have around 10 children.
I'm open to other types of data structures or special trees that would make getting the leaf nodes especially optimal.
I'm using javascript but really just looking for general recommendations, any language etc.
Thanks!
Memory layout is essential to optimal retrieval, so the child lists should be contiguous (not linked lists), and the nodes should be placed after each other in retrieval order.
The more static your tree is, the better the layout can be.
All in one layout
All in one array totally ordered
Pro
memory can be streamed for maximal throughput (hardware pre-fetch)
no unneeded page lookups
normal lookups can be made
no extra memory to make linked lists.
internal nodes use offset to find the child relative to itself
Con
inserting / deleting can be cumbersome
insert / delete O(N)
insert might lead to resize of the array leading to a costly copy
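A minimal sketch of the all-in-one layout (the `FlatTree` shape and names are illustrative, not from the answer; in C++ the per-node offset lists would themselves live in the same flat buffer):

```python
class FlatTree:
    """All nodes in one flat array in traversal order; each node refers
    to its children by offsets relative to its own index."""
    def __init__(self):
        self.values = []     # node payloads, contiguous in traversal order
        self.children = []   # per node: offsets of children relative to that node

    def add_node(self, value, parent=None):
        idx = len(self.values)
        self.values.append(value)
        self.children.append([])
        if parent is not None:
            self.children[parent].append(idx - parent)  # relative offset
        return idx

    def child_indices(self, idx):
        return [idx + off for off in self.children[idx]]

t = FlatTree()
root = t.add_node('root')
a = t.add_node('a', root)
b = t.add_node('b', root)
leaf = t.add_node('leaf', a)
assert t.child_indices(root) == [a, b]
assert t.child_indices(a) == [leaf]
```

Because offsets are relative, the whole block can be moved or memory-mapped elsewhere without fixing up pointers, which is part of the appeal of this layout.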
Two array layout
One array for internal nodes
One array for leaves
Internal nodes point to the leaves
Pro
leaf nodes can be streamed at maximum throughput (maybe the best layout if you're mostly interested only in the leaves).
no unneeded page lookups
indirect lookups can be made
Con
if all leaves are ordered, insert / delete can be cumbersome
if leaves are unordered, insertion is easy: just add at the end.
deleting unordered leaves is also a problem if no tombstones are allowed, as the last leaf would have to be moved back and the internal nodes would need fixing up (via a further indirection this can also be fixed, see slot-map)
resizing of either array might lead to a large copy, though less than the all-in-one layout, as they can be done independently.
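The unordered "move the last leaf back" delete mentioned above can be sketched like this (the `index` dict stands in for the internal nodes' references to leaves; all names are illustrative):

```python
leaves = ['a', 'b', 'c', 'd']
index = {v: i for i, v in enumerate(leaves)}   # internal-node references

def delete_leaf(value):
    """O(1) delete from an unordered leaf array: swap the last leaf into
    the hole, then fix up the single reference to the moved leaf."""
    i = index.pop(value)
    last = leaves.pop()
    if i < len(leaves):        # skip fix-up when we deleted the last leaf
        leaves[i] = last
        index[last] = i

delete_leaf('b')
assert leaves == ['a', 'd', 'c']
assert index == {'a': 0, 'c': 2, 'd': 1}
```

A slot-map adds one more indirection so that external handles stay stable even as leaves move, at the cost of one extra lookup per access.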
Array of arrays (dynamic sized, C++ vector of vectors)
using contiguous arrays for referencing the children of each node
Pro
running through each child list is fast
each child array may be resized independently
Con
while removing much of the extra work of linked-list children, the individual lists are dispersed among all the other data, making lookups take extra time.
insert might cause a resize and copy of an array.
Finding the leaves of a tree is O(n), which is optimal for a tree, because you have to look at O(n) places to retrieve all n things, plus the branch nodes along the way. The constant overhead is the branch nodes.
If we increase the branching factor, e.g. letting each branch have 32 children instead of 2, we significantly decrease the number of overhead nodes, which might make the traversal faster.
If we skip a branch, we're not including the values in that branch, so we have to look at all branches.
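The plain O(n) traversal can be sketched like this (the dict shape of a node is an assumption for the example; the original asker is in JavaScript, where the structure would be the same):

```python
def collect_leaves(node):
    """Iteratively gather all leaf values; each node is visited once, so O(n).
    A node here is a dict {'value': ..., 'children': [...]}."""
    leaves, stack = [], [node]
    while stack:
        n = stack.pop()
        if n['children']:
            stack.extend(reversed(n['children']))  # keep left-to-right order
        else:
            leaves.append(n['value'])
    return leaves

tree = {'value': 'r', 'children': [
    {'value': 'a', 'children': []},
    {'value': 'b', 'children': [{'value': 'c', 'children': []}]},
]}
assert collect_leaves(tree) == ['a', 'c']
```

With depth ~5 and ~10 children per node, the internal nodes add roughly 10% overhead on top of visiting the leaves themselves, so there is little to gain from anything fancier unless you also change the memory layout as described above.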

GPU-based inclusive scan on an unbalanced tree

I have the following problem: I need to compute the inclusive scans (e.g. prefix sums) of values based on a tree structure on the GPU. These scans are either from the root node (top-down) or from the leaf nodes (bottom-up). The case of a simple chain is easily handled, but the tree structure makes parallelization rather difficult to implement efficiently.
For instance, after a top-down inclusive scan, (12) would hold (0)[op](6)[op](7)[op](8)[op](11)[op](12), and for a bottom-up inclusive scan, (8) would hold (8)[op](9)[op](10)[op](11)[op](12), where [op] is a given binary operator (matrix addition, multiplication etc.).
One also needs to consider the following points:
For a typical scenario, the length of the different branches should not be too long (~10), with something like 5 to 10 branches, so this is something that will run inside a block and work will be split between the threads. Different blocks will simply handle different values of nodes. This is obviously not optimal regarding occupancy, but this is a constraint on the problem that will be tackled sometime later. For now, I will rely on Instruction-level parallelism.
The structure of the graph cannot be changed (it describes an actual system), thus it cannot be balanced (or only by changing the root of the tree, e.g. using (6) as the new root). Nonetheless, a typical tree should not be too unbalanced.
I currently use CUDA for GPGPU, so I am open to any CUDA-enabled template library that can solve this issue.
Node data is already in global memory and the result will be used by other CUDA kernels, so the objective is just to achieve this without making it a huge bottleneck.
There is no "cycle", i.e. branches cannot merge down the tree.
The structure of the tree is fixed and set in an initialization phase.
A single binary operation can be quite expensive (e.g. multiplication of polynomial matrices, i.e. each element is a polynomial of a given order).
In this case, what would be the "best" data structure (for the tree structure) and the best algorithms (for the inclusive scans/prefix sums) to solve this problem?
Probably a harebrained idea, but imagine that you insert nodes of 0 value into the tree, in such a way that you get a 2D matrix. For instance, there would be 3 zero value nodes below the 5 node in your example. Then use one thread to travel each level of the matrix horizontally. For the top-down prefix sum, offset the threads in such a way that each lower thread is delayed by the maximum number of branches the tree can have in that location. So, you get a "wave" with a slanted edge running over the matrix. The upper threads, being further along, calculate those nodes in time for them to be processed further by threads running further down. You would need the same number of threads as the tree is deep.
I think a parallel prefix scan may not be suitable for your case because:
a parallel prefix scan algorithm increases the total number of [op]: in your link on prefix sums, a 16-input parallel prefix scan requires 26 [op], while a sequential version only needs 15. The parallel algorithm performs better only under the assumption that there are enough hardware resources to run multiple [op] in parallel.
You could evaluate the cost of your [op] before trying the parallel prefix scan.
On the other hand, since the size of the tree is small, I think you could simply consider your tree as 4 (the number of leaf nodes) independent simple chains, and use concurrent kernel execution to improve the performance of these 4 prefix scan kernels:
0-1-2-3
0-4-5
0-6-7-8-9-10
0-6-7-8-11-12
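Sequentially, that decomposition looks like this (plain Python standing in for the four concurrently launched kernels, with [op] taken to be addition; note that work on shared prefixes like 0-6-7-8 is duplicated across chains):

```python
from itertools import accumulate
import operator

# The tree decomposed into its root-to-leaf chains, as listed above
# (node labels double as their values for this example):
chains = [
    [0, 1, 2, 3],
    [0, 4, 5],
    [0, 6, 7, 8, 9, 10],
    [0, 6, 7, 8, 11, 12],
]

# Top-down inclusive scan of each chain; each list would be one kernel
# launch on its own CUDA stream.
scans = [list(accumulate(c, operator.add)) for c in chains]

# e.g. node (12) accumulates (0)+(6)+(7)+(8)+(11)+(12):
assert scans[3][-1] == 0 + 6 + 7 + 8 + 11 + 12
```

The bottom-up case is the same with each chain reversed.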
I think that in the Kepler GK110 architecture you can invoke kernels recursively, which they call dynamic parallelism. So, if you need to compute the sum of the values at each node of the tree, dynamic parallelism would help. However, the depth of recursion might be a constraint.
My first impression is that you could organize the tree nodes in a 1 dimensional array, similar to what Eric suggested. And then you could do a Segmented Prefix Sum Scan (http://en.wikipedia.org/wiki/Segmented_scan) over that array.
Using your tree nodes as an example, the 1-dim array would look like:
0-1-2-3-0-4-5-0-6-7-8-9-10-0-6-7-8-11-12
And then you would have a parallel array of flags indicating where a new list begins (by list I mean a sequence beginning at the root and ending at a leaf node):
1-0-0-0-1-0-0-1-0-0-0-0- 0-1-0-0-0- 0- 0
For the bottom-up case, you could create a separate segment-flag array like so:
0-0-0-1-0-0-1-0-0-0-0-0- 1-0-0-0-0- 0- 1
and traverse it in reverse order using the same algorithm.
As for how to implement a Segmented Prefix Scan, I haven't implemented one myself but I found a couple of references that might be informative on how to do it: http://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/lectures/24_algorithms.pdf and http://www.cs.ucdavis.edu/research/tech-reports/2011/CSE-2011-4.pdf (see page 23)
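As a sequential reference for what a segmented inclusive scan computes under the flag convention above (GPU implementations reformulate this as an ordinary scan over (flag, value) pairs with a modified operator):

```python
def segmented_scan(values, flags, op):
    """Inclusive segmented scan: the running accumulation restarts
    wherever flags[i] == 1 (the start of a new list)."""
    out, acc = [], None
    for v, f in zip(values, flags):
        acc = v if f else op(acc, v)
        out.append(acc)
    return out

# First two segments of the 1-dim array from the answer above:
values = [0, 1, 2, 3, 0, 4, 5]
flags  = [1, 0, 0, 0, 1, 0, 0]
assert segmented_scan(values, flags, lambda a, b: a + b) == [0, 1, 3, 6, 0, 4, 9]
```

For the bottom-up case, the same function is run over the reversed arrays with the separate end-of-segment flag array described above.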

What invariant do RRB-trees maintain?

Relaxed Radix Balanced Trees (RRB-trees) are a generalization of immutable vectors (used in Clojure and Scala) that have 'effectively constant' indexing and update times. RRB-trees maintain efficient indexing and update but also allow efficient concatenation (log n).
The authors present the data structure in a way that I find hard to follow. I am not quite sure what the invariant is that each node maintains.
In section 2.5, they describe their algorithm. I think they are ensuring that indexing into the node will only ever require e extra steps of linear search after radix searching. I do not understand how they derived their formula for the extra steps, and I think perhaps I'm not sure what each of the variables mean (in particular "a total of p sub-tree branches").
Also, how does the RRB-tree concatenation algorithm work?
They do describe an invariant in section 2.4: "However, as mentioned earlier B-Trees nodes do not facilitate radix searching. Instead we chose the initial invariant of allowing the node sizes to range between m and m - 1. This defines a family of balanced trees starting with well known 2-3 trees, 3-4 trees and (for m=32) 31-32 trees. This invariant ensures balancing and achieves radix branch search in the majority of cases. Occasionally a few step linear search is needed after the radix search to find the correct branch. The extra steps required increase at the higher levels."
Looking at their formula, it looks like they have worked out the maximum and minimum possible number of values stored in a subtree. The difference between the two is the maximum possible difference between the maximum and minimum number of values underneath a point. If you divide this by the number of values underneath a slot, you have the maximum number of slots you could be off by when you work out which slot to look at to see if it contains the index you are searching for.
@mcdowella is correct: that's what they say about relaxed nodes. But if you're splitting and joining nodes, a range from m to m-1 means you will sometimes have to adjust up to m-1 (m-2?) nodes in order to add or remove a single element from a node. This seems horribly inefficient. I think they meant between m and (2m) - 1, because this allows nodes to be split into 2 when they get too big, or 2 nodes to be joined into one when they are too small, without ever needing to change a third node. So it's a typo that the "2" is missing from "2m" in the paper. Jean Niklas L'orange's master's thesis backs me up on this.
Furthermore, all strict nodes have the same length which must be a power of 2. The reason for this is an optimization in Rich Hickey's Clojure PersistentVector. Well, I think the important thing is to pack all strict nodes left (more on this later) so you don't have to guess which branch of the tree to descend. But being able to bit-shift and bit-mask instead of divide is a nice bonus. I didn't time the get() operation on a relaxed Scala Vector, but the relaxed Paguro vector is about 10x slower than the strict one. So it makes every effort to be as strict as possible, even producing 2 strict levels if you repeatedly insert at 0.
Their tree also has a uniform height: all leaf nodes are an equal distance from the root. I think it would still work if relaxed trees only had to be within, say, one level of one another, though I'm not sure what that would buy you.
Relaxed nodes can have strict children, but not vice-versa.
Strict nodes must be filled from the left (low-index) without gaps. Any non-full Strict nodes must be on the right-hand (high-index) edge of the tree. All Strict leaf nodes can always be full if you do appends in a focus or tail (more on that below).
You can see most of the invariants by searching for the debugValidate() methods in the Paguro implementation. That's not their paper, but it's mostly based on it. Actually, the "display" variables in the Scala implementation aren't mentioned in the paper either. If you're going to study this stuff, you probably want to start by taking a good look at the Clojure PersistentVector, because the RRB Tree has one inside it. The two differences between that and the RRB Tree are: 1. the RRB Tree allows "relaxed" nodes and 2. the RRB Tree may have a "focus" instead of a "tail." Both focus and tail are small buffers (maybe the same size as a strict leaf node), the difference being that the focus will probably be localized to whatever area of the vector was last inserted/appended to, while the tail is always at the end (PersistentVector can only be appended to, never inserted into). These 2 differences are what allow O(log n) arbitrary inserts and removals, plus O(log n) split() and join() operations.