Extending trie to higher number of leaves

Extending trie to higher number of leaves - algorithm

I have to make a dictionary using tries, the number of letters in the alphabet will increase from 26 to 120, and hence the numb er of leaf nodes will increase exponentially. What optimisations can I use so that my lookup, insertion and deletion time doesn't increase exponentially?
EDIT
Making the question clearer, sorry for the lack of details
I am using a multiway trie like radix tree and making some modifications to it. My question is if I know that the word size will increase (for sure) from 26 to 120, it will increase the depth of the tree. Is it possible to decrease the increase in depth by increasing the key to more than 64 bits (the register can gold maximum 64 bits)?

It's often better to use binary tries with path compression (Patricia tries) based on the binary representation of your keys. That way you get the benefits of the smallest possible alphabet, and you still only have 2 nodes (one leaf and one internal) per key.

Though there may be some optimizations otherwise but our lookup, insertion and deletion time will not increase exponentially. Increasing your alphabet set only means that your each trie node will be bigger now. The path to each word will still have the same length which is equal to the letters in the word.

Related

Base 3 or more search? [duplicate]

I recently heard about ternary search in which we divide an array into 3 parts and compare. Here there will be two comparisons but it reduces the array to n/3. Why don't people use this much?

Actually, people do use k-ary trees for arbitrary k.
This is, however, a tradeoff.
To find an element in a k-ary tree, you need around k*ln(N)/ln(k) operations (remember the change-of-base formula). The larger your k is, the more overall operations you need.
The logical extension of what you are saying is "why don't people use an N-ary tree for N data elements?". Which, of course, would be an array.

A ternary search will still give you the same asymptotic complexity O(log N) search time, and adds complexity to the implementation.
The same argument can be said for why you would not want a quad search or any other higher order.

Searching 1 billion (a US billion - 1,000,000,000) sorted items would take an average of about 15 compares with binary search and about 9 compares with a ternary search - not a huge advantage. And note that each 'ternary compare' might involve 2 actual comparisons.

Wow. The top voted answers miss the boat on this one, I think.
Your CPU doesn't support ternary logic as a single operation; it breaks ternary logic into several steps of binary logic. The most optimal code for the CPU is binary logic. If chips were common that supported ternary logic as a single operation, you'd be right.
B-Trees can have multiple branches at each node; a order-3 B-tree is ternary logic. Each step down the tree will take two comparisons instead of one, and this will probably cause it to be slower in CPU time.
B-Trees, however, are pretty common. If you assume that every node in the tree will be stored somewhere separately on disk, you're going to spend most of your time reading from disk... and the CPU won't be a bottleneck, but the disk will be. So you take a B-tree with 100,000 children per node, or whatever else will barely fit into one block of memory. B-trees with that kind of branching factor would rarely be more than three nodes high, and you'd only have three disk reads - three stops at a bottleneck - to search an enormous, enormous dataset.
Reviewing:
Ternary trees aren't supported by hardware, so they run less quickly.
B-tress with orders much, much, much higher than 3 are common for disk-optimization of large datasets; once you've gone past 2, go higher than 3.

The only way a ternary search can be faster than a binary search is if a 3-way partition determination can be done for less than about 1.55 times the cost of a 2-way comparison. If the items are stored in a sorted array, the 3-way determination will on average be 1.66 times as expensive as a 2-way determination. If information is stored in a tree, however, the cost to fetch information is high relative to the cost of actually comparing, and cache locality means the cost of randomly fetching a pair of related data is not much worse than the cost of fetching a single datum, a ternary or n-way tree may improve efficiency greatly.

What makes you think Ternary search should be faster?
Average number of comparisons:
in ternary search = ((1/3)*1 + (2/3)*2) * ln(n)/ln(3) ~ 1.517*ln(n)
in binary search = 1 * ln(n)/ln(2) ~ 1.443*ln(n).
Worst number of comparisons:
in ternary search = 2 * ln(n)/ln(3) ~ 1.820*ln(n)
in binary search = 1 * ln(n)/ln(2) ~ 1.443*ln(n).
So it looks like ternary search is worse.

Also, note that this sequence generalizes to linear search if we go on
Binary search
Ternary search
...
...
n-ary search ≡ linear search
So, in an n-ary search, we will have "one only COMPARE" which might take upto n actual comparisons.

"Terinary" (ternary?) search is more efficient in the best case, which would involve searching for the first element (or perhaps the last, depending on which comparison you do first). For elements farther from the end you're checking first, while two comparisons would narrow the array by 2/3 each time, the same two comparisons with binary search would narrow the search space by 3/4.
Add to that, binary search is simpler. You just compare and get one half or the other, rather than compare, if less than get the first third, else compare, if less than get the second third, else get the last third.

Ternary search can be effectively used on parallel architectures - FPGAs and ASICs. For example if internal FPGA memory required for search is less than half of the FPGA resource, you can make a duplicate memory block. This would allow to simultaneously access two different memory addresses and do all comparisons in a single clock cycle. This is one of the reasons why 100MHz FPGA can sometimes outperform the 4GHz CPU :)

Here's some random experimental evidence that I haven't vetted at all showing that it's slower than binary search.

Almost all textbooks and websites on binary search trees do not really talk about binary trees! They show you ternary search trees! True binary trees store data in their leaves not internal nodes (except for keys to navigate). Some call these leaf trees and make the distinction between node trees shown in textbooks:
J. Nievergelt, C.-K. Wong: Upper Bounds for the Total Path Length of Binary Trees,
Journal ACM 20 (1973) 1–6.
The following about this is from Peter Brass's book on data structures.
2.1 Two Models of Search Trees
In the outline just given, we supressed an important point that at first seems
trivial, but indeed it leads to two different models of search trees, either of
which can be combined with much of the following material, but one of which
is strongly preferable.
If we compare in each node the query key with the key contained in the
node and follow the left branch if the query key is smaller and the right branch
if the query key is larger, then what happens if they are equal? The two models
of search trees are as follows:
Take left branch if query key is smaller than node key; otherwise take the
right branch, until you reach a leaf of the tree. The keys in the interior node
of the tree are only for comparison; all the objects are in the leaves.
Take left branch if query key is smaller than node key; take the right branch
if the query key is larger than the node key; and take the object contained
in the node if they are equal.
This minor point has a number of consequences:
{ In model 1, the underlying tree is a binary tree, whereas in model 2, each
tree node is really a ternary node with a special middle neighbor.
{ In model 1, each interior node has a left and a right subtree (each possibly a
leaf node of the tree), whereas in model 2, we have to allow incomplete
nodes, where left or right subtree might be missing, and only the
comparison object and key are guaranteed to exist.
So the structure of a search tree of model 1 is more regular than that of a tree
of model 2; this is, at least for the implementation, a clear advantage.
{ In model 1, traversing an interior node requires only one comparison,
whereas in model 2, we need two comparisons to check the three
possibilities.
Indeed, trees of the same height in models 1 and 2 contain at most approximately
the same number of objects, but one needs twice as many comparisons in model
2 to reach the deepest objects of the tree. Of course, in model 2, there are also
some objects that are reached much earlier; the object in the root is found
with only two comparisons, but almost all objects are on or near the deepest
level.
Theorem. A tree of height h and model 1 contains at most 2^h objects.
A tree of height h and model 2 contains at most 2^h+1 − 1 objects.
This is easily seen because the tree of height h has as left and right subtrees a
tree of height at most h − 1 each, and in model 2 one additional object between
them.
{ In model 1, keys in interior nodes serve only for comparisons and may
reappear in the leaves for the identification of the objects. In model 2, each
key appears only once, together with its object.
It is even possible in model 1 that there are keys used for comparison that
do not belong to any object, for example, if the object has been deleted. By
conceptually separating these functions of comparison and identification, this
is not surprising, and in later structures we might even need to define artificial
tests not corresponding to any object, just to get a good division of the search
space. All keys used for comparison are necessarily distinct because in a model
1 tree, each interior node has nonempty left and right subtrees. So each key
occurs at most twice, once as comparison key and once as identification key in
the leaf.
Model 2 became the preferred textbook version because in most textbooks
the distinction between object and its key is not made: the key is the object.
Then it becomes unnatural to duplicate the key in the tree structure. But in
all real applications, the distinction between key and object is quite important.
One almost never wishes to keep track of just a set of numbers; the numbers
are normally associated with some further information, which is often much
larger than the key itself.

You may have heard ternary search being used in those riddles that involve weighing things on scales. Those scales can return 3 answers: left is lighter, both are the same, or left is heavier. So in a ternary search, it only takes 1 comparison.
However, computers use boolean logic, which only has 2 answers. To do the ternary search, you'd actually have to do 2 comparisons instead of 1.
I guess there are some cases where this is still faster as earlier posters mentioned, but you can see that ternary search isn't always better, and it's more confusing and less natural to implement on a computer.

Theoretically the minimum of k/ln(k) is achieved at e and since 3 is closer to e than 2 it requires less comparisons. You can check that 3/ln(3) = 2.73.. and 2/ln(2) = 2.88.. The reason why binary search could be faster is that the code for it will have less branches and will run faster on modern CPUs.

I have just posted a blog about the ternary search and I have shown some results. I have also provided some initial level implementations on my git repo I totally agree with every one about the theory part of the ternary search but why not give it a try? As per the implementation that part is easy enough if you have three years of coding experience.
I found that if you have huge data set and you need to search it many times ternary search has an advantage.
If you think you can do better with a ternary search go for it.

Although you get the same big-O complexity (ln n) in both search trees, the difference is in the constants. You have to do more comparisons for a ternary search tree at each level. So the difference boils down to k/ln(k) for a k-ary search tree. This has a minimum value at e=2.7 and k=2 provides the optimal result.

Fan out in B+ trees

How does fan out exactly affects split and merge in B+ trees?
If I have 1024 Bytes page and 8 byte key, and maybe 8 byte pointer, i can store around 64 keys in one page. Considering one page will be my node, so if i have a fan out of 80%, does it mean the split will happen after the node is 80% full like after 52 keys are inserted or only after the node overflows.
Same for merge, when do we merge the nodes if we have like 80% fan out, when the keys go less than half the size of node or 80% has something to do with it.

Splits and merges in B-trees of all kinds are usually driven by policies based on fullness criteria. It is best to think about fullness in terms of node space utilisation instead of key counts; fixed-size structures - where space utilisation is measured in terms of key counts and is thus equivalent to fanout - tend to occur only in academia and special contexts like in-memory B-trees on integers or hashes. In practice there are usually variable-size elements involved, beginning with variable-size keys that are subject to further size variation via things like prefix/suffix truncation and compression.
Splits almost invariably occur only when an update operation would result in an overflowed node. The difference between policies lies in how hard they try to shift keys to neighbouring nodes in order to avoid a split (looking at only one sibling or at both) and in how many keys they try to offload (one or several). Some locking strategies require preventive splitting/merging during initial descent, to guarantee that no splits or merges can occur on the way back up. In that case the decision must be made based on minimum/maximum possible key sizes instead of looking at the sizes of actual keys.
Some strategies only split when they have two full neighbouring nodes which they then split into three nodes, and they merge only if they have three neighbouring nodes that are on the verge of underflow (resulting in two full nodes). The net result is a high minimum utilisation of 2/3, with an average utilisation of 3/4 or higher. However, the increased complexity of the update algorithms is rarely worth the candle.
On the whole, the criteria can be summarised thus: split when a node threatens to overflow and offloading of keys to neighbours is not possible, merge when a node threatens to underflow and none of the neighbours can donate a key.

Is there only one correct answer in heapsort?

If starting with an empty heap representing a priority queue, where numbers have to be inserted in order and then represented as a binary tree, is there only one strict answer to that? I have tried different Java heap generators etc. and they are all giving me different answers.

If you mean the sorted representation, then it's obviously unique.
If you mean the binary tree representation, then yes it's unique too - a heap is a complete tree - binary tree in which every level, except possibly the last, is completely filled, and all nodes are as far left as possible.
After all heap operations, may they be insert, delete-max, build-heap, siftup, siftdown, the heap remains in a predictable state, we can know how it's binary tree representation will look like.
Can you give more details about how you got different answers?

What invariant do RRB-trees maintain?

Relaxed Radix Balanced Trees (RRB-trees) are a generalization of immutable vectors (used in Clojure and Scala) that have 'effectively constant' indexing and update times. RRB-trees maintain efficient indexing and update but also allow efficient concatenation (log n).
The authors present the data structure in a way that I find hard to follow. I am not quite sure what the invariant is that each node maintains.
In section 2.5, they describe their algorithm. I think they are ensuring that indexing into the node will only ever require e extra steps of linear search after radix searching. I do not understand how they derived their formula for the extra steps, and I think perhaps I'm not sure what each of the variables mean (in particular "a total of p sub-tree branches").
What's how does the RRB-tree concatenation algorithm work?

They do describe an invariant in section 2.4 "However, as mentioned earlier
B-Trees nodes do not facilitate radix searching. Instead we chose
the initial invariant of allowing the node sizes to range between m
and m - 1. This defines a family of balanced trees starting with
well known 2-3 trees, 3-4 trees and (for m=32) 31-32 trees. This
invariant ensures balancing and achieves radix branch search in the
majority of cases. Occasionally a few step linear search is needed
after the radix search to find the correct branch.
The extra steps required increase at the higher levels."
Looking at their formula, it looks like they have worked out the maximum and minimum possible number of values stored in a subtree. The difference between the two is the maximum possible difference between the maximum and minimum number of values underneath a point. If you divide this by the number of values underneath a slot, you have the maximum number of slots you could be off by when you work out which slot to look at to see if it contains the index you are searching for.

#mcdowella is correct that's what they say about relaxed nodes. But if you're splitting and joining nodes, a range from m to m-1 means you will sometimes have to adjust up to m-1 (m-2?) nodes in order to add or remove a single element from a node. This seems horribly inefficient. I think they meant between m and (2 m) - 1 because this allows nodes to be split into 2 when they get too big, or 2 nodes joined into one when they are too small without ever needing to change a third node. So it's a typo that the "2" is missing in "2 m" in the paper. Jean Niklas L’orange's masters thesis backs me up on this.
Furthermore, all strict nodes have the same length which must be a power of 2. The reason for this is an optimization in Rich Hickey's Clojure PersistentVector. Well, I think the important thing is to pack all strict nodes left (more on this later) so you don't have to guess which branch of the tree to descend. But being able to bit-shift and bit-mask instead of divide is a nice bonus. I didn't time the get() operation on a relaxed Scala Vector, but the relaxed Paguro vector is about 10x slower than the strict one. So it makes every effort to be as strict as possible, even producing 2 strict levels if you repeatedly insert at 0.
Their tree also has an even height - all leaf nodes are equal distance from the root. I think it would still work if relaxed trees had to be within, say, one level of one-another, though not sure what that would buy you.
Relaxed nodes can have strict children, but not vice-versa.
Strict nodes must be filled from the left (low-index) without gaps. Any non-full Strict nodes must be on the right-hand (high-index) edge of the tree. All Strict leaf nodes can always be full if you do appends in a focus or tail (more on that below).
You can see most of the invariants by searching for the debugValidate() methods in the Paguro implementation. That's not their paper, but it's mostly based on it. Actually, the "display" variables in the Scala implementation aren't mentioned in the paper either. If you're going to study this stuff, you probably want to start by taking a good look at the Clojure PersistentVector because the RRB Tree has one inside it. The two differences between that and the RRB Tree are 1. the RRB Tree allows "relaxed" nodes and 2. the RRB Tree may have a "focus" instead of a "tail." Both focus and tail are small buffers (maybe the same size as a strict leaf node), the difference being that the focus will probably be localized to whatever area of the vector was last inserted/appended to, while the tail is always at the end (PerSistentVector can only be appended to, never inserted into). These 2 differences are what allow O(log n) arbitrary inserts and removals, plus O(log n) split() and join() operations.

Is DFS possible without recursion and extra space

both recursion as well as vector based iterative process can be used for DFS tree(b-tree) traversal. In both the cases, we need extra space either in the stack while recursive calls or in a vector. Does any technique exists which doesn't need space or needs min space?

If you are searching a binary tree, you can represent each path taken as a pattern of bits. Accordingly, you can keep searching downwards from a root point, and keep track of your progress by twiddling the bits in a binary word. You'll need one bit for each level.
e.g.
A
/ \
B C
/ \ \
D E F
If we start searching at A, the search specified by ABD is 00 (left - left); the next is 01, corresponding to left-right, or ABE; 10 is A-C-No child; and 11 is ACF.
Notice that the least significant bit indicates the direction to turn at the end of the traversal (that is, it selects the leaf); if the bit sequence is read the other way, this method would specify a breadth-first traversal.
You keep revisiting the same nodes, as a consequence of storing less information.
Note that this amounts to a linear scan of your data, thus destroying the benefits of having a tree. If you are too space constrained to be able to afford a stack, I suggest that you use a pre-allocated vector. If you keep that sorted, you can do things like binary searches, which will be rather more efficient than this.

Needed space for traversing tree can be done at worst in log(N) space. If you have nodes whit single branch it can be done using O(1) space/memory performance. But log(N) performance is really small. If you imagine having 2^300 nodes (can not fit in this universe), you have space requirements for searching that is like 300 and can fit in few kB of RAM.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio