How exactly does fan-out affect splits and merges in B+ trees?
If I have a 1024-byte page, an 8-byte key, and perhaps an 8-byte pointer, I can store around 64 keys in one page. Taking one page as my node, if I have a fan-out of 80%, does that mean a split happens once the node is 80% full, i.e. after roughly 51 keys have been inserted, or only when the node overflows?
The same goes for merges: with an 80% fan-out, do we merge nodes when the key count drops below half the node's capacity, or does the 80% have something to do with it?
Splits and merges in B-trees of all kinds are usually driven by policies based on fullness criteria. It is best to think about fullness in terms of node space utilisation instead of key counts; fixed-size structures - where space utilisation is measured in terms of key counts and is thus equivalent to fanout - tend to occur only in academia and special contexts like in-memory B-trees on integers or hashes. In practice there are usually variable-size elements involved, beginning with variable-size keys that are subject to further size variation via things like prefix/suffix truncation and compression.
Splits almost invariably occur only when an update operation would result in an overflowed node. The difference between policies lies in how hard they try to shift keys to neighbouring nodes in order to avoid a split (looking at only one sibling or at both) and in how many keys they try to offload (one or several). Some locking strategies require preventive splitting/merging during initial descent, to guarantee that no splits or merges can occur on the way back up. In that case the decision must be made based on minimum/maximum possible key sizes instead of looking at the sizes of actual keys.
Some strategies only split when they have two full neighbouring nodes which they then split into three nodes, and they merge only if they have three neighbouring nodes that are on the verge of underflow (resulting in two full nodes). The net result is a high minimum utilisation of 2/3, with an average utilisation of 3/4 or higher. However, the increased complexity of the update algorithms is rarely worth the candle.
On the whole, the criteria can be summarised thus: split when a node threatens to overflow and offloading of keys to neighbours is not possible, merge when a node threatens to underflow and none of the neighbours can donate a key.
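The summary above can be sketched for the fixed-size case from the question (1024-byte page, 8-byte key, 8-byte pointer). This is only an illustration under stated assumptions: the function names are hypothetical, page header overhead is ignored, and the half-full underflow threshold is the common textbook choice rather than something the answer prescribes.

```python
PAGE_SIZE = 1024                     # bytes per node, from the question
ENTRY_SIZE = 8 + 8                   # 8-byte key plus 8-byte pointer, from the question
CAPACITY = PAGE_SIZE // ENTRY_SIZE   # 64 entries, ignoring any page header (assumption)
MIN_FILL = CAPACITY // 2             # assumed underflow threshold: half full


def on_insert(node_count, left_count=None, right_count=None):
    """Decide how to handle an insert into a node holding node_count entries.

    left_count / right_count are the sibling sizes, or None if there is no sibling.
    """
    if node_count < CAPACITY:
        return "insert"              # no overflow, so no split, whatever the fill level
    for name, count in (("left", left_count), ("right", right_count)):
        if count is not None and count < CAPACITY:
            return f"shift to {name}"  # offload a key instead of splitting
    return "split"


def on_delete(node_count, left_count=None, right_count=None):
    """Decide how to handle a delete from a node holding node_count entries."""
    if node_count > MIN_FILL:
        return "delete"              # still at or above the minimum afterwards
    for name, count in (("left", left_count), ("right", right_count)):
        if count is not None and count > MIN_FILL:
            return f"borrow from {name}"  # a sibling can donate a key
    return "merge"
```

Under these assumptions a node holding 51 entries (about 80% full) simply accepts the insert; nothing happens until entry 65 would not fit.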
Related
Question: Is there a data structure for [key,value] entries that has very fast look-ups, with limited size, that can throw away entries based on how old they are?
This is needed for the following situation:
The program is running an optimization where nodes are evaluated.
Evaluation of the nodes is relatively expensive.
The optimization reaches the same nodes quite often.
Re-visits have temporal locality (older nodes are less likely to be seen again).
Keys are small sets of values (1 to 10 integers); each value is a single integer.
I want to remember the original evaluation of nodes to speed up the optimization.
But not all visited nodes can be stored, as that would take too much memory.
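One structure that seems to match this description is a size-bounded cache with least-recently-used eviction. The following is a minimal sketch, assuming keys are turned into a hashable `frozenset`; the class name `LRUCache` and its methods are hypothetical, and `collections.OrderedDict` is used only because it keeps entries in access order cheaply.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded key/value store that evicts the least recently used entry."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()   # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # a hit makes the entry "young" again
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._data)


# Usage: keys are small sets of integers, values are single integers.
cache = LRUCache(max_entries=2)
cache.put(frozenset({1, 2, 3}), 17)
cache.put(frozenset({4}), 5)
cache.get(frozenset({1, 2, 3}))      # touch the first entry
cache.put(frozenset({7, 8}), 9)      # evicts frozenset({4}), the oldest
```

Look-ups and insertions are hash-table operations, and memory stays bounded by `max_entries`. For caching the results of a pure function, the standard library's `functools.lru_cache` offers the same policy as a decorator.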
I want to implement a graph where nodes are items and outward edges are the popularity of the next item. (e.g. I've done this task, what are the most popular tasks performed after it?) My initial algorithm simply incremented a popularity relationship each time it was traversed, but this yields three problems:
First, there are potentially many (100,000+) items of up to 10-15 Unicode characters each, and I'd like to keep the total space as small as possible. Second, each relationship counter runs the risk of overflowing, and halving all the popularity values whenever one of them approaches the limit is time-consuming and destroys much of the difference in popularity between the other items.
(For example, assume a-4->b, a-5->c and a-255->d. With one byte, a-255->d will overflow at the next increment, but dividing the relationships by 2 gives a-2->b, a-2->c and a-127->d.)
Third, it makes it difficult for less popular edges to gain popularity.
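The halving example from the second problem can be reproduced in a few lines. This is only a sketch of the arithmetic described in the question; the function name `halve_all` and the edge labels are made up for illustration.

```python
def halve_all(weights):
    """Integer-halve every edge weight, as one would to avoid overflow."""
    return {edge: w // 2 for edge, w in weights.items()}


weights = {"a->b": 4, "a->c": 5, "a->d": 255}
halved = halve_all(weights)
# a->b and a->c both become 2, so the difference between them is lost,
# while a->d becomes 127.
```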
I've also thought about a queue-like structure where each transition enqueues the next item and dequeues the oldest one when the queue is full. However, this is too volatile if the queue is small, and it eats up a huge amount of space even if each queue holds only ten elements.
Any ideas on algorithms/data structures for approaching this problem?
I have multiple robots that explore an occupancy grid using some algorithm. I am trying to save the order in which nodes are explored, but I am not sure which data structure would store it efficiently.
I first thought of a tree, but the sequence can contain repeats, e.g. 1, 2, 5, 1, so I feel it may be too complex to store such a sequence in tree form. Then I thought of an array, but that may be too expensive in terms of memory for large grids.
I am a bit confused now. Which data structure would be better (suppose the grid has 10,000 nodes)? Note that the sequence of explored nodes will be longer than 10,000 entries in this case, because nodes are revisited.
Thanks!
A tree makes little sense here with a need to preserve insertion order and the need to allow duplicates. Basically, as I understand it, we want to store the path in which the robot has traveled in the tightest form we can.
A compact, contiguous sequence (e.g. an array) makes the most sense here. It's cheaper than any linked structure (tree included) since there are no links to store.
There's little we can do to compact memory usage any further.
However, an unrolled list might be helpful here. Since it's not one giant contiguous block but a series of smaller blocks (e.g. 4 kilobytes each) linked together, you can start, say, off-loading blocks at the front of the list to disk if you want to reduce memory use. The link overhead is trivial since we're only storing one link every N elements, where N can be some large number.
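A minimal sketch of such an unrolled list follows. The class and method names are hypothetical, and the blocks use `array("I")` (unsigned integers) on the assumption that grid node ids are small non-negative numbers.

```python
from array import array


class Block:
    """One fixed-capacity chunk of the path, plus a link to the next chunk."""

    def __init__(self):
        self.items = array("I")  # compact unsigned-int storage, no per-item links
        self.next = None


class UnrolledList:
    """Append-only sequence stored as a singly linked chain of blocks."""

    def __init__(self, block_capacity=1024):
        self.block_capacity = block_capacity
        self.head = self.tail = Block()
        self.block_count = 1

    def append(self, node_id):
        if len(self.tail.items) == self.block_capacity:
            new_block = Block()          # one link per block, not per element
            self.tail.next = new_block
            self.tail = new_block
            self.block_count += 1
        self.tail.items.append(node_id)

    def pop_front_block(self):
        """Detach and return the oldest full block, e.g. to write it to disk."""
        if self.head is self.tail:
            return None                  # keep the block that is still being filled
        block = self.head
        self.head = block.next
        self.block_count -= 1
        return block.items

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.items
            block = block.next


# Usage: record a path with repeats, then off-load the oldest block.
path = UnrolledList(block_capacity=3)
for node_id in [1, 2, 5, 1, 7, 2, 9]:
    path.append(node_id)
oldest = path.pop_front_block()
```

In a real program the block capacity would be much larger (the answer suggests blocks of a few kilobytes), so the per-block link is negligible.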
I read this on Wikipedia:
In B-trees, internal (non-leaf) nodes can have a variable number of
child nodes within some pre-defined range. When data is inserted or
removed from a node, its number of child nodes changes. In order to
maintain the pre-defined range, internal nodes may be joined or split.
Because a range of child nodes is permitted, B-trees do not need
re-balancing as frequently as other self-balancing search trees, but
may waste some space, since nodes are not entirely full.
We have to specify this range for B-trees. Even when I looked it up in CLRS (Introduction to Algorithms), it seemed to use arrays for keys and children. My question is: is there any way to reduce this wasted space by defining the keys and children as lists instead of predetermined arrays? Is this too much of a hassle?
Also, for the life of me I'm not able to find decent pseudocode for btreeDeleteNode. Any help here is appreciated too.
When you say "lists", do you mean linked lists?
An array of some kind of element takes up one element's worth of memory per slot, whether that slot is filled or not. A linked list only takes up memory for elements it actually contains, but for each one, it takes up one element's worth of memory, plus the size of one pointer (two if it's a doubly-linked list, unless you can use the xor trick to overlap them).
If you are storing pointers, and using a singly-linked list, then each list link is twice the size of each array slot. That means that unless the list is less than half full, a linked list will use more memory, not less.
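The break-even point can be checked with a little arithmetic. This sketch assumes 8-byte pointers and ignores allocator and per-object overhead; the function names are illustrative only.

```python
POINTER_SIZE = 8  # bytes; an assumption, typical of 64-bit systems


def array_bytes(capacity):
    """A preallocated array pays for every slot, filled or not."""
    return capacity * POINTER_SIZE


def singly_linked_bytes(used):
    """Each list link stores the element pointer plus a next pointer."""
    return used * 2 * POINTER_SIZE


capacity = 64
break_even = array_bytes(capacity) // (2 * POINTER_SIZE)
# With 64 slots the array costs 512 bytes; the linked list reaches the same
# cost at 32 elements (half full) and costs more beyond that.
```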
If you're using a language whose runtime has per-object overhead (like Java, and like C unless you are handling memory allocation yourself), then you will also have to pay for that overhead on each list link, but only once on an array, and the ratio is even worse.
I would suggest that your balancing algorithm should keep tree nodes at least half full. If you split a node when it is full, you will create two half-full nodes. You then need to merge adjacent nodes when they are less than half full. You can then use an array, safe in the knowledge that it is more efficient than a linked list.
No idea about the details of deletion, sorry!
A B-tree node has an important characteristic: all keys in the node are sorted. When looking for a specific key, binary search is used to find the right position. Using binary search within each node keeps the complexity of the B-tree search algorithm at O(log n).
If you replace the preallocated array with some kind of linked list, you lose random access to the keys, and with it the ability to binary-search them, unless you use a more complex data structure, such as a skip list, to keep the search at O(log n). But that is totally unnecessary; at that point a skip list on its own would be the better choice.
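Binary search over a node's sorted key array can be sketched with the standard `bisect` module. The function names are hypothetical, and the convention that a key equal to a separator descends to the right-hand child is an assumption (it is common in B+ trees, but implementations differ).

```python
from bisect import bisect_left, bisect_right


def find_in_leaf(keys, key):
    """Return the index of key in a sorted leaf key array, or None if absent."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return i
    return None


def child_index(separators, key):
    """Return which child of an internal node to descend into.

    Assumes n separators with n + 1 children, and that keys equal to a
    separator live in the child to its right.
    """
    return bisect_right(separators, key)
```

Both functions rely on indexing into the array, which is exactly what a plain linked list cannot offer in constant time.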
I have explored the definitions of T-trees and B-/B+ trees. From papers on the web I understand that B-trees perform better in hierarchical memory, such as disk drives and cached memory.
What I cannot understand is why T-trees were, and still are, used for flat memory.
They are advertised as space efficient alternative to AVL trees.
In the worst case, all leaf nodes of a T-tree contain just one element and all internal nodes contain the minimum amount allowed, which is close to full. This means that on average only half of the allocated space is utilized. Unless I am mistaken, this is the same utilization as the worst case of B-trees, when the nodes of a B-tree are half full.
Assuming that both trees store the keys locally in the nodes, but use pointers to refer to the records, the only difference is that B-trees have to store pointers for each of the branches. This would generally cause up to 50% overhead or less (over T-trees), depending on the size of the keys. In fact, this is close to the overhead expected in AVL trees, assuming no parent pointer, records embedded in the nodes, keys embedded in the records. Is this the expected efficiency gain that prevents us from using B-trees instead?
T-trees are usually implemented on top of AVL trees. AVL trees are more balanced than B-trees. Can this be connected with the application of T-trees?
I can give you a personal story that covers half of the answer, that is, why I wrote some Pascal code to program B+ trees some 18 years ago.
My target system was a PC with two disk drives. I had to store an index on non-volatile storage, and I wanted to understand better what I was learning at university. I was very dissatisfied with the performance of a commercial package, probably dBase III or some Fox product, I can't remember.
Anyhow, I needed these operations:
lookup
insertion
deletion
next item
previous item
maximum size of index was not known
so data had to reside on disk
each access to the storage medium had a high cost
reading a whole block cost the same as reading one byte
B+-trees made that small slow PC really fly through the data!
The leaves had two extra pointers, so they formed a doubly linked list for sequential searches.
In reality the difference lies in the system you use. As my tutor at university put it: whether your problem is a shortage of memory or a shortage of disk space will determine which tree, and which implementation of it, you use. Most probably it will be a B+ tree.
There are hundreds of implementations, for instance with bidirectional queues or with one-directional queues where you need to loop through the elements, and there are also multiple ways to store and retrieve the index. These details determine the real pros and cons of any implementation.