Question: Is there a data structure for [key,value] entries that has very fast look-ups, with limited size, that can throw away entries based on how old they are?
This is needed for the following situation:
The program is running an optimization where nodes are evaluated.
Evaluation of the nodes is relatively expensive.
The optimization reaches the same nodes quite often.
The re-visiting of nodes has temporal locality (older nodes are less likely to be seen again).
Keys are small sets of values (1 to 10 integers); values are a single integer.
I want to remember the original evaluation of nodes to speed up the performance.
But not all visited nodes can be stored, as that would take too much memory.
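What is being described is essentially an LRU (least-recently-used) cache: a bounded-size map with fast lookups that evicts the entries touched longest ago. A minimal sketch in Python using `collections.OrderedDict` (the class name and capacity are illustrative, not from the question):

```python
from collections import OrderedDict

class LRUCache:
    """Size-bounded [key, value] map evicting the least recently used entry."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                      # cache miss: re-evaluate the node
        self._data.move_to_end(key)          # mark as recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # drop the oldest entry

# Keys are small sets of integers, so a hashable frozenset works well.
cache = LRUCache(capacity=2)
cache.put(frozenset({1, 2, 3}), 42)
cache.put(frozenset({4, 5}), 7)
cache.put(frozenset({6}), 9)                 # evicts {1, 2, 3}
```

Java's `LinkedHashMap` in access order with `removeEldestEntry`, or Python's `functools.lru_cache` applied directly to the evaluation function, give the same behaviour off the shelf.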
Related
I have multiple robots, which explore an occupancy grid through some algorithm. I am trying to save the order of explored nodes. But I am not sure, which data structure can be used to save them efficiently.
I first thought of a tree, but the order can contain repeats, like 1, 2, 5, 1, so I feel it may be too complex to store such an order in tree form. Then I thought of an array, but that can be too expensive in terms of memory for large grids.
I am a bit confused now. What data structure would be better (suppose the grid has 10,000 nodes)? The point is that the order of explored nodes will be longer than 10,000 in this case, as there will be overlap.
Thanks!
A tree makes little sense here, given the need to preserve insertion order and to allow duplicates. Basically, as I understand it, we want to store the path the robot has traveled in the tightest form we can.
A compact, contiguous kind of sequence (e.g., an array) ends up making the most sense here. It's cheaper than any linked structure (tree included) since there are no links to store.
There's little we can do to compact memory usage any further.
However, an unrolled list might be helpful here. Since it's not one giant contiguous block and instead a series of smaller blocks (ex: 4 kilobytes each) linked together, you can start, say, off-loading blocks at the front of the list to disk if you want to reduce memory use. The link overhead is trivial since we're only storing a link every N elements, where N could be some large number.
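A rough sketch of such an append-only unrolled list (the block size and names are hypothetical, and a real version would off-load full front blocks to disk rather than keep everything in memory):

```python
class Block:
    """Fixed-capacity chunk of the sequence; blocks are linked together."""
    def __init__(self, capacity):
        self.items = []
        self.capacity = capacity
        self.next = None

class UnrolledList:
    """Append-only unrolled list: one link per block of elements,
    so link overhead is negligible for large block sizes."""
    def __init__(self, block_size=1024):
        self.block_size = block_size
        self.head = self.tail = Block(block_size)
        self.length = 0

    def append(self, node_id):
        if len(self.tail.items) == self.block_size:
            new = Block(self.block_size)
            self.tail.next = new          # one pointer per block, not per item
            self.tail = new
        self.tail.items.append(node_id)
        self.length += 1

    def __iter__(self):
        block = self.head
        while block is not None:
            yield from block.items
            block = block.next

# Record an exploration order with repeats, e.g. 1, 2, 5, 1, ...
path = UnrolledList(block_size=4)
for n in [1, 2, 5, 1, 3, 8]:
    path.append(n)
```

Off-loading would then amount to writing `head`'s `items` to disk and advancing `head`, which is why the front-of-list blocks mentioned above are the natural eviction candidates.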
How exactly does fan-out affect splits and merges in B+ trees?
If I have a 1024-byte page, an 8-byte key, and maybe an 8-byte pointer, I can store around 64 keys in one page. Considering that one page will be my node, if I have a fan-out of 80%, does that mean a split happens once the node is 80% full (after around 51 keys are inserted), or only after the node overflows?
Same for merges: with an 80% fan-out, do we merge nodes when the key count drops below half the node's capacity, or does the 80% figure play a role there too?
Splits and merges in B-trees of all kinds are usually driven by policies based on fullness criteria. It is best to think about fullness in terms of node space utilisation instead of key counts; fixed-size structures - where space utilisation is measured in terms of key counts and is thus equivalent to fanout - tend to occur only in academia and special contexts like in-memory B-trees on integers or hashes. In practice there are usually variable-size elements involved, beginning with variable-size keys that are subject to further size variation via things like prefix/suffix truncation and compression.
Splits almost invariably occur only when an update operation would result in an overflowed node. The difference between policies lies in how hard they try to shift keys to neighbouring nodes in order to avoid a split (looking at only one sibling or at both) and in how many keys they try to offload (one or several). Some locking strategies require preventive splitting/merging during initial descent, to guarantee that no splits or merges can occur on the way back up. In that case the decision must be made based on minimum/maximum possible key sizes instead of looking at the sizes of actual keys.
Some strategies only split when they have two full neighbouring nodes which they then split into three nodes, and they merge only if they have three neighbouring nodes that are on the verge of underflow (resulting in two full nodes). The net result is a high minimum utilisation of 2/3, with an average utilisation of 3/4 or higher. However, the increased complexity of the update algorithms is rarely worth the candle.
On the whole, the criteria can be summarised thus: split when a node threatens to overflow and offloading of keys to neighbours is not possible, merge when a node threatens to underflow and none of the neighbours can donate a key.
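That summary can be sketched as a tiny decision function over byte counts rather than key counts (the thresholds and names here are illustrative, not from any particular B-tree implementation):

```python
def plan_insert(node_bytes, key_bytes, capacity, left_free, right_free):
    """Decide what to do when inserting an entry of key_bytes into a node
    currently holding node_bytes, given the free space in its two siblings.
    Mirrors the policy above: split only when no neighbour can absorb
    the overflow."""
    if node_bytes + key_bytes <= capacity:
        return "insert"                 # no overflow, just insert
    overflow = node_bytes + key_bytes - capacity
    if left_free >= overflow:
        return "shift-left"             # offload keys to the left sibling
    if right_free >= overflow:
        return "shift-right"            # offload keys to the right sibling
    return "split"

# Example: a 1 KiB page that is nearly full.
print(plan_insert(1000, 64, 1024, 0, 0))      # both siblings full -> split
print(plan_insert(1000, 64, 1024, 128, 0))    # left sibling has room -> shift
```

Merging would use the mirror-image test: when a deletion drops the node below the underflow threshold, first try to borrow from a sibling, and merge only when no neighbour can donate.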
I know this is a common question and I saw a few threads in Stack Overflow but still couldn't get it.
Here is an accepted answer from Stack overflow:
" Disk seeks are expensive. B-Tree structure is designed specifically to
avoid disk seeks as much as possible. Therefore B-Tree packs much more
keys/pointers into a single node than a binary tree. This property
makes the tree very flat. Usually most B-Trees are only 3 or 4 levels
deep and the root node can be easily cached. This requires only 2-3
seeks to find anything in the tree. Leaves are also "packed" this way,
so iterating a tree (e.g. full scan or range scan) is very efficient,
because you read hundreds/thousands data-rows per single block (seek).
In binary tree of the same capacity, you'd have several tens of levels
and sequential visiting every single value would require at least one
seek. "
I understand that a B-Tree holds more keys per node (has a higher order) than a BST, so it's definitely flatter and shallower than a BST.
But these nodes are again stored as linked lists, right?
I don't understand when they say that the keys are read as a block, thereby minimising the number of I/Os.
Doesn't the same argument hold for BSTs too, except that the links point downwards?
Could someone please explain it to me?
I understand that a B-Tree holds more keys per node (has a higher order) than a BST, so it's definitely flatter and shallower than a BST. I don't understand when they say that the keys are read as a block, thereby minimising the number of I/Os.
Doesn't the same argument hold for BSTs too, except that the links point downwards?
Basically, the idea behind using a B+tree in file systems is to reduce the number of disk reads. Imagine that all the blocks in a drive are stored as a sequentially allocated array. In order to search for a specific block you would have to do a linear scan and it would take O(n) every time to find a block. Right?
Now, imagine that you got smart and decided to use a BST, great! You would store all your blocks in a BST and that would take roughly O(log(n)) to find a block. Remember that every branch followed is a disk access, which is highly expensive!
But, we can do better! The problem now is that a BST is really "tall". Because every node only has a fanout (number of children) factor of 2, if we had to store N objects, our tree would be in the order of log(N) tall. So we would have to perform at most log(N) access to find our leaves.
The idea behind the B+tree structure is to increase the fanout factor (number of children), reducing the height of the tree and, thus, reducing the number of disk accesses that we have to make in order to find a leaf. Remember that every branch is a disk access. For instance, if you pack X keys into a node of a B+tree, every node will point to at most X+1 children.
Also, remember that a B+tree is structured in a way that only the leaves store the actual data. That way, you can pack more keys into the internal nodes in order to fill up one disk block that, for instance, stores one node of a B+tree. The more keys you pack into a node, the more children it will point to and the shorter your tree will be, thus reducing the number of disk accesses needed to find one leaf.
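A back-of-the-envelope calculation shows how much the fanout matters. Assuming (hypothetically) 100 million keys, a height of roughly ceil(log base F of N) levels, and one disk read per level:

```python
import math

def levels(n_keys, fanout):
    """Approximate number of levels (and thus disk reads) to reach a leaf
    when every node has `fanout` children: ceil(log base fanout of n)."""
    return math.ceil(math.log(n_keys, fanout))

n = 100_000_000
print(levels(n, 2))    # BST: 27 levels, i.e. ~27 disk reads per lookup
print(levels(n, 200))  # B+tree with fanout 200: only 4 levels
```

Since the root and the first level or two fit in cache, the B+tree lookup costs only a couple of actual seeks, which is the point the quoted answer is making.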
But these nodes are again stored as linked lists right?
Also, in a B+tree structure, sometimes the leaves are stored in a linked list fashion. Remember that only the leaves store the actual data. That way, with the linked list idea, when you have to perform a sequential access after finding one block you would do it faster than having to traverse the tree again in order to find the next block, right? The problem is that you still have to find the first block! And for that, the B+tree is way better than the linked list.
Imagine that if all the accesses were sequential and started in the first block of the disk, an array would be better than the linked list, because in a linked list you still have to deal with the pointers.
But, the majority of disk accesses, according to Tanenbaum, are not sequential and are accesses to files of small sizes (like 4KB or less). Imagine the time it would take if you had to traverse a linked list every time to access one block of 4KB...
This article explains it way better than me and uses pictures as well:
https://loveforprogramming.quora.com/Memory-locality-the-magic-of-B-Trees
A B-tree node is essentially a fixed-size array of {key, link} pairs that is read in one chunk, typically some number of disk blocks. The links all point downwards. At the bottom layer the links point to the associated records (assuming a B+-tree, as in any practical implementation).
I don't know where you got the linked list idea from.
Each node in a B-tree implemented in disk storage consists of a disk block (normally a handful of kilobytes) full of keys and "pointers" that are accessed as an array and not - as you said - a linked list. The block size is normally file-system dependent and chosen to use the file system's read and write operations efficiently. The pointers are not normal memory pointers, but rather disk addresses, again chosen to be easily used by the supporting file system.
The main reason for the B-tree is how it behaves under changes. If you have a permanent, static structure, a BST is OK, but in that case a hash table is even better. In the case of file systems, you want a structure that changes as little as possible as a whole on inserts or deletes, and in which you can perform a find operation with as few reads as possible; B-trees have these properties.
I'm studying B+trees for indexing and I try to understand more than just memorizing the structure. As far as I understand, the inner nodes of a B+tree form an index on the leaves, and the leaves contain pointers to where the data is stored on disk. Correct? Then how are lookups made? If a B+tree is so much better than a binary tree, why don't we use B+trees instead of binary trees everywhere?
I read the wikipedia article on B+ trees and I understand the structure but not how an actual lookup is performed. Could you guide me perhaps with some link to reading material?
What are some other uses of B+ trees besides database indexing?
I'm studying B+trees for indexing and I try to understand more than just memorizing the structure. As far as I understand, the inner nodes of a B+tree form an index on the leaves, and the leaves contain pointers to where the data is stored on disk. Correct?
No, the index is formed by the inner nodes (non-leaves). Depending on the implementation, the leaves may contain either key/value pairs or key/pointer-to-value pairs. For example, a database index uses the latter, unless it is an IOT (Index Organized Table), in which case the values are inlined in the leaves. This depends mainly on whether the value is insanely large relative to the key.
Then how are lookups made?
In the general case where the root node is not a leaf (it does happen, at first), the root node contains a sorted array of N keys and N+1 pointers. You binary search for the two keys S0 and S1 such that S0 <= K < S1 (where K is what you are looking for) and this gives you the pointer to the next node.
You repeat the process until you (finally) hit a leaf node, which contains a sorted list of key-value pairs, and make a last binary search pass on those.
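The descent just described can be sketched as follows (node layout simplified; `bisect` performs the binary search, and all names here are illustrative):

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted array of N keys
        self.children = children    # inner node: N+1 child pointers
        self.values = values        # leaf: values aligned with keys

    @property
    def is_leaf(self):
        return self.children is None

def lookup(root, key):
    node = root
    while not node.is_leaf:
        # Binary search: index of the first separator key strictly > key,
        # which picks the child with S0 <= key < S1.
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
    # Final binary search within the leaf itself.
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None                     # key absent

# A two-level toy tree: root separator 30, two leaves.
leaf1 = Node([10, 20], values=["a", "b"])
leaf2 = Node([30, 40], values=["c", "d"])
root = Node([30], children=[leaf1, leaf2])
```

`bisect_right` implements the `S0 <= K < S1` convention above: a key equal to a separator is sent to the right child, where the leaf level stores it.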
If a B+tree is so much better than a binary tree, why don't we use B+trees instead of binary trees everywhere?
Binary trees are simpler to implement. One tough cookie with B+Trees is sizing the number of key/pointer pairs in inner nodes and the number of key/value pairs in leaf nodes. Another tough cookie is deciding on the low and high watermarks that lead to merging two nodes or splitting one.
Binary trees also offer memory stability: an inserted element is never moved in memory. On the other hand, inserting an element into a B+Tree or removing one is likely to lead to elements being shuffled around.
B+Trees are tailored for the small-keys/large-values case. They also require that keys can be duplicated (hopefully cheaply).
Could you guide me perhaps with some link to reading material?
I hope the rough algorithm I explained helped out, otherwise feel free to ask in the comments.
What are some other uses of B+ trees besides database indexing?
In the same vein: file-system indexing also benefits.
The idea is always the same: a B+Tree is really great with small keys/large values and caching. The idea is to have all the keys (inner nodes) in your fast memory (CPU Cache >> RAM >> Disk), and the B+Tree achieves that for large collections by pushing the values down to the leaves. With all inner nodes in fast memory, you only need one slow memory access per search (to fetch the value).
B+ trees are better than binary trees, which is why all the DBMSs use them.
A lookup in a B+Tree takes O(log_F N), where the base F of the logarithm is the fan-out. The lookup is performed exactly like in a binary tree, but with a bigger fan-out and a lower height; that's why it is way better.
B+Trees are usually known for having the data in the leaves (though with an unclustered index, probably not), which means you don't have to make another jump to the disk to get the data: you just take it from the leaf.
B+Trees are used almost everywhere: operating systems use them, data warehouses (not so much there, but still), lots of applications.
B+Trees are perfect for range queries, and are used whenever you have unique values, like a primary key, or any field with high cardinality.
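A sketch of why the sibling links make range queries cheap: once the descent has found the leaf covering the lower bound, the scan simply walks the linked leaves instead of going back through the inner nodes (names and layout here are hypothetical):

```python
class Leaf:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values
        self.next = None            # sibling link used for sequential scans

def range_scan(first_leaf, lo, hi):
    """Collect all (key, value) pairs with lo <= key <= hi, starting from
    the leaf that the tree descent found for `lo` and following the
    sibling links; no further tree traversal is needed."""
    out = []
    leaf = first_leaf
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > hi:
                return out          # past the upper bound: done
            if k >= lo:
                out.append((k, v))
        leaf = leaf.next            # hop to the next leaf in key order
    return out

# Two linked leaves, as they would sit at the bottom of a B+Tree.
a = Leaf([10, 20], ["a", "b"])
b = Leaf([30, 40], ["c", "d"])
a.next = b
```

The cost of the whole range query is thus one descent plus one sequential read per leaf touched, which is exactly the "packed leaves" advantage quoted earlier.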
If you can get this book, http://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638, it's one of the best. It's basically the bible for any database guy.
I have explored the definitions of T-trees and B-/B+ trees. From papers on the web I understand that B-trees perform better in hierarchical memory, such as disk drives and cached memory.
What I can not understand is why T-trees were/are used even for flat memory?
They are advertised as a space-efficient alternative to AVL trees.
In the worst case, all leaf nodes of a T-tree contain just one element and all internal nodes contain the minimum number of elements allowed, which is close to full. This means that on average only about half of the allocated space is utilized. Unless I am mistaken, this is the same utilization as in the worst case of B-trees, when the nodes of a B-tree are half full.
Assuming that both trees store the keys locally in the nodes but use pointers to refer to the records, the only difference is that B-trees have to store pointers for each of the branches. This would generally cause up to 50% overhead (over T-trees), depending on the size of the keys. In fact, this is close to the overhead expected in AVL trees, assuming no parent pointer, records embedded in the nodes, and keys embedded in the records. Is this the expected efficiency gain that prevents us from using B-trees instead?
T-trees are usually implemented on top of AVL trees. AVL trees are more balanced than B-trees. Can this be connected with the application of T-trees?
I can give you a personal story that covers half of the answer, that is, why I wrote some Pascal code to program B+ trees some 18 years ago.
My target system was a PC with two disk drives; I had to store an index in non-volatile memory, and I wanted to understand better what I was learning at university. I was very dissatisfied with the performance of a commercial package, probably DBase III or some Fox product, I can't remember.
Anyhow, I needed these operations:
lookup
insertion
deletion
next item
previous item
maximum size of index was not known
so data had to reside on disk
each access to the support had high cost
reading a whole block cost the same as reading one byte
B+-trees made that small slow PC really fly through the data!
The leaves had two extra pointers so that they formed a doubly linked list, for sequential searches.
In reality, the difference lies in the system you use. As my tutor at university put it: whether your problem lies in memory shortage or in HDD shortage will determine which tree, and which implementation, you will use. Most probably it will be a B+ tree.
Because there are hundreds of implementations (for instance, with doubly linked and singly linked leaf lists you need to loop through), and because there are multiple ways to store and retrieve the index, the real pros and cons of any implementation will depend on those choices.