I'm reading an article about why we need B-Trees. It says that a B-Tree can reduce the number of disk I/Os, whereas other trees, such as a Red-Black tree, can't, and that the number of I/Os equals the height of the B-Tree.
Here is an example.
We are looking for the value 9. With the B-Tree, there are three I/Os, but with the binary tree, there may be four.
Now I'm confused. Why can the B-Tree guarantee at most three I/Os? In other words, what ensures that node 3 and node 7 are located in the same disk block? I thought the data structure of each B-Tree node might be an array, so that its keys are sequential, and sequential data is normally located in the same disk block (honestly, I'm not sure...), but it seems that a B-Tree node is a list, which means its keys are not sequential. As I understand it, accessing 3 and 7 could then also generate two I/Os. In that case, couldn't we say that accessing 9 may also need four I/Os?
On disk, every B-tree node is in a single contiguous block, and every node contains thousands of keys.
For each key there is a pointer to the corresponding node in the next level. On disk, this "pointer" is the address of the contiguous block that contains the target node.
So, for example, if there are 10^9 leaf-level keys, there could be 1,000,000 leaf-level nodes. On the parent level, there are 1,000,000 keys pointing to those nodes, distributed among 1,000 parent nodes. On the root level, there are 1,000 keys in a single node.
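To see why the height (and hence the I/O count) stays so small, here is a quick back-of-the-envelope check in Python; the 1,000-keys-per-node figure is just the illustrative assumption from above, not a universal constant:

```python
keys_per_node = 1000   # assumed: one node fills one contiguous disk block
total_keys = 10**9

# Count how many levels it takes before the fan-out covers all keys;
# each level costs exactly one disk read.
height, reachable = 0, 1
while reachable < total_keys:
    reachable *= keys_per_node
    height += 1
print(height)  # 3 -> root, internal level, leaf level
```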
I was working through a textbook and got stuck on this question:
"Consider a B+ Tree where each leaf block can contain a maximum of 3 records, each internal block can contain a maximum of 3 keys, all record blocks in the tree are fully occupied with 3 records each and the records have key values: 5,10,15,..., and there are 4 record blocks in the file"
Question: "Draw this tree in a single diagram"
So far I've added all the records at the leaf level: there are 4 blocks with 3 records each, so 12 values total, and my leaf level has all multiples of 5 from 5 to 60. I'm now stuck on what to add on the level above it (the internal block).
You have already done the right thing for the leaf level. Only one internal block is needed; it will have 4 pointers to those leaf blocks, and 3 keys. Those 3 keys are typically copies of the least keys of the blocks below it. No key of the first block is repeated in that internal block, only keys of the other blocks.
One way of illustrating this structure is like this:
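```
              +----[ 20 | 35 | 50 ]----+
             /       |         |        \
  [5|10|15]  [20|25|30]  [35|40|45]  [50|55|60]
```

(The internal keys 20, 35 and 50 are the least keys of the second, third and fourth leaf blocks, as described above.)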
Often the leaf blocks are linked together, in a singly or doubly linked list, although this is not a strict requirement for B+ trees. I have not depicted this above.
From the wiki
The maximum number of children each node can have depends on the type of heap, but in many types it is at most two, which is known as a binary heap.
I can't understand why, in many types of heap, a node has at most two children. Why are three or four children, and so on, not common? Thanks~
It's not true that most types of heap have at most two children per node, but it is true that the binary heap -- which does have at most two children per node -- is the most commonly implemented type. It's the most commonly implemented type because it is simple, cache-friendly, and memory-efficient.
The data structures used for binary heaps could be used with a different number of children per node. The common operations in an x-ary heap would still take O(log N) time, if we consider x to be constant. To decide on the best x, however, we have to let it vary, and in that case common operations take O(x * log N / log x) time.
To determine the most efficient number of children per node, we can choose x to minimize the factor x/log x.
If you plot that, you can see that the best number of children per node is actually 3 (the minimum is at x = e, but we require an integer):
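In place of the plot, a quick tabulation makes the same point:

```python
import math

# Cost factor x / ln(x) for an x-ary heap, at small integer x.
for x in range(2, 7):
    print(x, round(x / math.log(x), 3))

# 2 2.885
# 3 2.731   <- the integer minimum (the true minimum is at x = e ~ 2.718)
# 4 2.885
# 5 3.107
# 6 3.349
```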
... but the difference between 2 and 3 is not significant, and the code is simpler using 2 children per node, so that is the common practice.
Disclaimer: I really believe that this is not a duplicate of similar questions. I've read those, and they (mostly) recommend using a heap or a priority queue. My question is more of the "I don't understand how those would work in this case" kind.
In short:
I'm referring to the typical A* (A-star) pathfinding algorithm, as described (for example) on Wikipedia:
https://en.wikipedia.org/wiki/A*_search_algorithm
More specifically, I'm wondering about what's the best data structure (which can be a single well known data structure, or a combination of those) to use so that you never have O(n) performance on any of the operations that the algorithm requires to do on the open list.
As far as I understand (mostly from the Wikipedia article), the operations needed to be done on the open list are as follows:
The elements in this list need to be Node instances with the following properties:
position (or coordinates). For the sake of argument, let's say this is a positive integer ranging in value from 0 to 64516 (I'm limiting my A* area size to 254x254, which means that any set of coordinates can be bit-encoded on 16 bits)
F score. This is a positive floating-point value.
Given these, the operations are:
Add a node to the open list: if a node with the same position (coordinates) exists (but, potentially, with a different F score), replace it.
Retrieve (and remove) from the open list the node with the lowest F score
(Check if exists and) retrieve from the list a node for a given position (coordinates)
As far as I can see, the problem with using a Heap or Priority Queue for the open list are:
These data structures use the F score as the sorting criterion
As such, adding a node to this kind of data structure is problematic: how do you optimally check that a node with the same coordinates (but a different F score) doesn't already exist? Furthermore, even if you are somehow able to do this check and you actually find such a node, but it is not at the top of the heap/queue, how do you optimally remove it so that the heap/queue keeps its correct order?
Also, checking for existence and removing a node based on its position is not optimal, or even possible: if we use a priority queue, we have to check every node in it and remove the corresponding one if found. For a heap, if such a removal is necessary, I imagine that all remaining elements need to be extracted and re-inserted so that the heap still remains a heap.
The only remaining operation where such a data structure would be good is removing the node with the lowest F score. In this case the operation would be O(log(n)).
Also, if we make a custom data structure, such as one that uses a hashtable (with position as key) and a priority queue, we would still have some operations that require suboptimal processing on one of them: in order to keep them in sync (both should contain the same nodes), any given operation will always be suboptimal on one of the two structures. Adding or removing a node by position would be fast on the hashtable but slow on the priority queue; removing the node with the lowest F score would be fast on the priority queue but slow on the hashtable.
What I've done is make a custom hashtable for the nodes that uses their position as key and also keeps track of the node with the lowest F score. When adding a new node, it checks whether its F score is lower than that of the currently stored lowest-F-score node, and if so, replaces it. The problem with this data structure comes when you want to remove a node (whether by position or the one with the lowest F score): in order to update the field holding the current lowest-F-score node, I need to iterate through all the remaining nodes to find which one has the lowest F score now.
So my question is: is there a better way to store these?
You can combine the hash table and the heap without slow operations showing up.
Have the hash table map position to the index in the heap, instead of to the node.
Any update to the heap can sync itself to the hash table (this requires the heap to know about the hash table, so it is invasive, not just a wrapper around two off-the-shelf implementations) with as many hash-table updates, each O(1) obviously, as the number of items that move in the heap; only O(log n) items can move during an insertion, remove-min, or update-key. The hash table finds the node in the heap whose key must be updated for the parent-updating/G-changing step of A*, so that's fast too.
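As a concrete illustration, here is a minimal sketch of that combined structure in Python; the class and method names are made up for this answer, and a real open list would carry the whole node payload rather than just an (F score, position) pair:

```python
class OpenList:
    """Min-heap ordered by F score, plus a dict mapping position -> heap index."""

    def __init__(self):
        self.heap = []   # entries are (f_score, position) pairs
        self.pos = {}    # position -> index of that entry in self.heap

    def __contains__(self, position):
        return position in self.pos   # O(1) existence check by position

    def push_or_update(self, position, f_score):
        """Insert a node, or re-key it if this position is already present."""
        if position in self.pos:
            i = self.pos[position]
            old_f = self.heap[i][0]
            self.heap[i] = (f_score, position)
            # Restore the heap property in whichever direction is needed.
            if f_score < old_f:
                self._sift_up(i)
            else:
                self._sift_down(i)
        else:
            self.heap.append((f_score, position))
            self.pos[position] = len(self.heap) - 1
            self._sift_up(len(self.heap) - 1)

    def pop_min(self):
        """Remove and return the (f_score, position) pair with the lowest F score."""
        top = self.heap[0]
        del self.pos[top[1]]
        last = self.heap.pop()
        if self.heap:
            self.heap[0] = last
            self.pos[last[1]] = 0
            self._sift_down(0)
        return top

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        # The "sync" step: every move inside the heap updates the map, O(1) each.
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0 and self.heap[i][0] < self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.heap[c][0] < self.heap[smallest][0]:
                    smallest = c
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest
```

Because _swap updates the position map on every move, lookup and replacement by position, pop-min, and F-score updates all stay at O(1) or O(log n); there is no linear scan anywhere.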
How would one design a memory-efficient system which accepts items added to it and allows items to be retrieved for a given time interval (i.e. return items inserted between time T1 and time T2)? There is no DB involved; items are stored in memory. What data structure and associated algorithm would this use?
Updated:
Assume extremely high insertion rate compared to data query.
You can use a sorted data structure, where the key is the time of arrival. Note the following:
items are not removed
items are inserted in order [if item i was inserted after item j, then key(i) > key(j)].
For this reason, a tree is discouraged, since it is "overpowered", and insertion into it is O(log n), where you can get O(1) insertion. I suggest using one of the following:
(1) Array: the array is always filled at its end. When the allocated array is full, reallocate a bigger [double-sized] array and copy the existing one into it.
Advantages: good caching is usually expected with arrays; O(1) amortized insertion; used space is at most 2 * elementSize * #elements.
Disadvantages: high latency: when the array is full, it will take O(n) to add an element, so you need to expect that once in a while there will be a costly operation.
(2) Skip list: a skip list also allows O(log n) seek and O(1) insertion at the end, and it doesn't have the latency issue. However, it will suffer more from cache misses than an array. Space used is on average elementSize * #elements + 2 * pointerSize * #elements.
Advantages: O(1) insertion, no costly operations.
Disadvantages: bad caching is expected.
Suggestion:
I suggest using an array if latency is not an issue; if it is, you'd better use a skip list.
In both, finding the desired interval amounts to two O(log n) seeks plus a scan of the result; made concrete for the array case in Python:

```python
from bisect import bisect_left, bisect_right

def find_interval(data, t1, t2):
    # data: array of timestamps, kept sorted by construction (append-only)
    start = bisect_left(data, t1)   # index of the first element >= t1
    end = bisect_right(data, t2)    # one past the last element <= t2
    for element in data[start:end]:
        yield element
```
Either a B-tree or a binary search tree could be a good in-memory data structure for this. Just store the timestamp in each node and you can do a range query.
You can add them all to a simple array and sort them.
Do a binary search to locate both T1 and T2. All the array elements between them are what you are looking for.
This is helpful if the searching is done only after all the elements are added. If not, you can use an AVL or Red-Black tree.
How about a relational interval tree (encode your items as intervals containing only a single element, e.g., [a,a])? Although, as has been said already, the ratio of the anticipated operations matters (a lot, actually). But here are my two cents:
I suppose an item X that is inserted at time t(X) is associated with that timestamp, right? Meaning you don't insert an item now that has a timestamp from a week ago or something. If that's the case, go for the simple array and do interpolation search or something similar, as sketched below (your items will already be sorted by the attribute that your query refers to, i.e., the time t(X)).
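For completeness, a sketch of interpolation search over such a time-sorted array; this assumes integer timestamps, and the function name and signature are mine rather than from any library:

```python
def interpolation_search(arr, x):
    # arr: list of int timestamps, sorted ascending
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= x <= arr[hi]:
        if arr[lo] == arr[hi]:
            return lo if arr[lo] == x else -1
        # Estimate the position, assuming roughly evenly spaced timestamps.
        mid = lo + (x - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[mid] < x:
            lo = mid + 1
        elif arr[mid] > x:
            hi = mid - 1
        else:
            return mid
    return -1  # not found

# On roughly uniformly spaced timestamps this needs O(log log n) probes on average.
```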
We already have an answer that suggests trees, but I think we need to be more specific: the only situation in which this is really a good solution is if you are very specific about how you build up the tree (and then I would say it's on par with the skip lists suggested in a different answer). The objective is to keep the tree as full as possible to the left; I'll make clear what that means in the following. Make sure each node has a pointer to its (up to) two children and to its parent, and knows the depth of the subtree rooted at that node.
Keep a pointer to the root node so that you are able to do lookups in O(log(n)), and keep a pointer to the last inserted node N (which is necessarily the node with the highest key, since its timestamp is the highest). When you insert a node, check how many children N has:
If 0, then replace N with the new node you are inserting and make N its left child. (At this point you'll need to update the subtree-depth field of at most O(log(n)) nodes.)
If 1, then add the new node as its right child.
If 2, then things get interesting. Go up the tree from N until you either find a node that has only 1 child, or reach the root. If you find a node with only 1 child (this is necessarily a left child), add the new node as its new right child. If all nodes up to the root have two children, then the current tree is full: add the new node as the new root node, with the old root node as its left child. Don't change the old tree structure otherwise.
Addendum: in order to improve cache behaviour and memory overhead, the best solution is probably to make a tree or skip list of arrays. Instead of every node holding a single timestamp and a single value, make every node hold an array of, say, 1024 timestamps and values. When an array fills up, you add a new one to the top-level data structure, but in most steps you just append a single element to the end of the "current" array. This doesn't affect big-O behaviour with respect to either memory or time, but it reduces the overhead by a factor of 1024, while latency stays very small.
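A minimal sketch of that addendum, with a plain Python list standing in for the top-level tree or skip list (the class name and chunk size are illustrative; bisect's key parameter needs Python 3.10+):

```python
from bisect import bisect_left, bisect_right

CHUNK = 1024  # array size per node, as suggested above

class ChunkedLog:
    """Append-only store of (timestamp, value) pairs, timestamps non-decreasing."""

    def __init__(self):
        self.chunks = []   # each chunk is a list of (timestamp, value) pairs
        self.starts = []   # first timestamp of each chunk, kept in step

    def append(self, timestamp, value):
        if not self.chunks or len(self.chunks[-1]) == CHUNK:
            self.chunks.append([])            # rare: open a fresh chunk
            self.starts.append(timestamp)
        self.chunks[-1].append((timestamp, value))  # common: O(1) append

    def query(self, t1, t2):
        """Yield every (timestamp, value) with t1 <= timestamp <= t2."""
        # The chunk just before the first start > t1 may straddle t1;
        # chunks whose start exceeds t2 cannot contain hits.
        first = max(bisect_right(self.starts, t1) - 1, 0)
        last = bisect_right(self.starts, t2)
        for chunk in self.chunks[first:last]:
            lo = bisect_left(chunk, t1, key=lambda p: p[0])
            hi = bisect_right(chunk, t2, key=lambda p: p[0])
            yield from chunk[lo:hi]
```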
I was looking at the best & worst case scenarios for a B+Tree (http://en.wikipedia.org/wiki/B-tree#Best_case_and_worst_case_heights) but I don't know how to use this formula with the information I have.
Let's say I have a tree B with 1,000 records; what is the maximum (and minimum) number of levels B can have?
I can have as many or as few keys on each page as I like, and as many or as few pages.
Any ideas?
(In case you are wondering, this is not a homework question, but it will surely help me understand some stuff for hw.)
I don't have the math handy, but...
Basically, the primary factor in tree depth is the "fan-out" of each node in the tree.
Normally, in a simple binary tree, the fan-out is 2: two children for each node in the tree.
But a B+Tree typically has a much larger fan-out.
One factor that comes into play is the size of the node on disk.
For example, suppose you have a 4K page size with, say, 4000 bytes of free space (not including any pointers or other metadata related to the node), and say that a pointer to any other node in the tree is a 4-byte integer. If your B+Tree is in fact storing 4-byte integers, then the combined size (4 bytes of pointer information + 4 bytes of key information) = 8 bytes, and 4000 free bytes / 8 bytes = 500 possible children.
That gives you a fan-out of 500 for this contrived case.
So, with one page of index, i.e. the root node, or a height of 1 for the tree, you can reference 500 records. Add another level and you're at 500*500, so with 501 4K pages you can reference 250,000 rows.
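The same arithmetic in a few lines of Python (all sizes are the made-up numbers from the example above):

```python
page_bytes = 4000      # usable free space in one 4K node
entry_bytes = 4 + 4    # 4-byte key + 4-byte child pointer
fanout = page_bytes // entry_bytes
print(fanout)          # 500

# Records reachable by an index of height h:
for h in (1, 2, 3):
    print(h, fanout ** h)   # 500, 250000, 125000000
```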
Obviously, the larger the key size, or the smaller the page size of your node, the lower the fan-out the tree is capable of. If you allow variable-length keys in each node, then the fan-out can easily vary.
But hopefully you can see the gist of how this all works.
It depends on the arity of the tree; you have to define this value. If you say that each node can have 4 children and you have 1000 records, then the height is:
Best case: log_4(1000) ≈ 5
Worst case: log_{4/2}(1000) = log_2(1000) ≈ 10
The arity is m and the number of records is n.
The best and worst cases depend on the number of children each node can have. For the best case, we consider the case when each node has the maximum number of children (i.e. m for an m-ary tree), with each node having m-1 keys. So,
1st level(or root) has m-1 entries
2nd level has m*(m-1) entries (since the root has m children with m-1 keys each)
3rd level has m^2*(m-1) entries
....
Hth level has m^(H-1)*(m-1) entries
Thus, if H is the height of the tree, the total number of entries is n = m^H - 1,
which is equivalent to H = log_m(n+1).
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the best-case height will be equal to log_m(1000+1).
Similarly, for the worst case scenario:
Level 1 (the root) has at least 1 entry (and at least 2 children)
2nd level has at least 2*(d-1) entries (where d = ceil(m/2) is the minimum number of children each internal node, except the root, can have)
3rd level has 2d*(d-1) entries
...
Hth level has 2*d^(H-2)*(d-1) entries
Thus, if H is the height of the tree, the total number of entries is n = 2*d^(H-1) - 1 (summing the levels above), which is equivalent to H = 1 + log_d((n+1)/2).
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the worst-case height will be equal to 1 + log_d((1000+1)/2), as computed below.
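Putting both formulas into code (the function name is illustrative, and the worst-case bound is the corrected one derived above):

```python
import math

def height_bounds(n, m):
    """Best/worst-case height of a B-tree with n entries and at most m children."""
    d = math.ceil(m / 2)                  # minimum children per internal node
    best = math.log(n + 1, m)             # from n = m^H - 1
    worst = 1 + math.log((n + 1) / 2, d)  # from n = 2*d^(H-1) - 1
    return best, worst

best, worst = height_bounds(1000, 4)
print(math.ceil(best))   # 5  (~4.98), matching log_4(1000) ~ 5 above
print(math.ceil(worst))  # 10 (~9.97), matching log_2(1000) ~ 10 above
```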