How to decide order of a B-tree - algorithm

B-trees are said to be particularly useful for huge amounts of data that cannot fit in main memory.
My question is then: how do we decide the order of a B-tree, i.e. how many keys to store in a node, or how many children a node should have?
Everywhere I look, people seem to use 4 or 5 keys per node. How does that solve the huge-data and disk-read problem?

Typically, you'd choose the order so that the resulting node is as large as possible while still fitting into the block device page size. If you're trying to build a B-tree for an on-disk database, you'd probably pick the order such that each node fits into a single disk page, thereby minimizing the number of disk reads and writes necessary to perform each operation. If you wanted to build an in-memory B-tree, you'd likely pick either the L2 or L3 cache line sizes as your target and try to fit as many keys as possible into a node without exceeding that size. In either case, you'd have to look up the specs to determine what size to use.
Of course, you could always just experiment and try to determine this empirically as well. :-)
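As a rough illustration of the page-size calculation above (the 4 KiB page, 8-byte keys, and 8-byte child pointers here are assumptions, not fixed values; substitute your hardware's specs):

```python
PAGE_SIZE = 4096      # assumption: typical disk page, in bytes
KEY_SIZE = 8          # assumption: e.g. a 64-bit integer key
POINTER_SIZE = 8      # assumption: a child block address

def max_order(page_size, key_size, pointer_size):
    """An order-m node holds m child pointers and m - 1 keys, so we need
    m * pointer_size + (m - 1) * key_size <= page_size; solve for m."""
    return (page_size + key_size) // (key_size + pointer_size)

print(max_order(PAGE_SIZE, KEY_SIZE, POINTER_SIZE))  # 256 children per 4 KiB page
```

With a realistic page size the order comes out in the hundreds, which is why the textbook examples with 4 or 5 keys per node are only for illustration.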
Hope this helps!

Related

Isn't the benefit of a B-Tree lost when it is saved in a file?

I was reading about B-Trees and it was interesting to learn that they are specifically built for storage in secondary memory. But I am a little puzzled by a few points:
If we save the B-Tree to secondary memory (via serialization in Java), isn't the advantage of the B-Tree lost? Once a node is serialized, we no longer have access to the references to its child nodes (as we do in primary memory). That means we have to read all the nodes one by one (since no child references are available). And if we have to read all the nodes, what is the advantage of the tree? In that case we are not using the tree search at all. Any thoughts?
When a B-Tree is used on disk, it is not read from a file, deserialized, modified, serialized, and written back.
A B-Tree on disk is a disk-based data structure consisting of blocks of data, and those blocks are read and written one block at a time. Typically:
Each node in the B-Tree is a block of data (bytes). Blocks have fixed sizes.
Blocks are addressed by their position in the file, if a file is used, or by their sector address if B-Tree blocks are mapped directly to disk sectors.
A "pointer to a child node" is just a number that is the node's block address.
Blocks are large. Typically large enough to have 1000 children or more. That's because reading a block is expensive, but the cost doesn't depend much on the block size. By keeping blocks big enough so that there are only 3 or 4 levels in the whole tree, we minimize the number of reads or writes required to access any specific item.
Caching is usually used so that most accesses only need to touch the lowest level of the tree on disk.
So to find an item in a B-Tree, you would read the root block (it will probably come out of cache), look through it to find the appropriate child block and read that (again probably out of cache), maybe do that again, finally read the appropriate leaf block and extract the data.
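The lookup described above can be sketched as follows; the block size and the in-memory "file" are illustrative assumptions, not a real implementation:

```python
import io

BLOCK_SIZE = 4096  # assumption: one node = one fixed-size block (disk page)

def read_node(f, block_no):
    """Fetch one B-Tree node. A 'pointer to a child' is just a block
    number, so following it means seeking to block_no * BLOCK_SIZE and
    reading one whole block."""
    f.seek(block_no * BLOCK_SIZE)
    return f.read(BLOCK_SIZE)

# Demo with an in-memory "disk" of two blocks; a real tree would use
# open(path, "rb") on the index file instead.
disk = io.BytesIO(b"\x00" * BLOCK_SIZE + b"leaf data".ljust(BLOCK_SIZE, b"\x00"))
node = read_node(disk, 1)   # follow a child pointer whose value is 1
```

Each step of the descent is one such block read, so a tree with 3 or 4 levels costs at most 3 or 4 reads per lookup, fewer with caching.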

compare B+tree implementation: storing internal nodes on disk

Is there any implementation where the internal nodes of a B+tree are also stored on disk? I am just wondering if anyone is aware of such an implementation, or sees a real advantage in doing it this way. Normally, one stores the leaf nodes on disk and develops the B+ tree in memory as needed.
But it is also possible to save the current state of the B+tree's internal nodes (by replacing the pointers with the numbers of the disk blocks they point to). I see there are other challenges, like keeping the internal nodes in memory in sync with the disk blocks, but the B+ tree may be implemented on NVRAM, say, or battery-backed DRAM, or kept in sync by some other method.
Just wondering if anyone has already implemented it this way, like Linux's bcache or another implementation?
cheers, cforfun!
All persistent B+Tree implementations I've ever seen - as opposed to pure 'transient' in-memory structures - store both node types on disk.
Not doing so would require scanning all the data (the external nodes, a.k.a. the 'sequence set') on every load in order to rebuild the index, something that is feasible only when you're dealing with piddling amounts of data or very special circumstances.
I've seen single-user implementations that sync the disk image only when the page manager ejects a dirty page and on program shutdown, which has the effect that often-used internal nodes, which are rarely replaced or ejected, can go without a sync to disk for a long time. This is somewhat justified by the fact that internal ('index') nodes can be rebuilt after a crash, so only the external ('data') nodes need the full fault-tolerant persistence treatment. The advantage of such schemes is that they eliminate wasted writes for nodes close to the root, whose update frequency is fairly high. Think of SSDs, for example.
One way of increasing disk efficiency for persisted in-memory structures is to persist only the log to disk, and to rebuild the whole tree from the log on each restart. One very successful Java package uses this approach to great advantage.
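A minimal sketch of the log-replay idea: only an append-only log is persisted, and on restart the whole in-memory structure is rebuilt from it. The JSON-lines format is an assumption for illustration, and a plain dict stands in for the B+ tree:

```python
import json, os, tempfile

def append_op(log_path, op, key, value=None):
    """Persist only the log: each update is appended as one JSON line."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"op": op, "key": key, "value": value}) + "\n")

def rebuild(log_path):
    """On restart, replay the log in order to reconstruct the in-memory
    structure (a dict stands in for the B+ tree here)."""
    index = {}
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["op"] == "put":
                index[rec["key"]] = rec["value"]
            else:  # "del"
                index.pop(rec["key"], None)
    return index

log = os.path.join(tempfile.mkdtemp(), "btree.log")
append_op(log, "put", "a", 1)
append_op(log, "put", "b", 2)
append_op(log, "del", "a")
```

A production system would also checkpoint and truncate the log periodically so the replay on restart stays bounded.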

KD-Tree on secondary memory?

I know some of the range-searching data structures, for example the kd-tree, the range tree, and the quadtree.
But all the implementations are in memory. How can I implement them in secondary memory with high-performance I/O efficiency?
Here are the conditions:
1): a static set of points in two dimensions.
2): queries only, no insert or delete.
3): adapted for secondary memory.
Thanks.
If you can fit the tree into memory during construction:
1. Build the kd-tree.
2. Bottom-up, collect as many points as possible that fit into a block of your hardware's block size.
3. Write the data to this block.
4. Repeat steps 2-3 recursively until you've written all the data to disk.
When querying, load a page from disk, process this part of the tree until you reach a reference to another page. Then load this page and continue there.
Alternatively, you can do the same top-down, but then you will likely need more disk space. In the above approach, only the root page can be near-empty.
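The bottom-up packing might be sketched like this, under stated assumptions: 2-D points, a made-up `POINTS_PER_BLOCK` budget (in practice derived from block size / point size), and in-memory lists standing in for disk pages:

```python
POINTS_PER_BLOCK = 64   # assumption: how many points fit in one disk block

def build(points, pages, depth=0):
    """Build a kd-tree top for a static 2-D point set. Any subtree small
    enough to fit in one block is written out as a page and replaced by a
    ('page', page_no) reference; the rest stays as in-memory split nodes."""
    if len(points) <= POINTS_PER_BLOCK:
        pages.append(points)             # stand-in for writing a disk block
        return ("page", len(pages) - 1)  # refer to it by page number
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    left = build(points[:mid], pages, depth + 1)
    right = build(points[mid:], pages, depth + 1)
    return ("node", axis, points[mid][axis], left, right)

def query(node, point, pages):
    """Descend the in-memory top of the tree; when a page reference is
    reached, load that block and return its points as candidates."""
    while node[0] == "node":
        _, axis, split, left, right = node
        node = left if point[axis] < split else right
    return pages[node[1]]
```

Each query touches only the blocks along one root-to-leaf path, which is the point of packing subtrees into hardware-sized pages.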

Is it good to create virtual machines (nodes) to get better performance with Cassandra?

I know Cassandra does well in a multi-node setup: the more nodes, the better the performance. If I have two dedicated servers with the same hardware, would it be good to create some virtual machines on both of them to have more nodes, or not?
For example, I have two dedicated servers with these specifications:
1TB hard drive
64 GB RAM
8-core CPU
then create 8 virtual machines (nodes) on each of them. Each VM has:
~150GB hard drive
8 GB RAM
a share of the 8-core CPU
So I have 16 nodes. Would these 16 nodes perform better than 2 nodes running directly on the two dedicated servers?
In other words, which side of this trade-off is better: more nodes with weaker hardware, or two stronger nodes?
I know it should be tested, but I want to know whether it is basically reasonable or not.
Adding new nodes always adds some overhead: they need to communicate with each other and sync their data. Therefore, the more nodes you add, the more you'd expect the overhead to grow. You'd add more nodes only in a situation where the existing number of nodes can't handle the input/output demands. Since in the situation you are describing you'd actually be writing to the same disk, adding more nodes would effectively slow down your cluster.
Imagine the situation: you have a server; it receives some data and then writes it to disk. Now imagine the same situation, where the disk is shared between two servers and they both write the same information to the same disk at almost the same time. The two servers also spend CPU cycles communicating with each other that the data has been written so they can sync up. I think this is sufficient information to explain why what you are considering is not a good idea if you can avoid it.
EDIT:
Of course, this is only the layman's-terms version. C* has a very nice architecture in which data is spread according to an algorithm across a certain range of nodes (not all of them), and when you query for a specific key, the algorithm can tell you where to find the data. That said, when you add and remove nodes, the new nodes have to tell the cluster that they want to share the burden, and as a result a recalculation of what is known as the 'token ring' takes place, at the end of which data may be shuffled around so it remains accessible in a predictable way.
You can take a look at this:
http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes-2
But in general, while there is indeed some overhead when nodes communicate with each other, the number of nodes will almost never dramatically impact your query speed, negatively or positively, if you are querying for a single key.
"I know it should be tested, but I want to know basically is it reasonable or not?"
Testing it will answer most of your assumptions.
The basic advantage of using Cassandra is availability. If you are planning to have just two dedicated servers, then there is a question mark over the availability of your data. In the worst case, you always have just two replicas of the data at any point in time.
My take is to go for a nicely split dedicated setup in small chunks. Everything boils down to your use case.
1. If you have a lot of data flowing in and you consider data as king (in such a case you need more replicas to handle failures), I would prefer a high-end distributed setup.
2. If you are looking at it the other way around (data is not your forte; your data is just another part of your setup), you can just go for the setup you have mentioned.
3. If you have a cost constraint and you are a start-up with a minimal amount of data that is important to you, set up the two nodes you have with a replication factor of 2 (SimpleStrategy) and a replication factor of 1 (NetworkTopologyStrategy).

Knn search for large data?

I'm interested in performing k-NN search on a large dataset.
There are some libraries, such as ANN and FLANN, but I'm interested in this question: how do you organize the search if you have a database that does not fit entirely into memory (RAM)?
I suppose it depends on how much bigger your index is than the available memory. Here are my first spontaneous ideas:
Supposing it was tens of times the size of the RAM, I would try to cluster my data using, for instance, hierarchical clustering trees (implemented in FLANN). I would modify the implementation of the trees so that they keep the branches in memory and save the leaves (the clusters) on the disk. Therefore, the appropriate cluster would have to be loaded each time. You could then try to optimize this in different ways.
If it was not that much bigger (let's say twice the size of the RAM), I would split the dataset into two parts and create one index for each. I would then find the nearest neighbor in each part and choose between the two results.
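The split-and-merge idea can be sketched like this; brute-force search stands in for the two real indexes (only one partition needs to be resident while it is queried), and the merge works because the true k nearest neighbors are guaranteed to be among the per-partition top-k candidates:

```python
import heapq, math

def knn(points, q, k):
    """Brute-force k-NN inside one partition; a real on-disk index
    (e.g. built with FLANN) would stand here instead."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

def knn_over_partitions(partitions, q, k):
    """Query each partition separately, then merge the per-partition
    candidates into the global top-k."""
    candidates = [p for part in partitions for p in knn(part, q, k)]
    return heapq.nsmallest(k, candidates, key=lambda p: math.dist(p, q))
```

The same merge step generalizes to any number of partitions, at the cost of one index pass per partition.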
It depends if your data is very high-dimensional or not. If it is relatively low-dimensional, you can use an existing on-disk R-Tree implementation, such as Spatialite.
If it is a higher dimensional data, you can use X-Trees, but I don't know of any on-disk implementations off the top of my head.
Alternatively, you can implement locality-sensitive hashing with on-disk persistence, for example using mmap.
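A minimal sketch of the random-hyperplane flavor of LSH; the buckets are in-memory here, but each bucket could instead be its own file or an mmap'd region so that a query only loads one bucket:

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=42):
    """n_bits random hyperplanes through the origin (Gaussian normals)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def signature(planes, v):
    """One bit per hyperplane: which side of the plane the vector lies on.
    Nearby vectors tend to get equal signatures, so each bucket holds
    likely neighbours that can then be scanned exhaustively."""
    return tuple(int(sum(a * b for a, b in zip(p, v)) >= 0) for p in planes)

planes = make_hyperplanes(dim=3, n_bits=8)
buckets = defaultdict(list)   # on disk: one file or mmap region per signature
for v in [(1, 2, 3), (2, 4, 6), (-1, -2, -3)]:
    buckets[signature(planes, v)].append(v)
```

In practice you would use several independent hash tables and check all the buckets the query falls into, since a single table misses some true neighbors.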
