Compare B+ tree implementations: storing internal nodes on disk

Is there any implementation where the internal nodes of a B+ tree are also stored on disk? I am just wondering if anyone is aware of such an implementation, or sees a real advantage in doing it this way. Normally, one stores the leaf nodes on disk and builds the B+ tree as needed.
But it is also possible to save the current state of the B+ tree's internal nodes (by replacing the pointers with the numbers of the disk blocks they point to). I see there are other challenges, like keeping the internal nodes in memory in sync with the disk blocks, but the B+ tree may be implemented on NVRAM, say battery-backed DRAM, or kept in sync by some other method.
Just wondering if anyone has already implemented it this way, like Linux's bcache, or in another implementation?
cheers, cforfun!

All persistent B+Tree implementations I've ever seen - as opposed to pure 'transient' in-memory structures - store both node types on disk.
Not doing so would require scanning all the data (the external nodes, a.k.a. the 'sequence set') on every load in order to rebuild the index, something that is feasible only when you're dealing with piddling small amounts of data or under very special circumstances.
I've seen single-user implementations that sync the disk image only when the page manager ejects a dirty page and on program shutdown, which has the effect that often-used internal nodes - which are rarely replaced/ejected - can go without sync-to-disk for a long time. This is somewhat justified by the fact that internal ('index') nodes can be rebuilt after a crash, so that only the external ('data') nodes need the full fault-tolerant persistence treatment. The advantage of such schemes is that they eliminate the wasted writes for nodes close to the root whose update frequency is fairly high. Think SSDs, for example.
One way of increasing disk efficiency for persisted in-memory structures is to persist only the log to disk, and to rebuild the whole tree from the log on each restart. One very successful Java package uses this approach to great advantage.
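The log-replay idea can be sketched quickly. Below is a minimal illustration (not any particular package's API): mutations are appended to a plain-text log, and the in-memory index, here a `TreeMap` standing in for the B+ tree, is rebuilt by replaying the log on startup.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Minimal sketch of log-based persistence: instead of syncing tree nodes
// to disk, append each mutation to a log and rebuild the in-memory index
// on restart. TreeMap stands in for the B+ tree; a real implementation
// would also compact the log periodically.
public class LogRebuild {
    static void appendPut(Path log, String key, String value) throws IOException {
        Files.write(log, List.of("PUT\t" + key + "\t" + value),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    static NavigableMap<String, String> rebuild(Path log) throws IOException {
        NavigableMap<String, String> index = new TreeMap<>();
        if (Files.exists(log)) {
            for (String line : Files.readAllLines(log)) {
                String[] parts = line.split("\t");
                if (parts[0].equals("PUT")) index.put(parts[1], parts[2]);
            }
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("btree", ".log");
        appendPut(log, "a", "1");
        appendPut(log, "b", "2");
        System.out.println(rebuild(log)); // {a=1, b=2}
    }
}
```

The trade-off is plain: writes become cheap sequential appends, at the cost of a longer startup while the log is replayed.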

Related

What is locality in Graph Matching problem and Distributed models?

I'm a beginner in the field of Graph Matching and Parallel Computing. I read a paper that talks about an efficient parallel matching algorithm. They explained the importance of locality, but I don't know what it represents. And what are good and bad locality?
Our distributed memory parallelization (using MPI) on p processing elements (PEs or MPI processes) assigns nodes to PEs and stores all edges incident to a node locally. This can be done in a load balanced way if no node has degree exceeding m/p. The second pass of the basic algorithm from Section 2 has to exchange information on candidate edges that cross a PE boundary. In the worst case, this can involve all edges handled by a PE, i.e., we can expect better performance if we manage to keep most edges locally. In our experiments, one PE owns nodes whose numbers are a consecutive range of the input numbers. Thus, depending on how much locality the input numbering contains we have a highly local or a highly non-local situation.
Generally speaking, locality in distributed models is basically the extent to which a global solution for a computational problem can be obtained from locally available data.
Good locality is when most nodes can construct solutions using local data, since they'll require less communication to get any missing data. Bad locality would be if a node spends more time than desirable fetching data, rather than finding a solution using local data.
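As a concrete illustration of the partitioning the quoted paper describes (the names and layout below are my own, not from the paper): node numbers are split into consecutive blocks, one block per PE, and an edge is "local" exactly when both endpoints fall in the same block.

```java
// Sketch of block partitioning: node i is owned by PE i / ceil(n/p), so a
// consecutive range of node numbers lands on each PE. An edge (u, v) is
// local iff both endpoints map to the same PE; non-local edges require
// communication across a PE boundary.
public class Partition {
    static int owner(int node, int n, int p) {
        int block = (n + p - 1) / p;   // ceil(n / p) nodes per PE
        return node / block;
    }

    static boolean isLocalEdge(int u, int v, int n, int p) {
        return owner(u, n, p) == owner(v, n, p);
    }

    public static void main(String[] args) {
        // 16 nodes over 4 PEs: PE 0 owns 0..3, PE 1 owns 4..7, and so on.
        System.out.println(owner(5, 16, 4));          // 1
        System.out.println(isLocalEdge(4, 7, 16, 4)); // true
        System.out.println(isLocalEdge(3, 4, 16, 4)); // false: crosses a boundary
    }
}
```

With this scheme, an input numbering where neighbours tend to have nearby numbers yields mostly local edges (good locality); a random numbering scatters edges across PE boundaries (bad locality).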
Think of a simple distributed computer system which comprises a collection of computers each somewhat like a desktop PC, in as much as each one has a CPU and some RAM. (These are the nodes mentioned in the question.) They are assembled into a distributed system by plugging them all into the same network.
Each CPU has memory-bus access (very fast) to data stored in its local RAM. The same CPU's access to data in the RAM on another computer in the system will run across the network (much slower) and may require co-operation with the CPU on that other computer.
Locality is a property of the data used in the algorithm: local data is on the same computer as the CPU; non-local data is elsewhere in the distributed system. I trust it is clear that parallel computations can proceed more quickly the more each CPU has to work only with local data. So the designers of parallel programs for distributed systems pay great attention to the placement of data, often seeking to minimise the number and sizes of exchanges of data between processing elements.
Complication, unnecessary for understanding the key issues: of course on real distributed systems many of the individual CPUs are multi-core, and in some designs multiple multi-core CPUs will share the same enclosure and have approximately memory-bus-speed access to all the RAM in the same enclosure. Which makes for a node which itself is a shared-memory computer. But that's just detail and a topic for another answer.

How to decide order of a B-tree

B-trees are said to be particularly useful in the case of huge amounts of data that cannot fit in main memory.
My question is then: how do we decide the order of a B-tree, or how many keys to store in a node? Or how many children a node should have?
I see that everywhere people are using 4 or 5 keys per node. How does that solve the huge-data and disk-read problem?
Typically, you'd choose the order so that the resulting node is as large as possible while still fitting into the block device page size. If you're trying to build a B-tree for an on-disk database, you'd probably pick the order such that each node fits into a single disk page, thereby minimizing the number of disk reads and writes necessary to perform each operation. If you wanted to build an in-memory B-tree, you'd likely pick either the L2 or L3 cache line sizes as your target and try to fit as many keys as possible into a node without exceeding that size. In either case, you'd have to look up the specs to determine what size to use.
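As a rough sketch, the arithmetic looks like this (the 4 KiB page and the 8-byte key and pointer sizes are example assumptions; look up your own device's specs):

```java
// Back-of-the-envelope calculation for the largest B-tree order that fits
// one node into a single disk page. An internal node of order m holds
// (m - 1) keys and m child pointers, so we solve
//   (m - 1) * keySize + m * ptrSize <= pageSize
// for the largest integer m.
public class BTreeOrder {
    static int maxOrder(int pageSize, int keySize, int ptrSize) {
        return (pageSize + keySize) / (keySize + ptrSize);
    }

    public static void main(String[] args) {
        // Example: 4 KiB page, 8-byte keys, 8-byte child pointers.
        System.out.println(maxOrder(4096, 8, 8)); // 256
    }
}
```

So with plausible disk parameters the order lands in the hundreds, not 4 or 5; the tiny orders you see everywhere are usually textbook examples, not production choices.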
Of course, you could always just experiment and try to determine this empirically as well. :-)
Hope this helps!

Is Hadoop a good candidate for use as a key-value store?

Question
Would Hadoop be a good candidate for the following use case:
Simple key-value store (primarily needs to GET and SET by key)
Very small "rows" (32-byte key-value pairs)
Heavy deletes
Heavy writes
On the order of a 100 million to 1 billion key-value pairs
Majority of data can be contained on SSDs (solid state drives) instead of in RAM.
More info
The reason I ask is that I keep seeing references to the Hadoop file system and how Hadoop is used as the foundation for a lot of other database implementations that aren't necessarily designed for Map-Reduce.
Currently, we are storing this data in Redis. Redis performs great, but since it keeps all of its data in RAM, we have to use expensive machines with upwards of 128 GB of RAM. It would be nice to instead use a system that relies on SSDs. That way we would have the freedom to build much bigger hash tables.
We have also stored this data using Cassandra, but Cassandra tends to "break" if the deletes become too heavy.
Hadoop (contrary to popular media opinion) is not a database. What you describe is a database. Thus Hadoop is not a good candidate for you. Also, the post below is opinionated, so feel free to prove me wrong with benchmarks.
If you care about "NoSql DB's" that are on top of Hadoop:
HBase would be suited for heavy writes, but sucks on huge deletes
Cassandra same story, but writes are not as fast as in HBase
Accumulo might be useful for very frequent updates, but will suck on deletes as well
None of them makes "real" use of SSDs; I don't think any of them gets a huge speedup from them.
All of them suffer from costly compactions if you start to fragment your tablets (in BigTable parlance), so deleting is a fairly obvious limiting factor.
What you can do to mitigate the deletion issue is to just overwrite with a constant "deleted" value, which works around the compaction. However, this grows your table, which can be costly on SSDs as well. You will also need to filter on reads, which likely affects read latency.
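A minimal sketch of that soft-delete workaround, with an in-memory map standing in for the real store and a made-up sentinel value:

```java
import java.util.*;

// Sketch of the soft-delete workaround: instead of issuing real deletes
// (which create tombstones and trigger compaction), overwrite the value
// with a sentinel and filter it out on reads. The sentinel and the map
// are illustrative stand-ins for the real store.
public class SoftDelete {
    static final String DELETED = "\u0000DELETED"; // made-up sentinel value
    final Map<String, String> store = new HashMap<>();

    void put(String key, String value) { store.put(key, value); }

    // "Delete" is really an overwrite; no tombstone is created.
    void delete(String key) { store.put(key, DELETED); }

    // Reads must filter the sentinel, which adds a little latency.
    Optional<String> get(String key) {
        String v = store.get(key);
        return (v == null || v.equals(DELETED)) ? Optional.empty() : Optional.of(v);
    }

    public static void main(String[] args) {
        SoftDelete db = new SoftDelete();
        db.put("k", "v");
        db.delete("k");
        System.out.println(db.get("k").isPresent()); // false
    }
}
```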
From what you describe, Amazon's DynamoDB architecture sounds like the best candidate here, although deletes are also costly there, maybe not as much as in the above alternatives.
BTW: the recommended way of deleting lots of rows from the tables in any of the above databases is to just completely delete the table. If you can fit your design into this paradigm, any of those will do.
Although this isn't an answer to your question, in the context of what you say about
It would be nice to instead use a system that relies on SSDs. This way
we would have the freedom to build much bigger hash tables.
you might consider taking a look at Project Voldemort.
Speaking specifically as a Cassandra user, I know what you mean: it's the compaction and the tombstones that are a problem. I have myself run into TombstoneOverwhelmingException a couple of times and hit dead ends.
You might want to have a look at this article by LinkedIn.
It says:
Memcached is all in memory so you need to squeeze all your data into
memory to be able to serve it (which can be an expensive proposition
if the generated data set is large).
And finally
all we do is just mmap the entire data set into the process address
space and access it there. This provides the lowest overhead caching
possible, and makes use of the very efficient lookup structures in the
operating system.
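That mmap approach can be sketched in Java with `FileChannel.map`; the file layout below (fixed 32-byte records) is a made-up example, not Voldemort's actual format:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// Sketch of the mmap idea from the quoted article: map a data file into
// the process address space and read records through the mapping. There
// is no read() syscall per access; the OS page cache does the caching.
public class MmapRead {
    static final int RECORD_SIZE = 32; // assumed fixed-size records

    static byte[] record(MappedByteBuffer map, int index) {
        byte[] out = new byte[RECORD_SIZE];
        map.position(index * RECORD_SIZE); // plain pointer arithmetic, no I/O call
        map.get(out);
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("kv", ".dat");
        byte[] data = new byte[RECORD_SIZE * 2];
        data[RECORD_SIZE] = 42;            // first byte of record 1
        Files.write(file, data);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(record(map, 1)[0]); // 42
        }
    }
}
```

The point of the article is exactly this: once the file is mapped, "caching" is free because the kernel keeps hot pages in RAM and evicts cold ones, so the dataset can exceed RAM without any cache code of your own.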
I don't know if this fits your case, but you can consider evaluating Voldemort. Best of luck!

Is it good to create virtual machines (nodes) to get better performance on Cassandra?

I know Cassandra is good in a multiple-node setup: the more nodes, the better the performance. If I have two dedicated servers with the same hardware, would it be good to create some virtual machines on both of them to have more nodes, or not?
For example I have two dedicated server with this specifications:
1TB hard drive
64 GB RAM
8 core CPU
then create 8 virtual machines (nodes) on each of them. Each VM has:
~150GB hard drive
8 GB RAM
a share of the 8-core CPU
So I have 16 nodes. Would these 16 nodes perform better than 2 nodes on these two dedicated servers?
In other words, which side of this trade-off is better: more nodes with weaker hardware, or two stronger nodes?
I know it should be tested, but I want to know basically is it reasonable or not?
Adding new nodes always adds some overhead: they need to communicate with each other and sync their data. Therefore, you'd expect the overhead to increase with each node you add. You'd add more nodes only in a situation where the existing number of nodes can't handle the input/output demands. Since in the situation you are describing you'd actually be writing to the same disk, you'd effectively be slowing down your cluster by adding more nodes.
Imagine the situation: you have a server, it receives some data and then writes it to disk. Now imagine the same situation, where the disk is shared between two servers and they both write the same information at almost the same time to the same disk. The two servers also use CPU cycles to communicate with each other that the data has been written so they can sync up. I think this is enough information to explain why what you are thinking of is not a good idea if you can avoid it.
EDIT:
Of course, this is the information only in layman's terms, C* has a very nice architecture in which data is actually spread according to an algorithm to a certain range of nodes (not all of them) and when you are querying for a specific key, the algorithm actually can tell you where to find the data. With that said, when you add and remove nodes, the new nodes have to communicate with the cluster that they want to share 'the burden' and as a result, a recalculation of what is known as a 'token-ring' takes place at the end of which data may be shuffled around so it is accessible in a predictable way.
You can take a look at this:
http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes-2
But in general, there is indeed some overhead when nodes communicate with each other, though the number of nodes would almost never dramatically impact your query speed, negatively or positively, if you are querying for a single key.
"I know it should be tested, but I want to know basically is it reasonable or not?"
Testing will answer most of your assumptions.
The basic advantage of using Cassandra is availability. If you are planning to have just two dedicated servers, then there is a question mark over the availability of your data: in the worst case, you always have just two replicas of the data at any point in time.
My take is to go for a nicely split dedicated set up in small chunks. Everything boils down to your use case.
1. If you have a lot of data flowing in and you consider data as king (in such a case, you need more replicas to handle failures), I would prefer a high-end distributed set up.
2. If you are looking at it the other way around (data is not your forte and is just another part of your set up), you could just go for the set up you have mentioned.
3. If you have a cost constraint, and you are a start up with minimal data that is important to you, set up the two nodes you have with replication of 2 (SimpleStrategy) and replication of 1 (NetworkTopology).

Maintain "links" when writing a B+ tree to disk?

I have implemented a B+ tree in Java, but as usual, it is completely in main memory. How can I store a B+ tree on disk? Each node of the B-tree contains pointers (main-memory addresses or references to objects) to its children; how can I achieve something similar while the B-tree resides on disk? What replaces main-memory addresses in B+ tree nodes when the B+ tree is on disk?
There is already a similar question posted here:
B+Tree on-disk implementation in Java
But I don't fully understand the answer.
Please share your views.
Take a look at the code of JDBM3 on GitHub. This project persists B+ trees and similar data structures to disk, and you will definitely find your answer there.
In the most simplistic form: you will have to keep track of the file offset (the number of bytes from the start of the file) the current node is read from or written to. So instead of memory addresses, in file-based DBs you save offsets.
Then, when a node is read from the file, you can decide to "cache" it in memory and save the memory address for the given node, or operate only on offsets.
With this said, I have to add that usually a file-based DB is more complicated than this, optimising disk access by writing nodes to pages (which are usually the same size as the pages on the disk). This way, you can read more than one node with one disk seek (regarded as an expensive operation).
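A bare-bones sketch of the offset idea (the page layout, one key plus two child offsets, is invented for illustration):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of replacing in-memory references with file offsets: each node
// occupies a fixed-size page, and the child fields store the byte offset
// of the child's page instead of an object reference.
public class OffsetNodes {
    static final int PAGE_SIZE = 4096;

    static long writeNode(RandomAccessFile f, long key, long left, long right)
            throws IOException {
        long offset = f.length();        // append the node at the end of the file
        f.seek(offset);
        f.writeLong(key);
        f.writeLong(left);               // child offset; -1 means "no child"
        f.writeLong(right);
        f.setLength(offset + PAGE_SIZE); // pad the node out to a full page
        return offset;                   // this offset replaces the memory address
    }

    static long[] readNode(RandomAccessFile f, long offset) throws IOException {
        f.seek(offset);
        return new long[] { f.readLong(), f.readLong(), f.readLong() };
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("nodes", ".db");
        try (RandomAccessFile f = new RandomAccessFile(tmp, "rw")) {
            long leaf = writeNode(f, 7, -1, -1);
            long root = writeNode(f, 10, leaf, -1);
            long[] r = readNode(f, root);
            System.out.println(r[0] + " " + r[1]); // root's key, then leaf's offset
        }
    }
}
```

Traversal then becomes: read the root's page, find the child offset for your key, seek to that offset, and repeat until you reach a leaf.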
