I need to understand in detail how to design efficient data structures in Cassandra. Is there an online demo or tutorial for understanding the data structure of Cassandra? I need to be able to design column families with their columns and payloads, and see some specific, tangible examples. I'd appreciate it if anyone could recommend a source that would allow me to do this.
Among the several thousand classes that make up the Cassandra codebase, I doubt C*'s performance can be attributed to any single data structure, and the topic is a bit too involved for a single online demo. However...
What better source than the source... Start looking through the code and check out what data structures are used. Incoming writes are first held in memory in something called a memtable; when the in-memory data is flushed to disk, it is stored as sorted string tables (SSTables). This SO question does a comparison between binary tries and SSTables for indexing columns in the DB.
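To make that flow concrete, here is a toy sketch (not Cassandra's actual code; the class and file names are made up for illustration) of a memtable as a sorted in-memory map that is written out as an immutable, sorted file once it grows past a threshold:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Toy memtable: writes go into a sorted in-memory map; once it grows
// past a threshold it is written out as an immutable, sorted file
// (the on-disk analogue of an SSTable) and the map is cleared.
public class ToyMemtable {
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private final int flushThreshold;
    private final Path dir;

    public ToyMemtable(Path dir, int flushThreshold) {
        this.dir = dir;
        this.flushThreshold = flushThreshold;
    }

    public void put(String key, String value) throws IOException {
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            flush(dir.resolve("sstable-" + System.nanoTime() + ".txt"));
        }
    }

    // Keys are written in sorted order, which is what makes the file
    // cheap to binary-search and cheap to merge with other files later.
    private void flush(Path file) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file)) {
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue());
                out.newLine();
            }
        }
        memtable.clear();
    }
}
```

The point of writing the keys in sorted order is that reads can binary-search each file and later compactions can merge files in a single sequential pass.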
The other data structure I found interesting is the Merkle tree, used during repairs. This is a hash tree: each leaf holds the hash of a range of data, and each internal node holds the hash of its children. There are advantages and disadvantages to using a Merkle tree, but the main advantage (and, I guess, disadvantage) is that it reduces how much data needs to be transferred across the wire for repairs (aka tree synchronization), at the expense of the local I/O required to compute the tree's hashes. Read more details in this SO answer and read about Merkle trees on Wikipedia. There is also a great description of how Merkle trees are used during repair in sections 4.6 and 4.7 of the Dynamo paper.
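For a rough feel of what repair actually compares, here is a minimal sketch of computing a Merkle root from leaf hashes; this is purely illustrative and much simpler than Cassandra's real repair code:

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Illustrative Merkle tree: leaves are hashes of data ranges, parents
// hash the concatenation of their children, up to a single root.
public class MerkleSketch {

    static byte[] sha256(byte[]... parts) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (byte[] p : parts) {
            md.update(p);
        }
        return md.digest();
    }

    static byte[] merkleRoot(List<byte[]> leaves) throws Exception {
        List<byte[]> level = new ArrayList<>(leaves);
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                if (i + 1 < level.size()) {
                    next.add(sha256(level.get(i), level.get(i + 1)));
                } else {
                    next.add(level.get(i)); // odd node is carried up unchanged
                }
            }
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> leaves = new ArrayList<>();
        for (String range : new String[] {"range1", "range2", "range3", "range4"}) {
            leaves.add(sha256(range.getBytes("UTF-8")));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : merkleRoot(leaves)) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("root = " + hex);
    }
}
```

Two replicas that compute the same root over the same token range need to exchange nothing further; if the roots differ, they walk down the tree and only stream the ranges whose hashes disagree.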
Related
I have a hierarchical data structure. There is not much addition or deletion done to this structure; it's mostly for reading and searching. I'm trying my best to find a good data structure to store this data to enable fast searching. All the examples/tutorials I have seen talk about some form of binary tree. Is there a data structure (tree) that will let me model this effectively? An alternative I can think of is to use a graph, but I'm not sure about that.
A B-tree would be the best choice for your description because of its excellent performance in "reading and searching": it gives you O(log n) insertion/deletion/search, and besides that it is cache-friendly, so you will get the minimum number of cache misses.
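The JDK does not ship a B-tree, but if the data is mostly static and read-heavy, any ordered map gives the same O(log n) search interface. Here is a minimal sketch (using TreeMap, which is a red-black tree rather than a B-tree; the class name and paths are made up for illustration) that keys each node by its full path, so subtree searches become range scans over the sorted keys:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: store hierarchy nodes keyed by their full path, so exact
// lookups and "all descendants of a node" are ordered-map operations.
public class HierarchyIndex {
    private final TreeMap<String, String> byPath = new TreeMap<>();

    public void add(String path, String payload) {
        byPath.put(path, payload);
    }

    public String find(String path) {
        return byPath.get(path); // O(log n)
    }

    // Every descendant of /a/b sorts between "/a/b/" and "/a/b/\uffff",
    // so a subtree query is a range scan over the sorted keys.
    public SortedMap<String, String> subtree(String path) {
        return byPath.subMap(path + "/", path + "/\uffff");
    }

    public static void main(String[] args) {
        HierarchyIndex idx = new HierarchyIndex();
        idx.add("/root/a", "A");
        idx.add("/root/a/x", "AX");
        idx.add("/root/b", "B");
        System.out.println(idx.find("/root/a"));    // A
        System.out.println(idx.subtree("/root/a")); // {/root/a/x=AX}
    }
}
```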
I have to explain what a data structure is to someone, so what would be the easiest way to explain it? Would it be right if I say:
"Data structure is used to organize data(arrange data in some fashion) so that we can perform certain operation fastly with as little resource usage as possible"
A data structure is about how values are placed together in memory locations, and how the addresses and indices of those locations can themselves be stored as values. At a more abstract level, that gives you "structures" such as arrays, linked lists, pointers, graphs, and binary trees, together with the things you can do with them (the algorithms) and their capabilities: being sorted, needing sortedness, fast access, and so on.
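A tiny example can make this concrete: the same values, organized two different ways, make different operations cheap (this is only an illustrative sketch):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// The same values, organized two ways: an array-backed list gives O(1)
// access by index, while a linked list gives O(1) insertion at the front.
public class OrganizationMatters {
    public static void main(String[] args) {
        List<Integer> array = new ArrayList<>(List.of(10, 20, 30, 40));
        List<Integer> linked = new LinkedList<>(List.of(10, 20, 30, 40));

        // Fast for the array-backed list: jump straight to position 2.
        System.out.println(array.get(2)); // 30

        // Fast for the linked list: splice a new head node in.
        linked.add(0, 5);
        System.out.println(linked); // [5, 10, 20, 30, 40]
    }
}
```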
This is fundamental and not too complicated; a good grasp of data structures, and their correct usage, lets you solve problems elegantly. For learning data structures, a language like Pascal is more beneficial than C.
In computer science, a data structure is a particular way of organizing data in a computer so that it can be used efficiently.
Source: Wikipedia (https://en.wikipedia.org/wiki/Data_structure)
I would say what you wrote is pretty close. :)
I am currently working on frequent pattern mining (FPM). I was googling for data structures that can be used for FPM. My main concern is the space-compactness of the data structure, since I am planning to run a distributed algorithm over it (handling synchronization over a DS that fits in my main memory). The data structures I have come across are:
Prefix-Tree
Compact Prefix-Tree or Radix Tree
Prefix Hash Tree (PHT)
Burst Trie (currently reading how it works)
I don't know the order in which these data structures evolved. Can anyone tell me which data structure (not limited to the ones mentioned above) best fits my requirements?
P.S.: Currently I am assuming the burst trie is the best known space-efficient data structure for FPM.
I agree that the question is broad. However, if you're looking for a space-efficient prefix tree, then I would strongly recommend a Burst Trie. I wrote an implementation and was able to squeeze a lot of space efficiency out of it for Stripe's latest Capture the Flag. (They had a problem which used 4 nodes at less than 500 MB each that "required" a suffix tree.)
If you're looking for an implementation of an efficient burst trie then check mine out.
https://github.com/nbauernfeind/scala-burst-trie
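For orientation, a burst trie starts out like a plain prefix trie but keeps small sets of suffixes in compact leaf buckets that only "burst" into trie nodes when they grow too large. The sketch below shows only the plain prefix-trie part (insert and prefix counting), not a full burst trie, and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// A plain prefix trie: each node maps the next character to a child.
// A burst trie would replace cold subtrees with small suffix buckets.
public class PrefixTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int terminals;     // how many inserted strings end here
        int subtreeCount;  // how many inserted strings pass through here
    }

    private final Node root = new Node();

    public void insert(String s) {
        Node cur = root;
        cur.subtreeCount++;
        for (int i = 0; i < s.length(); i++) {
            cur = cur.children.computeIfAbsent(s.charAt(i), c -> new Node());
            cur.subtreeCount++;
        }
        cur.terminals++;
    }

    // Number of inserted strings sharing the given prefix: the basic
    // support query behind frequent-pattern counting over a trie.
    public int countWithPrefix(String prefix) {
        Node cur = root;
        for (int i = 0; i < prefix.length(); i++) {
            cur = cur.children.get(prefix.charAt(i));
            if (cur == null) return 0;
        }
        return cur.subtreeCount;
    }

    public static void main(String[] args) {
        PrefixTrie t = new PrefixTrie();
        t.insert("abc");
        t.insert("abd");
        t.insert("xyz");
        System.out.println(t.countWithPrefix("ab")); // 2
    }
}
```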
Having been learning data structures and algorithms for a long time, I'm still uncertain about the practical applications of famous data structures such as the red-black tree and the splay tree.
I know that B-tree has been widely used in database stuff.
With respect to other tree data structures like red-black trees, splay trees, etc., have they been widely used in practice? If so, please give some examples.
Unlike a B-tree, whose structure can be retained and saved on disk, red-black and splay trees cannot achieve that; they are just in-memory structures, right? So how can they be as popular as B-trees?
I know that B-tree has been widely used in database stuff.
That isn’t very specific, is it?
In fact, B trees and red-black trees serve the exact same purpose: Both are index data structures, more precisely search trees, i.e. data structures that allow you to efficiently search for an item in a collection.
The only relevant difference between red-black trees and B trees is the fact that the latter incorporate some additional factors that improve their caching behaviour, which is required when access to memory is particularly slow due to high latency (simply put, an average access to the B tree will require less jumping around in memory than it does in the red-black tree, and more reading of adjacent memory locations, which is often much faster).
Historically, this has been used to store the index on a disk (secondary storage) which is very slow compared to main storage (RAM). Red-black trees, on the other hand, are often used when the index is retained in RAM (for example, the C++ std::map structure is usually implemented as a red-black tree).
This is changing, though. Modern CPUs use caches to speed up access to main memory, and since access to RAM is much slower than access to the cache, B trees (and their variants) once again become better suited than red-black trees, even for in-memory indexes.
Probably the most widely-used implementations of the red-black tree are the Java TreeMap and TreeSet library classes, used to implement sorted maps and sets of objects in a tree-like structure. Cadging a bit from this Wikipedia article, red-black trees require less reshuffling to be done on inserts and deletes, because they don't impose as stringent requirements on the fullness of the structure.
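For example, a quick TreeMap usage sketch (TreeMap and TreeSet are the actual JDK classes; the data here is made up):

```java
import java.util.TreeMap;

// java.util.TreeMap is backed by a red-black tree: keys stay sorted and
// get/put/remove are O(log n), with no explicit rebalancing by the caller.
public class TreeMapExample {
    public static void main(String[] args) {
        TreeMap<String, Integer> index = new TreeMap<>();
        index.put("cherry", 3);
        index.put("apple", 1);
        index.put("banana", 2);

        System.out.println(index.firstKey());        // apple
        System.out.println(index.get("banana"));     // 2
        System.out.println(index.headMap("cherry")); // {apple=1, banana=2}
    }
}
```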
Many applications of sorted trees do not require the data structure to be written to disk. Often, data is received or generated in arbitrary order and sorted solely for the use of another part of the same program. At other times, data must be sorted before being output, but is then simply output as a flat file without conveying the tree structure. In any case, relatively few on-disk file formats are derived from simply writing the contents of memory to disk; storing data this way requires annoying pointer adjustments and, more importantly, makes the on-disk format depend on such details as the processor data word size, system byte order, and word alignment. Data is far more commonly either written out as (perhaps compressed) text, or written to disk in a carefully-defined binary format. The only cases I can think of where a sorted tree is written to disk are databases and file systems, where the structure is loaded from disk into memory and used as is; in those cases, B-trees are indeed the preferred data structure.
My favourite example of practical usage is in CPU scheduling: this task scheduler, which employs an RB tree, was shipped with the Linux 2.6.23 kernel. Of course there's plenty more, as has already been pointed out; this is just my personal favourite.
There are many strategies for disk space (and memory) management in databases.
I try to track the best ones, like the log-structured merge (LSM) tree as used in BigTable (and HBase, Hypertable, Cassandra) or the fractal tree used in TokuDB. As you can guess from these examples, I mean algorithms that use resources wisely (for example, avoiding unnecessary I/O and scaling well).
Are there other algorithms like the LSM tree? Just point me to them.
Google recently released LevelDB (you can search for it on Google). People say it is the memtable/SSTable implementation behind Google's Bigtable. After reading some of the source code, I think it is a simplified version. Hope it helps.
There is also nessDB. It uses a simple LSM-tree: https://github.com/shuttler/nessDB
H2 Database's MVStore uses log-structured storage, which is somewhat similar to an LSM-tree.
Fragmented LSM-Tree, implemented in PebblesDB
WiscKey, implemented in this contest project
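To make the LSM-tree idea from the question concrete, here is a minimal sketch of the core operation all of these systems share: merging two sorted runs into one, the step that compaction repeats as data flows from the memtable down through the on-disk levels. Everything here is illustrative; real engines add bloom filters, tombstones, and level-sizing policies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Compaction in miniature: merge an older sorted run with a newer one,
// letting the newer run's values win on duplicate keys.
public class CompactionSketch {

    static List<Map.Entry<String, String>> merge(
            List<Map.Entry<String, String>> older,
            List<Map.Entry<String, String>> newer) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < older.size() || j < newer.size()) {
            if (j >= newer.size()) {
                out.add(older.get(i++));
            } else if (i >= older.size()) {
                out.add(newer.get(j++));
            } else {
                int cmp = older.get(i).getKey().compareTo(newer.get(j).getKey());
                if (cmp < 0) {
                    out.add(older.get(i++));
                } else if (cmp > 0) {
                    out.add(newer.get(j++));
                } else {
                    out.add(newer.get(j++)); // newer value shadows the older one
                    i++;
                }
            }
        }
        return out; // still sorted, duplicates resolved
    }

    public static void main(String[] args) {
        TreeMap<String, String> run1 = new TreeMap<>(Map.of("a", "1", "c", "old", "e", "5"));
        TreeMap<String, String> run2 = new TreeMap<>(Map.of("b", "2", "c", "new"));
        System.out.println(merge(new ArrayList<>(run1.entrySet()),
                                 new ArrayList<>(run2.entrySet())));
        // [a=1, b=2, c=new, e=5]
    }
}
```

Because every run is sorted, the merge is a single sequential pass, which is exactly why LSM-based stores trade random writes for sequential I/O.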