Are there other algorithms like LSM tree?

There are many strategies for disk space (and memory) management in databases.
I try to track the best ones, such as the log-structured merge tree as used in BigTable (and HBase, Hypertable, Cassandra) or the fractal tree used in TokuDB. As the examples suggest, I mean algorithms that use resources wisely (for example, avoiding I/O and scaling well).
Are there other algorithms like the LSM tree? Just point me in the right direction.

Google recently released LevelDB (you can search for it on Google).
People say it is an implementation of the memtable/SSTable design from Google's Bigtable!
After reading some of the source code, I think it is a simplified version.
Hope it can be of some help.

There is also nessDB.
It uses a simple LSM-tree: https://github.com/shuttler/nessDB

H2 Database's MVStore uses log-structured storage, which is slightly similar to an LSM-tree.
Fragmented LSM-Tree, implemented in PebblesDB
WiscKey, implemented in this contest project
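For intuition, the core LSM-tree idea behind all of these engines (buffer writes in a sorted in-memory memtable, flush it to immutable sorted runs on disk, and check the newest data first on reads) can be sketched in a few lines. This is a toy model for illustration only, not the design of any engine named above; the class and method names are made up, and real systems add a write-ahead log, bloom filters, and background compaction.

```python
import bisect

class ToyLSMTree:
    """Toy LSM tree: a mutable in-memory memtable is flushed to
    immutable sorted runs ('SSTables'); reads check newest data first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}             # mutable in-memory write buffer
        self.sstables = []             # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # newest data wins
            return self.memtable[key]
        for run in reversed(self.sstables):   # then newest run first
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

The point of the structure is that all disk writes are sequential (whole sorted runs), which is what makes these engines use I/O wisely.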

Related

How to achieve time travel with Clojure

Is there a way to achieve time traveling in Clojure? For example, if I have a vector (which is internally a tree implemented as a persistent data structure), is there a way to travel back and get previous versions of that vector? It would be kind of what Datomic does at the database level. Since Clojure and Datomic share many concepts, including facts being immutable and implemented as persistent data structures, technically the older version of the vector is still there. So I was wondering whether time traveling and getting previous versions is possible in plain Clojure, similarly to what is done in Datomic at the database level.
Yes, but you need to keep a reference to it in order to access it, and in order to prevent it from being garbage collected. Clojurists often implement undo/redo in this way; all you need to do is maintain a list of historical states of your data, and then you can trivially step backward.
David Nolen has described this approach here, and you can find a more detailed example and explanation here.
Datomic is plain Clojure. You can use Datomic as a Clojure library either with an in-memory database (for version tracking) or with no database at all.
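The undo pattern described above (keep references to every immutable snapshot, then step back by index) can be sketched outside Clojure too. Here is a minimal Python illustration, using tuples as stand-ins for persistent vectors; the class and method names are invented for this sketch.

```python
class History:
    """Keep every immutable snapshot so earlier versions stay
    reachable (and are not garbage-collected), as in the
    Clojure undo/redo pattern."""

    def __init__(self, initial=()):
        self.states = [tuple(initial)]   # list of immutable snapshots

    def update(self, fn):
        # fn builds a new state from the current one; old states survive.
        self.states.append(tuple(fn(self.states[-1])))

    def current(self):
        return self.states[-1]

    def at(self, version):
        # "Time travel": read any previous version by index.
        return self.states[version]

h = History([1, 2, 3])
h.update(lambda s: s + (4,))     # version 1: (1, 2, 3, 4)
h.update(lambda s: s[:2])        # version 2: (1, 2)
```

In Clojure the snapshots would share structure, so keeping the whole history is far cheaper than this tuple-copying sketch suggests.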

Which type of Tree Data Structure is suitable for efficient frequent pattern mining?

I am currently working on frequent pattern mining (FPM). I was googling for the data structures that can be used for FPM. My main concern is the space-compactness of the data structures, as I am planning to use a distributed algorithm over them (handling synchronization over a DS that fits in my main memory). The data structures I have come across are:
Prefix-Tree
Compact Prefix-Tree or Radix Tree
Prefix Hash Tree (PHT)
Burst Tree (currently reading how it works)
I don't know the order in which each data structure evolved. Can anyone tell me which DS (not limited to the ones mentioned above) is the best data structure that fits my requirements?
P.S.: I currently consider the burst tree to be the best-known space-efficient data structure for FPM.
I agree that the question is broad. However, if you're looking for a space-efficient prefix tree, then I would strongly recommend a Burst Trie. I wrote an implementation and was able to squeeze a lot of space efficiency out of it for Stripe's latest Capture the Flag. (They had a problem, which used 4 nodes at less than 500 MB each, that "required" a suffix tree.)
If you're looking for an implementation of an efficient burst trie then check mine out.
https://github.com/nbauernfeind/scala-burst-trie
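For context, the space savings in all of these structures come from sharing common prefixes. A plain prefix tree, the baseline that a burst trie improves on, can be sketched as below; a burst trie replaces the sparse lower levels with small flat buckets of suffixes and only "bursts" a bucket into a real trie node once it grows past a threshold. This is an illustrative sketch, not the linked implementation.

```python
class PrefixTree:
    """Minimal prefix tree (trie) over strings, built from nested dicts.
    Words sharing a prefix share the nodes along that prefix."""

    END = "$"  # end-of-word marker key

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})  # share or create the child
        node[self.END] = True

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return self.END in node
```

In a burst trie, most of these dict nodes near the leaves would be replaced by unsorted arrays of string suffixes, which is where the large space win comes from.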

Cassandra Data Structure

I need to understand in detail how to design efficient data structures in Cassandra. Is there an online demo or tutorial for understanding the data structure of Cassandra? I need to be able to design column families with their columns and payloads, and see some specific, tangible examples. I'd appreciate it if anyone could recommend a source that would allow me to do this.
Among the several thousand classes that make up the Cassandra codebase, I doubt that C*'s performance can be attributed to a single data structure. This topic is a bit complicated for a single online demo, however...
What better source than the source... Start looking through the code and check out what data structures are used. Incoming data is first stored in memory in a structure called a memtable; the in-memory data is then flushed to disk, where it is stored in sorted string tables (SSTables). This SO question does a comparison between binary tries and SSTables for indexing columns in the DB.
The other data structure I found interesting is the Merkle tree, used during repairs. This is a hashed binary tree. There are many advantages and disadvantages to using the Merkle tree, but the main advantage (and I guess disadvantage) is that it reduces how much data needs to be transferred across the wire for repairs (aka tree synchronization), at the expense of the local I/O required for computing the tree's hashes. Read more details in this SO answer and read about Merkle trees on Wikipedia. There is also a great description of how Merkle trees are used during repair in sections 4.6 and 4.7 of the Dynamo paper.
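The repair idea can be sketched in a few lines: each replica hashes its data into a binary tree of hashes, and two replicas only need to exchange the data under subtrees whose hashes differ; if the roots match, nothing is transferred at all. This is a toy version for intuition; Cassandra's actual Merkle trees are built over token ranges.

```python
import hashlib

def merkle_root(leaves):
    """Compute the Merkle root of a list of byte strings by hashing
    leaves, then repeatedly hashing adjacent pairs up to a single root."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node if odd
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Two replicas compare one 32-byte hash instead of shipping their data;
# equal roots mean the ranges are already in sync.
a = merkle_root([b"row1", b"row2", b"row3"])
b = merkle_root([b"row1", b"row2", b"row3"])
c = merkle_root([b"row1", b"rowX", b"row3"])
```

When roots differ (as `a` vs `c` here), the comparison recurses into child hashes to locate the divergent rows, which is the tree-synchronization step the answer describes.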

Neo4j and Cluster Analysis

I'm developing a web application that will heavily depend on its ability to make suggestions on items based on users with similar preferences. A friend of mine told me that what I'm looking for, mathematically, is some cluster analysis algorithm. On the other hand, here on SO, I was told that Neo4j (or some other graph DB) was the kind of DB that I should approach for this task (the preferences one).
I started studying both of these tools, and I'm having some doubts.
For cluster analysis purposes, it seems to me that a standard SQL DB would still be the perfect choice, while Neo4j would be better suited to a neural network kind of approach (although still perfectly fit for the task).
Am I missing something? Am I trying to use the wrong tools combination?
I would love to hear some ideas on the subject.
Thanks for sharing
This depends on your data. Neo4j is capable of providing even complex recommendations in real time for one particular node. Say you want to recommend some product to a user: this can be handled within a graph DB in real time.
Using a clustering system, on the other hand, is the best way to compute recommendations for all users at once (and then perhaps save them somewhere so you wouldn't need to calculate them again).
The computational difference:
Neo4j has no initialization cost and can give you one recommendation in an acceptable time.
Clustering needs more time for initialization (i.e., not seconds but more likely minutes or hours) and is better suited to calculating recommendations for the whole dataset. In fact, taking strictly the time of one calculation for a specific user, clustering can do it faster than Neo4j, but the big restriction is the initial setup, so it is not good for real-time applications.
The practical difference:
If you have mostly static data and it is OK for you to compute recommendations once in a while, then do clustering with SQL.
If you have dynamic data that is updated with each interaction, and it is necessary for you to always provide the newest recommendation, then use Neo4j.
I am currently working on various topics related to recommendation and clustering with Neo4j.
I'm not exactly sure what you're looking for, but depending on how you model your data in the graph, you can easily work out clustering algorithms based on counting links to various types of nodes.
If you plan your nodes and relationships correctly, you can then identify groups of nodes that share the most common links to a set of categories.
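The link-counting idea can be sketched independently of any particular database: model user-to-item links as sets, score user similarity by shared links, and recommend the similar user's items. All names and data here are made up for illustration; this is not a Neo4j API, just the shape of the computation.

```python
# Toy link-counting recommendation: users are similar if they link to
# (e.g. like) many of the same items; recommend items from the most
# similar user that the target user does not link to yet.
links = {
    "alice": {"book1", "book2", "film1"},
    "bob":   {"book1", "book2", "film2"},
    "carol": {"film3"},
}

def most_similar(user):
    # Rank other users by the number of shared links.
    others = ((len(links[user] & links[o]), o)
              for o in links if o != user)
    return max(others)[1]

def recommend(user):
    neighbour = most_similar(user)
    return sorted(links[neighbour] - links[user])
```

In a graph DB this same computation is a two-hop traversal from the user node; in SQL it is a self-join on a link table, which is why both tools can express it.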
Let me introduce Reco4J (http://www.reco4j.org), an open-source framework that provides recommendations based on a graph database source. It uses Neo4j as its graph database management system.
Have a look at it and contact us if you are interested in support.
It is in a really early release, but we are working hard to provide extended documentation and new interesting features.
Cheers,
Alessandro

Sort and shuffle optimization in Hadoop MapReduce

I'm looking for a research/implementation based project on Hadoop and I came across the list posted on the wiki page - http://wiki.apache.org/hadoop/ProjectSuggestions. But, this page was last updated in September, 2009. So, I'm not sure if some of these ideas have already been implemented or not. I was particularly interested in "Sort and Shuffle optimization in the MR framework" which talks about "combining the results of several maps on rack or node before the shuffle. This can reduce seek work and intermediate storage".
Has anyone tried this before? Is this implemented in the current version of Hadoop?
There is the combiner functionality (as described under the "Combine" section of http://wiki.apache.org/hadoop/HadoopMapReduce), which is more or less an in-memory shuffle. But I believe that the combiner only aggregates key-value pairs for a single map task, not all the pairs for a given node or rack.
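The combiner behaviour described here can be sketched with word count: the combiner pre-aggregates one map task's output locally, so fewer pairs cross the network during the shuffle. This is a hedged toy model of the data flow, not the Hadoop API; the function names are invented.

```python
from collections import Counter

def map_task(text):
    # Mapper: emit (word, 1) pairs, as a word-count mapper would.
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # Combiner: aggregate pairs locally, within one map task's output.
    # This is the step that shrinks data before the shuffle.
    return list(Counter(w for w, _ in pairs).items())

def reduce_all(combined_outputs):
    # Reducer: merge the already-shrunk outputs of all map tasks.
    total = Counter()
    for pairs in combined_outputs:
        total.update(dict(pairs))
    return dict(total)

m1 = combine(map_task("a b a"))   # one map task's combined output
m2 = combine(map_task("b b c"))   # another map task's combined output
result = reduce_all([m1, m2])
```

The rack- or node-level optimization the question asks about would add another `combine` pass over `m1` and `m2` together before the network transfer, which stock Hadoop's per-task combiner does not do.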
The project description is aimed at "optimization".
This feature is already present in the current Hadoop-MapReduce and it can probably run in a lot less time.
Sounds like a valuable enhancement to me.
I think it is a very challenging task. In my understanding, the idea is to build a computation tree instead of a "flat" map-reduce. A good example of this is Google's Dremel engine (now called BigQuery). I would suggest reading this paper: http://sergey.melnix.com/pub/melnik_VLDB10.pdf
If you are interested in this kind of architecture, you can also take a look at the open-source clone of this technology, Open Dremel.
http://code.google.com/p/dremel/
