How efficiently are documents stored on disk? - rethinkdb

In RethinkDB, how efficiently are documents stored on disk? How does this compare to other databases, like MySQL, MongoDB, PostgreSQL, Cassandra? Does RethinkDB use greater, lesser, or equal space to store the same or similar data? This is what I mean by 'efficient'.
Obviously, I imagine that it may be difficult to compare in this way, but I am curious nevertheless. It would be helpful to get a sense of the general disk usage profile of each database, and whether some databases may be more or less efficient with their disk usage than others. In particular, it would be great to get a sense of where RethinkDB stands in this regard.
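There is no single number, and results depend heavily on document shape, compression, and replication settings. One rough way to get an empirical answer is to load the same synthetic documents into each database and compare the size of the data directories afterwards. A minimal sketch for the RethinkDB side (assuming a local instance, the classic rethinkdb Python driver, and /var/lib/rethinkdb as the data directory; the table name and documents are made up):

    import os
    import rethinkdb as r  # classic driver API; newer releases use: from rethinkdb import RethinkDB; r = RethinkDB()

    def dir_size_bytes(path):
        """Rough on-disk footprint: sum of all file sizes under `path`."""
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        return total

    conn = r.connect(host="localhost", port=28015)
    r.table_create("docs").run(conn)  # run once; errors if the table already exists

    # Load the same synthetic documents you would load into the other databases.
    docs = [{"id": i, "name": "user%d" % i, "score": i * 0.5} for i in range(100000)]
    r.table("docs").insert(docs).run(conn)

    # Data directory path depends on your installation.
    print(dir_size_bytes("/var/lib/rethinkdb"))

Running the same loader against MySQL, MongoDB, PostgreSQL, or Cassandra and measuring their data directories the same way gives an apples-to-apples, if crude, comparison.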

Related

What is the fundamental difference between ElasticSearch and a cache?

Theoretically speaking, can't you just cache search results from a SQL query made to the database, making it similar to Elasticsearch? I understand you would run into invalidation issues, but what are the fundamental differences between Elasticsearch and a cache like Redis?
Elasticsearch is primarily a search engine optimized to store and retrieve structured or semi-structured data. It takes care of processing structured/semi-structured data, indexes it, and provides a nice DSL to query the data. Oh, and it happens to be super fast :)
A distributed cache like Memcached or Redis (by the way, Redis is not just a cache but a data structure store) primarily stores key-value pairs for fast lookup. Think of your local hash table distributed across a bunch of machines.
These are two different use cases. If all you need is a cache, Elasticsearch may not be the right choice.
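To make the contrast concrete, here is a minimal sketch of the cache-aside pattern with Redis next to an Elasticsearch full-text query (the products index, the field names, and the query function are made up; the Elasticsearch call assumes the 8.x Python client):

    import json
    import redis
    from elasticsearch import Elasticsearch

    r = redis.Redis(host="localhost", port=6379)
    es = Elasticsearch("http://localhost:9200")

    def cached_query(key, run_query, ttl=300):
        """Cache-aside: return the cached value if present, otherwise run the
        query, store the result with a TTL, and return it. Invalidation is
        entirely the application's problem."""
        hit = r.get(key)
        if hit is not None:
            return json.loads(hit)
        result = run_query()
        r.setex(key, ttl, json.dumps(result))
        return result

    # Elasticsearch, by contrast, answers ad-hoc full-text queries with ranking.
    hits = es.search(index="products", query={"match": {"description": "wireless mouse"}})

The cache can only hand back results for keys you have already computed and stored, while Elasticsearch can answer queries it has never seen before; that is the fundamental difference the answer above is pointing at.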

what should be considered before choosing hbase?

I am very new to the big data space.
Our team suggested that we use HBase instead of an RDBMS for high performance. We have no idea what should or must be considered before switching from an RDBMS to HBase. Any ideas?
One of my favourite books describes...
Coming to #Whitefret's last point: there is something called the CAP theorem, on which this decision can be based.
Consistency (all nodes see the same data at the same time)
Availability (every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
In this context, HBase supports CP (consistency and partition tolerance).
However, for migrating data from an RDBMS to HBase you can use Sqoop.
It's a difficult question, there are many things to consider.
Can you optimize your RDBMS? Adding indexes, denormalizing joins that cost too much... There are many paths to consider, and I am no expert.
Is your data big? This is very vague, and there is a gray area between RDBMS and Big Data where you can't be sure which one to use. Millions of rows can still be handled efficiently by an RDBMS.
Do you need relations in your data? NoSQL databases don't use relations, which can be hard for people coming from a SQL background. There are frameworks that put SQL on top of HBase, but it is generally a bad idea to carry an RDBMS model over to Big Data.
If you can answer those questions and you think NoSQL is the way to go, ask your team how they feel about it. NoSQL databases come with problems you would never meet in the SQL world. They should build a prototype first to understand how all this works, and maybe some training should be made available to them.
In Summary:
- Find if you need non relational database
- Choose the right one (is HBase really what you need? Why not consider Cassandra or MongoDB?)
HBase, like all NoSQL databases, comes with great new features, but sadly nothing is free (not even mentioning the monetary cost).
In HBase, you really should check whether all the queries you might want to run can be fulfilled by the HBase data model. An important thing to consider is schema design (the modelling of the rowkey first and foremost).
I advise you to read this really good paper:
http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf
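To make the rowkey point concrete, here is a small sketch using the happybase Thrift client (the events table, the d column family, and the key layout are made up): if the rowkey leads with the attribute you scan by, related rows are stored next to each other and a prefix scan stands in for a secondary index.

    import happybase

    connection = happybase.Connection("localhost")  # assumes a local HBase Thrift server
    table = connection.table("events")              # hypothetical table with column family 'd'

    def put_event(user_id, ts, payload):
        # Rowkey <user_id>#<reversed timestamp> keeps each user's events
        # contiguous and newest-first.
        rowkey = ("%s#%019d" % (user_id, 2**63 - ts)).encode("utf-8")
        table.put(rowkey, {b"d:payload": payload.encode("utf-8")})

    def latest_events(user_id, limit=10):
        # "Latest events for a user" becomes a simple prefix scan.
        prefix = ("%s#" % user_id).encode("utf-8")
        return [(key, data[b"d:payload"])
                for key, data in table.scan(row_prefix=prefix, limit=limit)]

A rowkey that does not match your access pattern forces full-table scans or client-side filtering, which is exactly the kind of query the HBase data model cannot rescue after the fact.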
I think that a really good answer to your question can be found on the HBase official site.
"HBase isn’t suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
Third, make sure you have enough hardware. Even HDFS doesn’t do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only."
https://hbase.apache.org/book.html

Is Hadoop a good candidate for use as a key-value store?

Question
Would Hadoop be a good candidate for the following use case:
Simple key-value store (primarily needs to GET and SET by key)
Very small "rows" (32-byte key-value pairs)
Heavy deletes
Heavy writes
On the order of a 100 million to 1 billion key-value pairs
Majority of data can be contained on SSDs (solid state drives) instead of in RAM.
More info
The reason I ask is that I keep seeing references to the Hadoop file system and how Hadoop is used as the foundation for a lot of other database implementations that aren't necessarily designed for MapReduce.
Currently, we are storing this data in Redis. Redis performs great, but since it holds all of its data in RAM, we have to use expensive machines with upwards of 128 GB of RAM. It would be nice to instead use a system that relies on SSDs; that way we would have the freedom to build much bigger hash tables.
We have also stored this data using Cassandra, but Cassandra tends to "break" if the deletes become too heavy.
Hadoop (contrary to popular media opinion) is not a database. What you describe is a database. Thus Hadoop is not a good candidate for you. Also, the post below is opinionated, so feel free to prove me wrong with benchmarks.
If you care about "NoSQL DBs" that sit on top of Hadoop:
- HBase would be suited for heavy writes, but sucks on huge deletes
- Cassandra: same story, but writes are not as fast as in HBase
- Accumulo might be useful for very frequent updates, but will suck on deletes as well
None of them makes "real" use of SSDs; I don't think any of them gets a huge speedup from them.
All of them suffer from costly compactions once you start fragmenting your tablets (in BigTable speak), so deleting is a fairly obvious limiting factor.
What you can do to mitigate the deletion issue is to just overwrite with a constant "deleted" value, which works around the compactions (see the sketch after this answer). However, this grows your table, which can be costly on SSDs as well, and you will need to filter on reads, which likely affects read latency.
From what you describe, Amazon's DynamoDB architecture sounds like the best candidate here, although deletes there are also costly, maybe not as much as with the above alternatives.
BTW: the recommended way of deleting lots of rows from the tables in any of the above databases is to just completely delete the table. If you can fit your design into this paradigm, any of those will do.
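A sketch of the overwrite-instead-of-delete workaround mentioned above, again using happybase against a hypothetical table (the kv table and the d:v column are made up); the application writes a marker value instead of issuing a real delete and filters it out on reads:

    import happybase

    connection = happybase.Connection("localhost")
    table = connection.table("kv")  # hypothetical table with column family 'd'

    DELETED = b"__deleted__"  # constant marker value written instead of a delete

    def soft_delete(key):
        # An overwrite is just another write; it avoids tombstones, but the row
        # still occupies space until you rewrite or drop the table.
        table.put(key, {b"d:v": DELETED})

    def get(key):
        row = table.row(key, columns=[b"d:v"])
        value = row.get(b"d:v")
        if value is None or value == DELETED:
            return None  # treat the marker as "not found"
        return value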
Although this isn't an answer to your question, in the context of what you say about
It would be nice to instead use a system that relies on SSDs. This way
we would have the freedom to build much bigger hash tables.
you might consider taking a look at Project Voldemort.
Speaking as a Cassandra user, I know what you mean when you say it's the compaction and the tombstones that are the problem. I have run into TombstoneOverwhelmingException a couple of times myself and hit dead ends.
You might want to have a look at this article by LinkedIn.
It says:
Memcached is all in memory so you need to squeeze all your data into
memory to be able to serve it (which can be an expensive proposition
if the generated data set is large).
And finally
all we do is just mmap the entire data set into the process address
space and access it there. This provides the lowest overhead caching
possible, and makes use of the very efficient lookup structures in the
operating system.
I don't know if this fits your case, but you could consider evaluating Voldemort. Best of luck.
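The mmap trick LinkedIn describes is easy to experiment with from Python; a minimal sketch for fixed-size 32-byte records (the file name and record layout are made up):

    import mmap

    RECORD_SIZE = 32  # matches the 32-byte key/value records described above

    def read_record(mm, i):
        """Return the i-th fixed-size record from the mapped file."""
        return mm[i * RECORD_SIZE:(i + 1) * RECORD_SIZE]

    # "records.bin" is a made-up file of fixed-size records for illustration.
    with open("records.bin", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # The OS page cache decides what actually sits in RAM, so the data
            # set can be much larger than physical memory and still be served
            # from SSD-backed pages with very low overhead.
            first = read_record(mm, 0)
            print(first)
        finally:
            mm.close()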

Is Hadoop the right tech for this?

If I had millions of records of data that are constantly being updated and added to every day, and I needed to comb through all of the data for records that match specific logic and then take that matching subset and insert it into a separate database, would I use Hadoop and MapReduce for such a task, or is there some other technology I am missing? The main reason I am looking for something other than a standard RDBMS is that all of the base data comes from multiple sources and is not uniformly structured.
Map-Reduce is designed for algorithms that can be parallelized and local results can be computed and aggregated. A typical example would be counting words in a document. You can split this up into multiple parts where you count some of the words on one node, some on another node, etc and then add up the totals (obviously this is a trivial example, but illustrates the type of problem).
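For reference, that word-count example looks like this when written as a Hadoop Streaming job in Python (file layout is up to you; Streaming pipes lines through stdin/stdout and sorts the mapper output by key before the reducer sees it):

    #!/usr/bin/env python
    # Word count as a Hadoop Streaming mapper/reducer (usually split into
    # mapper.py and reducer.py; combined here with a command-line switch).
    import sys

    def mapper():
        # Emit "word<TAB>1" for every word on stdin.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word.lower())

    def reducer():
        # Input arrives sorted by word, so consecutive keys can be summed locally.
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = word, 0
            count += int(n)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        reducer() if sys.argv[1:] == ["reduce"] else mapper()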
Hadoop is designed for processing large data files (such as log files). The default block size is 64MB, so having millions of small records wouldn't really be a good fit for Hadoop.
To deal with the issue of non-uniformly structured data, you might consider a NoSQL database such as MongoDB, which is designed to handle data where many of the columns are null.
Hadoop/MapReduce is designed for batch processing, not real-time processing, so some other alternative such as Twitter Storm or HStreaming has to be considered.
Also, look at Hama for real-time processing of data. Note that real-time processing in Hama is still crude and a lot of improvement/work remains to be done.
I would recommend Storm or Flume. In either of these you may analyze each record as it comes in and decide what to do with it.
If your data volumes are not huge (and millions of records do not sound like it), I would suggest trying to get the most out of an RDBMS, even if your schema is not properly normalized.
I think even a table with the structure K1, K2, K3, Blob will be more useful.
In NoSQL, key-value stores are built to support schemaless data in various flavors, but their query capabilities are limited.
The only case I can think of as useful is the MongoDB/CouchDB capability to index schemaless data: you will be able to get records by some attribute value.
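For example, with PyMongo (database, collection, and field names are made up), you can index one attribute and fetch records by its value even though the documents have no fixed schema:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    records = client.mydb.records  # hypothetical database and collection

    # Documents from different sources don't need the same shape.
    records.insert_many([
        {"source": "crm", "customer": "acme", "priority": 3},
        {"source": "weblog", "url": "/checkout", "status": 500},
    ])

    # Index one attribute and fetch records by its value.
    records.create_index("source")
    for doc in records.find({"source": "weblog"}):
        print(doc)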
Regarding Hadoop MapReduce: I think it is not useful unless you want to harness a lot of CPUs for your processing, have a lot of data, or need distributed sort capability.

Cache systems - Hypertable vs Memcached

I want to implement a cache system for our application, and we've started integrating with Memcached. Recently I started hearing about Hypertable and saw some great benchmarks done with it.
However, I couldn't find a good comparison between the two.
Just to get things straight: I know that Hypertable is considered closer to a DB than to a cache. On the other hand, it's not exactly an RDBMS - in fact, it's exactly not an RDBMS. It has its own benefits, but the question is whether they're worth the performance cost (if any)?
Hypertable is an implementation of the concepts in Google's BigTable: namely, a column-oriented DB whose data is typically highly denormalized, which means it doesn't need joins.
Memcached is an in-memory caching layer which acts like a distributed hashtable, keeping your app from having to hit the actual DB.
Both lend themselves well to being distributed and work well with MapReduce-style topologies, but they serve different purposes. Memcached/DHT serves to speed up access to data held in memory, while Hypertable/BigTable are actual mechanisms for permanent data storage on disk.
Memcached is used for speeding things up, e.g. caching the results of SQL queries without going to the DB, by storing everything in memory (RAM).
Hypertable (like HBase, Cassandra, MongoDB, etc.) and others are permanent-storage NoSQL DBs (data is stored on and retrieved from hard drives). They can't give you the performance of reading/writing from/to RAM the way memcached can, so the two aren't really comparable.
A better approach is to use a NoSQL DB for permanent storage and memcached as a fast front-side cache between the web application and the (NoSQL or any other) DB.
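A minimal read-through sketch of that setup, assuming the pymemcache client and a placeholder fetch_from_db function standing in for whichever permanent store you use:

    import json
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def fetch_from_db(user_id):
        # Placeholder for the real query against Hypertable/HBase/an RDBMS/etc.
        return {"id": user_id, "name": "example"}

    def get_user(user_id, ttl=60):
        key = "user:%s" % user_id
        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)        # served from RAM
        user = fetch_from_db(user_id)     # fall back to the permanent store
        cache.set(key, json.dumps(user), expire=ttl)
        return user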
