Does in make sense to invest into additional in-memory cache for HBase data like GridGain or Redis or it's better to put more resources into HBase's cache?
We have read pattern approximately 50/50, access by key/scan queries.
Related
I know that HBASE is a columnar database that stores structured data of tables into HDFS by column instead of by row. I know that Spark can read/write from HDFS and that there is some HBASE-connector for Spark that can now also read-write HBASE tables.
Questions:
1) What are the added capabilities brought by layering Spark on top of HBASE instead of using HBASE solely? It depends only on programmer capabilities or is there any performance reason to do that? Are there things Spark can do and HBASE solely can't do?
2) Stemming from previous question, when you should add HBASE between HDFS and SPARK instead of using directly HDFS?
1) What are the added capabilities brought by layering Spark on top of
HBASE instead of using HBASE solely? It depends only on programmer
capabilities or is there any performance reason to do that? Are there
things Spark can do and HBASE solely can't do?
At Splice Machine, we use Spark for our analytics on top of HBase. HBase does not have an execution engine and spark provides a competent execution engine on top of HBase (Intermediate results, Relational Algebra, etc.). HBase is a MVCC storage structure and Spark is an execution engine. They are natural complements to one another.
2) Stemming from previous question, when you should add HBASE between
HDFS and SPARK instead of using directly HDFS?
Small reads, concurrent write/read patterns, incremental updates (most etl)
Good luck...
I'd say that using distributed computing engines like Apache Hadoop or Apache Spark imply basically a full scan of any data source. That's the whole point of processing the data all at once.
HBase is good at cherry-picking particular records, while HDFS certainly much more performant with full scans.
When you do a write to HBase from Hadoop or Spark, you won't write it to database is usual - it's hugely slow! Instead, you want to write the data to HFiles directly and then bulk import them into.
The reason people invent SQL databases is because HDDs were very very slow at that time. It took the most clever people tens of years to invent different kind of indexes to clever use the bottleneck resource (disk). Now people try to invent NoSQL - we like associative arrays and we need them be distributed (that's what essentially what NoSQL is) - they're very simple and very convenient. But in todays world with SSDs being cheap no one needs databases - file system is good enough in most cases. The one thing, though, is that it has to be distributed to keep up the distributed computations.
Answering original questions:
These are two different tools for completely different problems.
I think if you use Apache Spark for data analysis, you have to avoid HBase (Cassandra or any other database). They can be useful to keep aggregated data to build reports or picking specific records about users or items, but that's happen after the processing.
Hbase is a No SQL data base that works well to fetch your data in a fast fashion. Though it is a db, it used large number of Hfile(similar to HDFS files) to store your data and a low latency acces.
So use Hbase when it suits a requirement that your data needs to accessed by other big data.
Spark on the other hand, is the in-memory distributed computing engine which have connectivity to hdfs, hbase, hive, postgreSQL,json files,parquet files etc.
There is no considerable performance change while reading from a HDFS file or Hbase upto some gbs. After that Hbase connectivity is becoming faster....
I will be creating a 5 node mongodb cluster. It will be more read heavy than write and had a question which design would bring better performance. These nodes will be dedicated to only mongodb. For the sake of an example, say each node will have 64GB of ram.
From the mongodb docs it states:
MongoDB automatically uses all free memory on the machine as its cache
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
I also read that it is possible to implement mongodb purely in memory
http://edgystuff.tumblr.com/post/49304254688/how-to-use-mongodb-as-a-pure-in-memory-db-redis
If my data was quite dynamic (can range from 50gb to 75gb every few hours), would it be theoretically be better performing to design mongodb in a way which allows mongodb to manage itself with its cache (default setup of mongo), or to put the mongodb into memory initially and if the data grows over the size of ram use swap space (SSD)?
MongoDB default storage engine maps the files in memory. It provides an efficient way to access the data, while avoiding double caching (i.e. MongoDB cache is actually the page cache of the OS).
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
For read traffic, yes. For write traffic, it is different, since MongoDB may have to journalize the write operation (depending on the configuration), and maintain the oplog.
Is it better to run MongoDB from memory only (leveraging tmpfs)?
For read traffic, it should not be better. Putting the files on tmpfs will also avoid double caching (which is good), but the data can still be paged out. Using a regular filesystem instead will be as fast once the data have been paged in.
For write traffic, it is faster, provided the journal and oplog are also put on tmpfs. Note that in that case, a system crash will result in a total data loss. Usually, the performance gain does not worth the risk.
We have the need for a distributed LRU cache, but one which can use both memory and disk. We have a large dataset, which is stored on disk permenantly. From that dataset, we create other calculated datasets, but only when clients need them.
Since these secondary datasets are derived from data which is persistent, we never need to permanently save this derived data.
I thought that Redis would have the ability to use disk as a secondary LRU cache, but have not been able to find any documentation that points to that. It seems like Redis only uses the disk to persist the entire cache. I envisioned that we'd be able to scale out horizontally with a bunch of Redis instances.
If Redis can not do this, is there another system that does?
If the data does not fit into memory, the OS can swap it out to the disk. This is called virtual memory. Here you find an explanation: http://redis.io/topics/virtual-memory
Remark: You want to retrieve some data, do stuff on it and you have some intermediate results. Please check whether you may want to distribute your processing, not only the data. Take a look at Apache Hadoop and especially Apache Spark.
The way to solve this problem without changing how your clients work, is in fact not to use Redis, but instead to use a Redis compatible database like Ardb which in turn can be configured to use LevelDB under the hood which supports LRU type on-disk caches.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
I have a Debian server with about 16GB RAM that I'm using with nginx and several heavy mysql databases, and some custom php apps. I'd like to implement a memory cache between Mysql and PHP, but the databases are too large to store everything in RAM. I'm thinking a LRU cache may be better so far as I research. Does this rule out Redis? Couchbase is also a consideration.
Supposing there is a unique server running nginx + php + mysql instances with some remaining free RAM, the easiest way to use that RAM to cache data is simply to increase the buffer caches of the mysql instances. Databases already use LRU-like mechanisms to handle their buffers.
Now, if you need to move part of the processing away from the databases, then pre-caching may be an option. Before talking about memcached/redis, a shared memory cache integrated with php such as APC will be efficient provided only one server is considered (actually more efficient than redis/memcached).
Both memcached and redis can be considered to perform remote caching (i.e. to share the cache between various nodes). I would not rule out redis for this: it can easily be configured for this purpose. Both will allow to define a memory limit, and handle the cache with LRU-like behavior.
However, I would not use couchbase here, which is an elastic (i.e. supposed to be used on several nodes) NoSQL key/value store (i.e. not a cache). You could probably move some data from your mysql instances to a couchbase cluster, but using it just for caching is over-engineering IMO.
As Matt Ingenthron pointed out and Hari noted that Couchbase supports working as a direct Memcached replacement. Couchbase utilizes memcached in a non-elastic way, as in each node participating in the memcache cluster is discreet with no persistence, i.e. just a cache but couchbase also offers "Couchbase" bucket types which do provide persistence. Membase is part of the code as well so Couchbase not only serve data from disk but also from RAM and persists it there while replicating to other nodes and persisting to disk as changes are applied. I would highly recommend Couchbase 3.x for both caching and persistence in one footprint, or multiple footprints if you just wanted only a caching layer separate from your persistence layer.
We used memcached initially to cache data. In memcached partitioning data for different applications under different bucket was a real issue.Also we have a requirement to flush data from one bucket alone. Monitoring data is another requirement. We moved to Couchbase and use the memcache-style bucket. I guess its much more flexible and efficient to use Couchbase memcache-style bucket for caching rather than using memcached.
Have you ever considered to move your databases totally to RAM using one of the in-memory NoSQL solutions with persistence? It could take less storage than your original MySQL database, because many NoSQL solutions usually have less footprint than SQL databases. Besides, if server side logic is very important for you, then try Tarantool as it has Lua scripting onboard and should have a quite small memory footprint. In my cases the same data in Tarantool occupied twice less than in MySQL. This is because they have small overhead per row and per field and use messagepack for data storing.
I want to implement a cache system for our application, we've started integrating with Memcached. Recently I started hearing of Hypertable, and saw some great benchmarks done with that..
However, I couldn't find good comparison between the two.
Just to get things straight: I know that Hypertable is considered closer to a DB than to a cache. On the other hand, it's not exactly an RDBMS - in fact, it's exactly not an RDBMS. It has its own benefits, but the question is whether they're worth the performance cost (if any)?
Hypertable is an implementation of concepts in Google's BigTable. Namely a column-oriented DB which has properties of being highly denormalized which means it doesn't need joins.
Memcached is an in-memory caching layer which acts like a distributed hashtable, keeping your app from having to hit the actual DB.
Both lend themselves well to being distributed and work well with MapReduce style topologies but they serve different purposes. Memcached/DHT is going to serve to speed access to data in memory while HyperTable/BigTable are actual mechanisms for permanent data storage on disk.
Memcached is used for speeding things up, e.g. results of SQL queries, without going to DB, by storing everything in memory (RAM).
Hypertable (HBase, Cassandra, MongoDB etc.) and others are permanent storage NoSQL DBs (data stored and retrieved from Hard Drives). They can't give you the performance of the reading/writing from/to RAM (e.g. memcached). So these are not compared to one another.
A better use case is to use NoSQL DBs for permanent storage, and using memcached as a front-side fast access cache between web-application and (NoSQL or any) DB.