Cache systems - Hypertable vs Memcached - caching

I want to implement a cache system for our application, and we've started integrating with Memcached. Recently I started hearing about Hypertable and saw some great benchmarks for it.
However, I couldn't find a good comparison between the two.
Just to get things straight: I know that Hypertable is considered closer to a DB than to a cache. On the other hand, it's not exactly an RDBMS - in fact, it's exactly not an RDBMS. It has its own benefits, but the question is whether they're worth the performance cost (if any)?

Hypertable is an implementation of the concepts in Google's BigTable: a column-oriented DB that is highly denormalized, which means it doesn't need joins.
Memcached is an in-memory caching layer which acts like a distributed hashtable, keeping your app from having to hit the actual DB.
Both lend themselves well to being distributed and work well with MapReduce-style topologies, but they serve different purposes. Memcached (a DHT) speeds up access to data by keeping it in memory, while Hypertable and BigTable are mechanisms for permanent data storage on disk.

Memcached is used for speeding things up, e.g. serving the results of SQL queries without going to the DB, by storing everything in memory (RAM).
Hypertable and similar systems (HBase, Cassandra, MongoDB, etc.) are permanent-storage NoSQL DBs: data is stored on and retrieved from hard drives. They can't give you the performance of reading from and writing to RAM, as memcached does, so the two are not directly comparable.
A better approach is to use a NoSQL DB for permanent storage and memcached as a fast front-side cache between the web application and the (NoSQL or any other) DB.
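To make the "cache in front of the DB" arrangement concrete, here is a minimal sketch in Python. It is only an illustration: SlowStore and CacheAside are hypothetical names, a plain dict stands in for memcached, and the TTL of 60 seconds is an arbitrary assumption.

```python
import time


class SlowStore:
    """Hypothetical stand-in for a disk-backed database (NoSQL or SQL)."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0  # counts how often the backing store is actually hit

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)


class CacheAside:
    """Check the in-memory cache first, fall back to the store on a miss."""

    def __init__(self, store, ttl_seconds=60.0):
        self.store = store
        self.ttl = ttl_seconds
        self.cache = {}  # key -> (expires_at, value); stands in for memcached

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]              # cache hit: no store access
        value = self.store.get(key)      # cache miss: go to permanent storage
        if value is not None:
            self.cache[key] = (time.monotonic() + self.ttl, value)
        return value


store = SlowStore({"user:1": "alice"})
cached = CacheAside(store)
first = cached.get("user:1")   # miss, reads from the store
second = cached.get("user:1")  # hit, served from memory
```

The store is only touched on the first lookup; how long entries may live in the cache before going stale is an application decision.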

Related

Mongodb - make inmemory or use cache

I will be creating a 5-node MongoDB cluster. It will be more read-heavy than write-heavy, and I have a question about which design would bring better performance. These nodes will be dedicated to MongoDB only. For the sake of an example, say each node will have 64GB of RAM.
The MongoDB docs state:
MongoDB automatically uses all free memory on the machine as its cache
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
I also read that it is possible to run MongoDB purely in memory:
http://edgystuff.tumblr.com/post/49304254688/how-to-use-mongodb-as-a-pure-in-memory-db-redis
If my data was quite dynamic (it can range from 50GB to 75GB every few hours), would it theoretically perform better to let MongoDB manage itself with its cache (the default setup), or to put MongoDB into memory initially and, if the data grows beyond the size of RAM, use swap space (SSD)?
MongoDB's default storage engine maps the data files into memory. This provides an efficient way to access the data while avoiding double caching (i.e. the MongoDB cache is actually the page cache of the OS).
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
For read traffic, yes. For write traffic, it is different, since MongoDB may have to journal the write operation (depending on the configuration) and maintain the oplog.
Is it better to run MongoDB from memory only (leveraging tmpfs)?
For read traffic, it should not be better. Putting the files on tmpfs will also avoid double caching (which is good), but the data can still be paged out. Using a regular filesystem instead will be as fast once the data have been paged in.
For write traffic, it is faster, provided the journal and oplog are also put on tmpfs. Note that in that case, a system crash will result in total data loss. Usually, the performance gain is not worth the risk.
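The memory-mapping behaviour described in this answer can be tried out with Python's standard mmap module. This is a generic sketch of memory-mapped file access, not MongoDB's actual code; the file name and record contents are made up for the example.

```python
import mmap
import os
import tempfile

# Write a small data file; this stands in for a database data file.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"record-0001" + b"\x00" * 4085)  # 4096 bytes in total

with open(path, "r+b") as f:
    # Map the whole file into the process address space.
    with mmap.mmap(f.fileno(), 0) as mem:
        header = bytes(mem[:11])        # first touch may page in from disk
        header_again = bytes(mem[:11])  # later reads hit the OS page cache
        mem[0:6] = b"RECORD"            # writes modify the mapped pages
        mem.flush()                     # ask the OS to write dirty pages back

with open(path, "rb") as f:
    on_disk = f.read(11)
```

Once the pages are resident, reads never go through an application-level cache, which is why the answer says a regular filesystem is as fast as tmpfs for reads after the data has been paged in.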

Can Redis use disk as part of a LRU cache?

We need a distributed LRU cache, but one which can use both memory and disk. We have a large dataset, which is stored on disk permanently. From that dataset, we create other calculated datasets, but only when clients need them.
Since these secondary datasets are derived from data which is persistent, we never need to permanently save this derived data.
I thought that Redis would have the ability to use disk as a secondary LRU cache, but have not been able to find any documentation that points to that. It seems like Redis only uses the disk to persist the entire cache. I envisioned that we'd be able to scale out horizontally with a bunch of Redis instances.
If Redis cannot do this, is there another system that does?
If the data does not fit into memory, the OS can swap it out to disk. This is called virtual memory. You can find an explanation here: http://redis.io/topics/virtual-memory
Remark: You want to retrieve some data, do stuff on it and you have some intermediate results. Please check whether you may want to distribute your processing, not only the data. Take a look at Apache Hadoop and especially Apache Spark.
One way to solve this problem without changing how your clients work is not to use Redis itself but a Redis-compatible database such as Ardb, which can be configured to use LevelDB under the hood; LevelDB supports LRU-type on-disk caches.
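For readers who want to see what "LRU in memory with disk as a second tier" means in practice, here is a single-process sketch in Python. It is not how Redis, Ardb or LevelDB work internally; TwoTierLRU and its parameters are hypothetical, and limits are counted in items rather than bytes to keep it short.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TwoTierLRU:
    """Memory LRU that spills evicted entries to disk, itself bounded by LRU."""

    def __init__(self, mem_items, disk_items, directory):
        self.mem_items = mem_items
        self.disk_items = disk_items
        self.directory = directory
        self.mem = OrderedDict()    # key -> value, most recently used last
        self.disk = OrderedDict()   # key -> file path, most recently used last
        self.counter = 0            # used to generate unique file names

    def put(self, key, value):
        self._drop_from_disk(key)   # avoid a stale copy in the disk tier
        self.mem[key] = value
        self.mem.move_to_end(key)
        while len(self.mem) > self.mem_items:
            old_key, old_value = self.mem.popitem(last=False)
            self._spill(old_key, old_value)

    def get(self, key):
        if key in self.mem:
            self.mem.move_to_end(key)
            return self.mem[key]
        if key in self.disk:
            with open(self.disk[key], "rb") as f:
                value = pickle.load(f)
            self.put(key, value)    # promote back into memory
            return value
        return None                 # caller recomputes from the source dataset

    def _spill(self, key, value):
        self.counter += 1
        path = os.path.join(self.directory, f"{self.counter}.pkl")
        with open(path, "wb") as f:
            pickle.dump(value, f)
        self.disk[key] = path
        while len(self.disk) > self.disk_items:
            _, old_path = self.disk.popitem(last=False)
            os.remove(old_path)     # derived data, so dropping it is safe

    def _drop_from_disk(self, key):
        path = self.disk.pop(key, None)
        if path is not None:
            os.remove(path)


cache = TwoTierLRU(mem_items=2, disk_items=2, directory=tempfile.mkdtemp())
for name in ("a", "b", "c", "d", "e"):
    cache.put(name, name.upper())
```

Because the derived datasets can always be rebuilt from the persistent source, an entry that falls off the disk tier is simply recomputed on the next request.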

Is Hadoop a good candidate for use as a key-value store?

Question
Would Hadoop be a good candidate for the following use case:
Simple key-value store (primarily needs to GET and SET by key)
Very small "rows" (32-byte key-value pairs)
Heavy deletes
Heavy writes
On the order of 100 million to 1 billion key-value pairs
Majority of data can be contained on SSDs (solid state drives) instead of in RAM.
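For a sense of scale, the raw payload implied by the figures above can be worked out as below. This is plain arithmetic on the numbers in the list; it deliberately ignores per-entry overhead, indexes and replication, so real memory or disk usage in any store will be higher.

```python
def raw_size_gb(pairs, pair_bytes=32):
    """Raw payload size in decimal gigabytes, with no per-entry overhead."""
    return pairs * pair_bytes / 10**9


low = raw_size_gb(100_000_000)      # 100 million pairs of 32 bytes
high = raw_size_gb(1_000_000_000)   # 1 billion pairs of 32 bytes
print(f"raw payload: {low} GB to {high} GB")
```

That puts the raw data somewhere between 3.2 GB and 32 GB before any overhead is counted.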
More info
I ask because I keep seeing references to the Hadoop file system and how Hadoop is used as the foundation for a lot of other database implementations that aren't necessarily designed for Map-Reduce.
Currently, we are storing this data in Redis. Redis performs great, but since it keeps all of its data in RAM, we have to use expensive machines with upwards of 128GB of RAM. It would be nice to instead use a system that relies on SSDs. This way we would have the freedom to build much bigger hash tables.
We have also stored this data using Cassandra, but Cassandra tends to "break" if the deletes become too heavy.
Hadoop (contrary to popular media opinion) is not a database. What you describe is a database. Thus Hadoop is not a good candidate for you. Also, the rest of this post is opinionated, so feel free to prove me wrong with benchmarks.
If you care about "NoSQL DBs" that are on top of Hadoop:
HBase would be suited for heavy writes, but sucks on huge deletes
Cassandra same story, but writes are not as fast as in HBase
Accumulo might be useful for very frequent updates, but will suck on deletes as well
None of them makes "real" use of SSDs; I don't think any of them gets a huge speedup from them.
All of them suffer from costly compactions once you start to fragment your tablets (in BigTable speak), so deleting is a fairly obvious limiting factor.
What you can do to mitigate the deletion issue is to overwrite with a constant "deleted" value, which works around the compaction. However, this grows your table, which can be costly on SSDs as well. You will also need to filter on read, which likely affects read latency.
From what you describe, Amazon's DynamoDB architecture sounds like the best candidate here, although deletes are costly there too, maybe not as much as in the above alternatives.
BTW: the recommended way of deleting lots of rows from a table in any of the above databases is to drop the table completely. If you can fit your design into this paradigm, any of them will do.
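The "overwrite with a constant deleted value" workaround mentioned above could look roughly like this. It is a sketch against a plain dict, not the client API of HBase, Cassandra or Accumulo; SoftDeleteTable and the DELETED sentinel are hypothetical names.

```python
DELETED = b"\x00DELETED"  # hypothetical sentinel; must never be a real value


class SoftDeleteTable:
    def __init__(self):
        self.rows = {}  # stands in for a table in HBase, Cassandra, etc.

    def put(self, key, value):
        self.rows[key] = value

    def delete(self, key):
        # Overwrite instead of issuing a real delete.
        self.rows[key] = DELETED

    def get(self, key):
        value = self.rows.get(key)
        return None if value == DELETED else value  # filter on read

    def live_fraction(self):
        # Share of stored rows that still hold real data.
        if not self.rows:
            return 1.0
        live = sum(1 for v in self.rows.values() if v != DELETED)
        return live / len(self.rows)


table = SoftDeleteTable()
table.put("k1", b"v1")
table.put("k2", b"v2")
table.delete("k1")
```

The deleted row still occupies a slot, which is the table growth the answer warns about; tracking something like live_fraction gives a signal for when to rebuild or drop the table.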
Although this isn't an answer to your question, in the context of what you say about
It would be nice to instead use a system that relies on SSDs. This way
we would have the freedom to build much bigger hash tables.
you might consider taking a look at Project Voldemort.
Being a Cassandra user myself, I know what you mean when you say that compaction and tombstones are a problem. I have run into TombstoneOverwhelmingException a couple of times myself and hit dead ends.
You might want to have a look at this article by LinkedIn.
It says:
Memcached is all in memory so you need to squeeze all your data into
memory to be able to serve it (which can be an expensive proposition
if the generated data set is large).
And finally
all we do is just mmap the entire data set into the process address
space and access it there. This provides the lowest overhead caching
possible, and makes use of the very efficient lookup structures in the
operating system.
I don't know if this fits your case, but you could consider evaluating Voldemort. Best of luck.

Memcached, Redis, or Couchbase [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I have a Debian server with about 16GB of RAM that I'm using with nginx, several heavy MySQL databases, and some custom PHP apps. I'd like to implement a memory cache between MySQL and PHP, but the databases are too large to store everything in RAM. From my research so far, I'm thinking an LRU cache may be better. Does this rule out Redis? Couchbase is also a consideration.
Supposing there is a single server running nginx + PHP + MySQL instances with some remaining free RAM, the easiest way to use that RAM to cache data is simply to increase the buffer caches of the MySQL instances. Databases already use LRU-like mechanisms to handle their buffers.
Now, if you need to move part of the processing away from the databases, then pre-caching may be an option. Before talking about memcached/redis, a shared memory cache integrated with php such as APC will be efficient provided only one server is considered (actually more efficient than redis/memcached).
Both memcached and redis can be considered for remote caching (i.e. to share the cache between various nodes). I would not rule out redis for this: it can easily be configured for this purpose. Both allow you to define a memory limit and handle the cache with LRU-like behavior.
However, I would not use couchbase here, which is an elastic (i.e. supposed to be used on several nodes) NoSQL key/value store (i.e. not a cache). You could probably move some data from your mysql instances to a couchbase cluster, but using it just for caching is over-engineering IMO.
As Matt Ingenthron pointed out and Hari noted, Couchbase supports working as a direct Memcached replacement. Couchbase uses memcached in a non-elastic way: each node participating in the memcached cluster is discrete, with no persistence, i.e. just a cache. But Couchbase also offers "Couchbase" bucket types, which do provide persistence. Membase is part of the code as well, so Couchbase serves data not only from disk but also from RAM, keeping it there while replicating to other nodes and persisting to disk as changes are applied. I would highly recommend Couchbase 3.x for both caching and persistence in one footprint, or in multiple footprints if you want a caching layer separate from your persistence layer.
We used memcached initially to cache data. In memcached, partitioning data for different applications under different buckets was a real issue. We also have a requirement to flush data from one bucket alone. Monitoring data is another requirement. We moved to Couchbase and use the memcache-style bucket. I think it's much more flexible and efficient to use a Couchbase memcache-style bucket for caching than to use memcached.
Have you ever considered moving your databases entirely to RAM using one of the in-memory NoSQL solutions with persistence? It could take less storage than your original MySQL database, because many NoSQL solutions have a smaller footprint than SQL databases. Besides, if server-side logic is very important for you, try Tarantool: it has Lua scripting on board and should have quite a small memory footprint. In my case, the same data in Tarantool occupied half the space it did in MySQL. This is because it has a small overhead per row and per field and uses MessagePack for data storage.

Distributed and replicated data storage for small amounts of data under Windows

We're looking for a good solution to a caching problem. We'd like to distribute a relatively small amount of data (perhaps tens of GB) among a cluster of web servers such that:
The data is replicated to all nodes
The data is persistent
The data can be accessed locally
Our motivation for a caching solution is that we currently have a single point of failure: a SQL Server database. We're unable to set up a fail-over cluster for this database, unfortunately. We're already using Memcached to a large extent, but we want to avoid the problem where, if a Memcached node goes down, we'd suddenly have a large number of cache misses and therefore a massive number of requests to one endpoint.
We'd prefer instead to have local persistent caches on each web server node so that the resulting load would be distributed. When a retrieval is made, it would pass through the following:
Check for data in Memcached. If it's not there...
Check for data in local persistent storage. If it's not there...
Retrieve data from the database.
When data changes, the cache key is invalidated at both caching layers.
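The lookup order and invalidation rule described above can be sketched as follows. Plain dicts stand in for Memcached, the local persistent store and the database, and TieredReader is a hypothetical name; the sketch also only invalidates the local store of the node handling the update, so propagating the invalidation to other nodes is left open.

```python
class TieredReader:
    def __init__(self, shared_cache, local_store, database):
        self.shared_cache = shared_cache  # stands in for Memcached
        self.local_store = local_store    # stands in for local persistent storage
        self.database = database          # stands in for SQL Server

    def get(self, key):
        """Return (value, tier) where tier names the layer that answered."""
        value = self.shared_cache.get(key)
        if value is not None:
            return value, "memcached"
        value = self.local_store.get(key)
        if value is not None:
            self.shared_cache[key] = value  # repopulate the faster tier
            return value, "local"
        value = self.database.get(key)
        if value is not None:
            self.local_store[key] = value
            self.shared_cache[key] = value
        return value, "database"

    def update(self, key, value):
        self.database[key] = value
        # Invalidate the key at both caching layers.
        self.shared_cache.pop(key, None)
        self.local_store.pop(key, None)


reader = TieredReader({}, {}, {"page:1": "v1"})
first = reader.get("page:1")    # falls through to the database
second = reader.get("page:1")   # now answered by the shared cache
reader.shared_cache.clear()     # simulate a Memcached node going down
third = reader.get("page:1")    # the local store absorbs the miss
reader.update("page:1", "v2")
fourth = reader.get("page:1")   # both caches were invalidated
```

The third lookup shows the point of the design: losing the shared cache sends the request to local storage rather than to the database.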
We've been looking at a bunch of potential solutions, but none of them seem to match exactly what we need:
CouchDB
This is pretty close; the data model we'd like to cache is very document-oriented. However, its replication model isn't exactly what we're looking for. It seems to me as though replication is an action you have to perform rather than a permanent relationship among nodes. You can set up continuous replication, but this doesn't persist between restarts.
Cassandra
This solution seems to be mostly geared toward those with large storage requirements. We have a large number of users, but small amounts of data. Cassandra looks to be able to support n fail-over nodes, but 100% replication among nodes doesn't seem to be what it's intended for; instead, it seems more geared toward distribution only.
SAN
One attractive idea is that we can store a bunch of files on a SAN or similar type of appliance. I haven't worked with these before, but it seems like this would still be a single point of failure; if the SAN goes down, we'd suddenly be going to the database for all cache misses.
DFS Replication
A simple Google search revealed this. It seems to do what we want; it synchronizes files across all nodes in a replication cluster. But the marketing text makes it look like it's more of a system for ensuring documents are copied to different office locations. Also, it has limits, like a file count maximum, that wouldn't work well for us.
Have any of you had similar requirements to ours and found a good solution that meets your needs?
We've been using Riak successfully in production for several months now for a problem that's somewhat similar to what you describe. We had also evaluated CouchDB and Cassandra beforehand.
The advantage of Riak for this sort of problem, IMO, is that distribution and data replication are at the core of the system. You define how many replicas of the data you want across the cluster and it takes care of the rest (it's a bit more complicated than that of course, but that's the essence). We went through adding nodes, removing nodes, and having nodes crash, and it has proven surprisingly resilient.
It's a lot like Couch in other respects: document-oriented, REST interface, Erlang.
You could also look at Hazelcast.
It does not persist the data but provides a fail-over system. Each node can have a number of other nodes back up its data in case it fails.

Resources