MongoDB - make in-memory or use cache - performance

I will be creating a 5-node MongoDB cluster. It will be more read-heavy than write-heavy, and I had a question about which design would bring better performance. These nodes will be dedicated to MongoDB only. For the sake of an example, say each node will have 64GB of RAM.
The MongoDB docs state:
MongoDB automatically uses all free memory on the machine as its cache
Does this mean as long as my data is smaller than the available RAM it will be like having an in-memory database?
I also read that it is possible to run MongoDB purely in memory:
http://edgystuff.tumblr.com/post/49304254688/how-to-use-mongodb-as-a-pure-in-memory-db-redis
If my data were quite dynamic (it can range from 50GB to 75GB every few hours), would it theoretically perform better to let MongoDB manage its own cache (the default setup), or to put MongoDB into memory initially and, if the data grows beyond the size of RAM, fall back to swap space on an SSD?

MongoDB's default storage engine memory-maps the data files. This provides an efficient way to access the data while avoiding double caching (i.e. MongoDB's cache is actually the OS page cache).
Does this mean as long as my data is smaller than the available RAM it will be like having an in-memory database?
For read traffic, yes. For write traffic, it is different, since MongoDB may have to journal the write operation (depending on the configuration) and maintain the oplog.
Is it better to run MongoDB from memory only (leveraging tmpfs)?
For read traffic, it should not be better. Putting the files on tmpfs will also avoid double caching (which is good), but the data can still be paged out. Using a regular filesystem instead will be just as fast once the data has been paged in.
For write traffic, it is faster, provided the journal and oplog are also put on tmpfs. Note that in that case, a system crash will result in total data loss. Usually, the performance gain is not worth the risk.
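To see whether a deployment's working set actually fits in RAM, a quick check is to compare the server's reported memory usage with the data and index sizes. Here is a minimal sketch, assuming pymongo, a mongod reachable on localhost:27017, and a placeholder database name mydb:

```python
# Compare mongod's resident memory with the size of data + indexes
# to estimate whether the working set fits in RAM.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)

# serverStatus reports memory usage in MB under the "mem" section.
status = client.admin.command("serverStatus")
resident_mb = status["mem"]["resident"]   # physical RAM used by mongod
virtual_mb = status["mem"]["virtual"]     # virtual address space (includes mapped files)

# dbStats reports data and index sizes in bytes ("mydb" is a placeholder).
stats = client["mydb"].command("dbStats")
data_mb = stats["dataSize"] / (1024 * 1024)
index_mb = stats["indexSize"] / (1024 * 1024)

print(f"resident: {resident_mb} MB, virtual: {virtual_mb} MB")
print(f"data: {data_mb:.0f} MB, indexes: {index_mb:.0f} MB")

# Rough heuristic: if data + indexes fit comfortably under available RAM,
# reads will mostly be served from memory rather than disk.
```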

Related

Elasticsearch: What if size of index is larger than available RAM?

Assuming a single-machine system with an in-memory indexing schema.
I am not able to find this info in the ES docs. Does ES start swapping out the overflowing data, load it when needed and continue working, or does it give an error?
In-memory indices provide better performance at the cost of limiting the index size to the amount of available physical memory.
Via the 1.7 documentation. Memory stores are no longer available in 2.0+.
Under the hood it uses the Lucene RAMDirectory, which will just consume RAM (and eventually swap) until either you hit the Java heap limits and ES crashes with out-of-memory errors, or the system gives up and the OOM killer terminates the Elasticsearch process. Don't use in-memory indices for large indexes, or for any situation where persistence is important.
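As a rough way to see how close an index is to the heap ceiling that matters for memory-backed stores, you can compare the index's store size with each node's configured JVM heap. A minimal sketch using the plain REST API via requests; the node address and index name are placeholders:

```python
import requests

ES = "http://localhost:9200"
INDEX = "my_index"  # placeholder index name

# Total store size of the index in bytes.
index_stats = requests.get(f"{ES}/{INDEX}/_stats/store").json()
store_bytes = index_stats["_all"]["total"]["store"]["size_in_bytes"]

# Configured JVM heap per node.
node_stats = requests.get(f"{ES}/_nodes/stats/jvm").json()
for node_id, node in node_stats["nodes"].items():
    heap_max = node["jvm"]["mem"]["heap_max_in_bytes"]
    print(f"{node.get('name', node_id)}: heap_max={heap_max / 2**30:.1f} GiB")

print(f"{INDEX}: store={store_bytes / 2**30:.1f} GiB")
# If an index on a memory-backed store outgrows the heap, the node fails with
# OutOfMemoryError rather than transparently spilling to disk.
```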

Can Redis use disk as part of an LRU cache?

We have the need for a distributed LRU cache, but one which can use both memory and disk. We have a large dataset, which is stored on disk permanently. From that dataset, we create other calculated datasets, but only when clients need them.
Since these secondary datasets are derived from data which is persistent, we never need to permanently save this derived data.
I thought that Redis would have the ability to use disk as a secondary LRU cache, but have not been able to find any documentation that points to that. It seems like Redis only uses the disk to persist the entire cache. I envisioned that we'd be able to scale out horizontally with a bunch of Redis instances.
If Redis cannot do this, is there another system that does?
If the data does not fit into memory, the OS can swap it out to the disk. This is called virtual memory. You can find an explanation here: http://redis.io/topics/virtual-memory
Remark: you want to retrieve some data, do stuff with it, and you have some intermediate results. Consider whether you may want to distribute your processing, not only the data. Take a look at Apache Hadoop and especially Apache Spark.
The way to solve this problem without changing how your clients work is in fact not to use Redis, but instead a Redis-compatible database like Ardb, which in turn can be configured to use LevelDB under the hood, which supports LRU-type on-disk caches.
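If staying with plain Redis is acceptable, the derived-data scenario in the question maps naturally onto a memory-only, LRU-bounded cache with recompute-on-miss. A minimal sketch, assuming redis-py and a Redis on localhost; derive_dataset() and the key scheme are placeholders for the application's own logic:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Bound the cache and let Redis evict least-recently-used keys when it is full.
r.config_set("maxmemory", "4gb")
r.config_set("maxmemory-policy", "allkeys-lru")

def derive_dataset(dataset_id):
    # Placeholder: recompute the derived data from the permanent on-disk store.
    return {"id": dataset_id, "rows": []}

def get_derived(dataset_id, ttl=3600):
    key = f"derived:{dataset_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    data = derive_dataset(dataset_id)
    r.set(key, json.dumps(data), ex=ttl)  # safe to lose: it can always be recomputed
    return data
```

Because the derived data can always be rebuilt from the persistent primary store, losing whatever Redis evicts (or losing the whole cache on restart) only costs a recomputation, not data.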

Membase and Redis when they have to store on the disk

I see a lot of benchmarks in which people compare Membase with Redis, but only when the database can be stored entirely in memory. Obviously Redis is much better, but if both have to start storing data on disk, which is better?
I am not as familiar with Membase as with Redis, but I know the performance of storing to disk is extremely hardware and (even more so) configuration dependent. With Redis, for example, you have a number of persistence choices with wildly different performance characteristics. Choosing which is most appropriate depends a lot on your use cases, durability requirements, and the resources at your disposal.
For maximum performance (with extremely high durability too) you can suffer hardly any penalty for saving to disk with Redis by setting up a master/production Redis server that NEVER saves to disk, and one or more slave servers which replicate the master's data and aggressively save to disk. This means your master only has to service a few extra READ operations to send the synchronization data to your slave(s), and your slave(s) are the only ones paying any disk I/O penalty. You can restore the data to your master after a restart/crash by temporarily making it a slave of one of your slaves, then promoting it back to master once synchronization is complete.
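A minimal sketch of that split, assuming redis-py and placeholder host names: persistence is switched off on the master and enabled aggressively on a replica, so only the replica pays the disk I/O cost.

```python
import redis

master = redis.Redis(host="redis-master", port=6379)
replica = redis.Redis(host="redis-replica", port=6379)

# Master: never snapshot to disk and keep AOF off.
master.config_set("save", "")
master.config_set("appendonly", "no")

# Replica: follow the master and persist aggressively (AOF, fsync every second).
replica.slaveof("redis-master", 6379)
replica.config_set("appendonly", "yes")
replica.config_set("appendfsync", "everysec")
```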

Does Cassandra use heap memory to store Bloom filters, and how much space do they consume for 100GB of data?

I have come to know that Cassandra uses Bloom filters for performance, and that it stores this filter data in physical memory.
1) Where does Cassandra store these filters? (In heap memory?)
2) How much memory do these filters consume?
When running, the Bloom filters must be held in memory, since their whole purpose is to avoid disk IO.
However, each filter is saved to disk with the other files that make up each SSTable - see http://wiki.apache.org/cassandra/ArchitectureSSTable
The filters are typically a very small fraction of the data size, though the actual ratio seems to vary quite a bit. On the test node I have handy here, the biggest filter I can find is 3.3MB, which is for 1GB of data. For another 1.3GB data file, however, the filter is just 93KB...
If you are running Cassandra, you can check the size of your filters yourself by looking in the data directory for files named *-Filter.db, as sketched below.
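For example, a small script (standard library only; the data directory path is an assumption, adjust it to match data_file_directories in cassandra.yaml) can sum the *-Filter.db components and compare them with the *-Data.db components:

```python
import os
import glob

DATA_DIR = "/var/lib/cassandra/data"  # placeholder path, adjust to your install

filter_bytes = sum(os.path.getsize(p)
                   for p in glob.glob(os.path.join(DATA_DIR, "**", "*-Filter.db"),
                                      recursive=True))
data_bytes = sum(os.path.getsize(p)
                 for p in glob.glob(os.path.join(DATA_DIR, "**", "*-Data.db"),
                                    recursive=True))

print(f"Bloom filters on disk: {filter_bytes / 2**20:.1f} MiB")
print(f"Data files on disk:    {data_bytes / 2**30:.2f} GiB")
if data_bytes:
    print(f"Filter/data ratio:     {filter_bytes / data_bytes:.4%}")
```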

Cache systems - Hypertable vs Memcached

I want to implement a cache system for our application; we've started integrating with Memcached. Recently I started hearing about Hypertable and saw some great benchmarks done with it.
However, I couldn't find good comparison between the two.
Just to get things straight: I know that Hypertable is considered closer to a DB than to a cache. On the other hand, it's not exactly an RDBMS - in fact, it's exactly not an RDBMS. It has its own benefits, but the question is whether they're worth the performance cost (if any)?
Hypertable is an implementation of concepts in Google's BigTable. Namely, it is a column-oriented DB designed to store highly denormalized data, which means it doesn't need joins.
Memcached is an in-memory caching layer which acts like a distributed hashtable, keeping your app from having to hit the actual DB.
Both lend themselves well to being distributed and work well with MapReduce-style topologies, but they serve different purposes. Memcached/DHT serves to speed up access to data in memory, while Hypertable/BigTable are actual mechanisms for permanent data storage on disk.
Memcached is used for speeding things up, e.g. caching the results of SQL queries without going to the DB, by storing everything in memory (RAM).
Hypertable (like HBase, Cassandra, MongoDB, etc.) and others are permanent-storage NoSQL DBs (data is stored on and retrieved from hard drives). They can't give you the performance of reading from and writing to RAM (e.g. Memcached), so the two are not really comparable.
A better approach is to use a NoSQL DB for permanent storage, and Memcached as a fast front-side cache between the web application and the (NoSQL or any other) DB.
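A minimal sketch of that cache-aside pattern, assuming pymemcache; get_from_database() and the key scheme are placeholders for the real permanent store and schema:

```python
import json
from pymemcache.client.base import Client

mc = Client(("localhost", 11211))

def get_from_database(user_id):
    # Placeholder: query MongoDB / HBase / an RDBMS, i.e. the permanent store.
    return {"id": user_id, "name": "example"}

def get_user(user_id, ttl=300):
    key = f"user:{user_id}"
    cached = mc.get(key)
    if cached is not None:
        return json.loads(cached)           # cache hit: served from RAM
    user = get_from_database(user_id)        # cache miss: hit the permanent store
    mc.set(key, json.dumps(user).encode("utf-8"), expire=ttl)
    return user
```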

Resources