Couchbase Community Edition as a Cache - caching

It is mentioned here (http://www.couchbase.com/memcached) that Couchbase can be used as a caching layer. I am supposed to use the Community Edition for my caching layer. From what I found on the Internet, many large-scale organizations use it under heavy load, but their cached objects are around 1 KB to 100 KB in size. I want to know:
Will there be performance drawbacks when large objects (1 MB to 10 MB) are cached and replicated?
Will data be synchronized/replicated among nodes as soon as it is updated?
Does anyone have experience with this?

To answer your questions:
Will there be performance drawbacks when large objects (1 MB to 10 MB) are cached and replicated?
Couchbase has a maximum document size of 20 MB for Couchbase-type buckets. Depending on your settings, each document will need to be written both to disk and across the network to each replica node. Other than the disk and network bandwidth this requires, you shouldn't see any particular performance issues.
Will data be synchronized/replicated among nodes as soon as they are updated?
As documented in the Couchbase Admin Guide, data is queued to be replicated to replica nodes as soon as it is received by the master.
Couchbase automatically shards each Bucket into a number of vBuckets, and each vBucket is "owned" by just a single master node, so a client will normally only need to communicate with one node for a particular document; therefore replication time isn't relevant for consistency (it's mainly there to provide backup copies in the event of a node failure).
You may also want to look at the high level Architecture and Concepts of Couchbase to see how it all fits together.
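The key-to-node routing described in this answer can be sketched roughly as follows. This is a simplified illustration, not Couchbase's actual client code: the CRC32-modulo mapping, the vBucket count of 1024, the node names, and the round-robin vBucket map are all assumptions made for the example. Real clients obtain the vBucket map from the cluster.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed vBucket count for this sketch
NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names

# Hypothetical vBucket map: each vBucket has exactly one active (master) node.
VBUCKET_MAP = {vb: NODES[vb % len(NODES)] for vb in range(NUM_VBUCKETS)}


def vbucket_for(key: str) -> int:
    """Map a document key to a vBucket (simplified: CRC32 modulo the count)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def master_for(key: str) -> str:
    """Return the single node that owns the active copy of this key."""
    return VBUCKET_MAP[vbucket_for(key)]
```

Because the mapping is deterministic, every client resolves a given key to the same master node, which is why reads and writes for one document do not depend on how quickly the replicas catch up.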

Related

Apache Ignite NearCaches vs CacheMode

Apache Ignite has two concepts: one of them is NearCache, and the other is the CacheMode enumeration.
What is the main difference between the two concepts?
A near cache is a local hot cache that keeps frequently accessed data on the node that uses it. It can significantly speed up data processing by saving network round-trips.
CacheMode defines how your data is stored. It can be LOCAL for a single node, which means the data is not distributed across the grid. The other two modes, PARTITIONED and REPLICATED, mean respectively that the cache data is divided between nodes into roughly equal parts (called partitions), or that each node keeps the full data set of that cache.
PARTITIONED lets you keep more data in the grid than fits on a single machine; REPLICATED gives 100% data survivability (if all nodes crash except one, you will not lose your data).
You can find more details in the documentation: https://apacheignite.readme.io/docs/near-caches and https://apacheignite.readme.io/docs/cache-modes
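To make the storage difference between PARTITIONED and REPLICATED concrete, here is a toy model in Python. It does not use the Ignite API: the node names, the CRC32-based placement, and the absence of backup copies in the partitioned case are all assumptions of this sketch.

```python
import zlib
from collections import Counter

NODES = ["n1", "n2", "n3"]  # hypothetical three-node grid


def owners(key, mode):
    """Return the nodes that store `key` under the given cache mode."""
    if mode == "REPLICATED":
        return list(NODES)  # every node keeps a full copy
    if mode == "PARTITIONED":
        # Toy placement: one owner per key, no backups (an assumption).
        return [NODES[zlib.crc32(key.encode("utf-8")) % len(NODES)]]
    raise ValueError(f"unsupported mode: {mode}")


def entries_per_node(keys, mode):
    """Count how many entries each node ends up storing."""
    counts = Counter()
    for key in keys:
        counts.update(owners(key, mode))
    return counts


keys = [f"key-{i}" for i in range(300)]
partitioned = entries_per_node(keys, "PARTITIONED")
replicated = entries_per_node(keys, "REPLICATED")
```

In this model the partitioned counts add up to the 300 entries, so each node stores only a share of them, while in the replicated case every node stores all 300. That is the capacity versus survivability trade-off described above.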

Amount of data storage : HDFS vs NoSQL

In several sources on the internet, it's explained that HDFS is built to handle a greater amount of data than NoSQL technologies (Cassandra, for example). In general when we go further than 1TB we must start thinking Hadoop (HDFS) and not NoSQL.
Besides the architecture and the fact that HDFS supports batch processing and that most NoSQL technologies (e.g. Cassandra) perform random I/O, and besides the schema design differences, why can't NoSQL Solutions (again, for example Cassandra) handle as much data as HDFS?
Why can't we use a NoSQL technology as a Data Lake? Why should we only use them as hot storage solutions in a big data architecture?
why can't NoSQL Solutions (... for example Cassandra) handle as much data as HDFS?
HDFS has been designed to store massive amounts of data and support batch mode (OLAP) whereas Cassandra was designed for online transactional use-cases (OLTP).
The current recommendation for server density is 1TB/node for spinning disk and 3TB/node when using SSD.
In the Cassandra 3.x series, the storage engine has been rewritten to improve node density. Furthermore there are a few JIRA tickets to improve server density in the future.
There is a limit right now for server density in Cassandra because of:
Repair. With an eventually consistent DB, repair is mandatory to re-sync data in case of failures. The more data you have on one server, the longer it takes to repair (more precisely, to compute the Merkle tree, a binary tree of digests). But the issue of repair is mostly solved with incremental repair, introduced in Cassandra 2.1.
Compaction. With an LSM tree data structure, any mutation results in a new write on disk, so compaction is necessary to get rid of obsolete or deleted data. The more data you have on one node, the longer compaction takes. There are also some solutions to address this issue, mainly the new DateTieredCompactionStrategy, which has some tuning knobs to stop compacting data after a time threshold. A few people are using DateTiered compaction in production with densities of up to 10TB/node.
Node rebuild. Imagine one node crashes and is completely lost: you'll need to rebuild it by streaming data from other replicas. The higher the node density, the longer it takes to rebuild the node.
Load distribution. The more data you have on a node, the greater the load average (high disk I/O and high CPU usage). This will greatly impact the node latency for real-time requests. Whereas a difference of 100ms is negligible for a batch scenario that takes 10h to complete, it is critical for a real-time database/application subject to a tight SLA.
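A rough back-of-the-envelope calculation shows why node rebuild time grows with density. The 50 MB/s sustained streaming rate used here is purely an assumed figure for illustration, not a Cassandra default or a measured value; real throughput depends on hardware, network, and throttling settings.

```python
def rebuild_hours(node_data_tb: float, stream_mb_per_s: float) -> float:
    """Hours needed to stream a node's data at a sustained rate."""
    total_mb = node_data_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    return total_mb / stream_mb_per_s / 3600


ASSUMED_RATE_MB_PER_S = 50  # hypothetical streaming throughput

for density_tb in (1, 3, 10):
    hours = rebuild_hours(density_tb, ASSUMED_RATE_MB_PER_S)
    print(f"{density_tb} TB/node: about {hours:.1f} h to rebuild")
```

Under that assumption the rebuild time scales linearly with the amount of data on the node, so a tenfold increase in density means a tenfold longer window during which the cluster runs with one replica missing.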

Cache huge data in-memory

I am looking for an in-memory cache solution which can handle big data (<5GB). For a user-entered search term, the database (Elasticsearch) returns a large amount of data, which the tool analyzes and shows across different web pages. My problem is that I want to cache this data temporarily until the user session ends, so that I don't have to fetch it again from Elasticsearch every time the user opens a new page. It has to be in-memory because a disk-based cache would take over a minute, which is too slow.
I initially thought memcached but it has a max limit of 128MB. After reading quite a bit, Redis seems suitable but it is unclear to me whether a bunch of Redis nodes can work in tandem or not. Is it possible to set up a pool of many Redis nodes so that a suitable node will be automatically chosen on SET and the data returned upon GET without me having to specify the node?
TL;DR
Problem: Cache big data (<5GB) in an in-memory cache
Possible solution: Redis
Question: Can I pool a bunch of Redis nodes so that I can fetch a key stored in any of them without specifying a particular node? I don't need to distribute my data since data for a single user will fit into the RAM of a single node.
A Redis Cluster sounds like a good fit for your use case!
Redis Cluster provides a mechanism for data sharding by means of hash slots. There are 16384 slots, and they are distributed evenly over the nodes in your cluster when you set it up.
Whenever you store a value in the cluster, the hash slot for the given key is calculated and the command is routed to the node responsible for that slot (cluster-aware clients handle this for you, following redirections when needed). You can query your data again afterwards in the same way. So the answer to your question is certainly yes.
However, the maximum value size per key is 512MB. I'm not sure I understood your storage requirement correctly; I assume 5GB is the estimated total across all users.
Check out the Redis Cluster tutorial.
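The slot calculation itself is small enough to sketch in Python. Redis Cluster hashes the key with CRC16 (the XMODEM variant) and takes the result modulo 16384; if the key contains a non-empty `{...}` hash tag, only the tag is hashed. The function name `key_slot` is made up for this example, and a real application would rely on a cluster-aware client library rather than computing slots by hand.

```python
import binascii

HASH_SLOTS = 16384


def key_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot for a key (CRC16/XMODEM mod 16384)."""
    data = key.encode("utf-8")
    # Hash tags: if the key contains "{...}" with a non-empty body,
    # only that body is hashed, so related keys can share a slot.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the XMODEM CRC16.
    return binascii.crc_hqx(data, 0) % HASH_SLOTS
```

For the use case in the question, a hash tag such as `{session123}` (a hypothetical key prefix) in every key of one user session would place all of that session's keys in the same slot, and therefore on the same node.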
You can also look into NCache (.NET) or TayzGrid (Java) by Alachisoft.
Both of these solutions provide distributed caching with dynamic clustering, which lets you add or remove nodes in the cluster at runtime without losing any data. An intelligent client also makes sure each key's record is fetched from or stored on the appropriate node.

MongoDB capacity planning

I have an Oracle database with around 7 million records/day (~300 GB), and I want to switch to MongoDB.
To set up a POC, I'd like to know how many nodes I need. I think 2 shards, each a replica set of 3 nodes, will be enough, but I'd like to hear your thoughts on it :)
I'd like to have an HA setup :)
Thanks in advance!
For MongoDB to work efficiently, you need to know your working set size. Work out how much data 7 million records/day amounts to; this is the active data that needs to stay in RAM for high performance.
Also, be very sure WHY you are migrating to MongoDB. I'm guessing that in your case it is scalability,
but know your data well before doing so.
For your POC, two shards means roughly 150 GB on each. If you have that much disk available, no problem.
Give some consideration to your shard keys: which fields does it make sense to shard your data set on? This will affect the decision of how many shards to deploy versus the capacity of each shard. You might go with relatively few shards (maybe two or three big, deep shards) if your data can easily be segmented into halves or thirds, or with several more lighter, thinner shards if you can shard on a more diverse key.
It is relatively straightforward to upgrade from a MongoDB replica set configuration to a sharded cluster (each shard is actually a replica set). Rather than predetermining that sharding is the right solution to start with, I would think about what your reasons for sharding are (e.g., will your application requirements outgrow the resources of a single machine, how much of your data set will be the active working set for queries, etc.).
It would be worth starting with replica sets and benchmarking this as part of planning your architecture and POC.
Some notes to get you started:
MongoDB's journaling, which is enabled by default as of 1.9.2, provides crash recovery and durability in the storage engine.
Replica sets are the building block for high availability, automatic failover, and data redundancy. Each replica set needs a minimum of three nodes (for example, three data nodes or two data nodes and an arbiter) to enable failover to a new primary via an election.
Sharding is useful for horizontal scaling once your data or writes exceed the resources of a single server.
Other considerations include planning your documents based on your application usage. For example, if your documents will be updated frequently and grow in size over time, you may want to consider manual padding to prevent excessive document moves.
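The three-node minimum mentioned in these notes follows from the fact that electing a primary requires a majority of the voting members. A minimal sketch of that arithmetic (the function names are made up for this example and are not part of any MongoDB driver):

```python
def majority(voting_members: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, reachable: int) -> bool:
    """True if the reachable members are enough to elect a primary."""
    return reachable >= majority(voting_members)


# A two-member set cannot survive the loss of one member,
# while a three-member set (or two data nodes plus an arbiter) can.
for members in (2, 3, 5):
    survives_one_failure = can_elect_primary(members, members - 1)
    print(f"{members} voting members: survives one failure = {survives_one_failure}")
```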
If this is your first MongoDB project you should definitely read the FAQs on Replica Sets and Sharding with MongoDB, as well as for Application Developers.
Note that choosing a good shard key for your use case is an important consideration. A poor choice of shard key can lead to "hot spots" for data writes, or unbalanced shards if you plan to delete large amounts of data.
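The "hot spot" effect can be illustrated with a toy model. This is not MongoDB's chunk splitting or balancer logic: the range boundaries, the shard count, and the MD5-based hashing below are assumptions chosen only to contrast a monotonically increasing key with a hashed one.

```python
import hashlib
from collections import Counter

SHARDS = 3  # hypothetical shard count
# Hypothetical range boundaries: shard 0 holds keys < 1000,
# shard 1 holds keys < 2000, shard 2 holds everything above.
RANGE_UPPER_BOUNDS = [1000, 2000]


def range_shard(key: int) -> int:
    """Range-based placement; the last range is open-ended."""
    for shard, upper in enumerate(RANGE_UPPER_BOUNDS):
        if key < upper:
            return shard
    return len(RANGE_UPPER_BOUNDS)


def hashed_shard(key: int) -> int:
    """Hash-based placement: spreads neighbouring keys across shards."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS


# New inserts with an ever-increasing key (e.g. a counter or timestamp).
new_keys = range(3000, 4000)
range_load = Counter(range_shard(k) for k in new_keys)
hashed_load = Counter(hashed_shard(k) for k in new_keys)
```

In this model every new insert with an increasing key lands on the last shard, while the hashed key spreads the same inserts over all three. The trade-off is that a hashed key gives up efficient range queries on that field.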

Distributed and replicated data storage for small amounts of data under Windows

We're looking for a good solution to a caching problem. We'd like to distribute a relatively small amount of data (perhaps tens of GB) among a cluster of web servers such that:
The data is replicated to all nodes
The data is persistent
The data can be accessed locally
Our motivation for a caching solution is that we currently have a single point of failure: a SQL Server database. We're unable to set up a fail-over cluster for this database, unfortunately. We're already using Memcached to a large extent, but we want to avoid the problem where if a Memcached node goes down, we'd suddenly have a large amount of cache misses and therefore experience a massive amount of requests to one endpoint.
We'd prefer instead to have local persistent caches on each web server node so that the resulting load would be distributed. When a retrieval is made, it would pass through the following:
Check for data in Memcached. If it's not there...
Check for data in local persistent storage. If it's not there...
Retrieve data from the database.
When data changes, the cache key is invalidated at both caching layers.
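The lookup and invalidation flow described in the question can be sketched as follows. Plain dictionaries stand in for Memcached and for the local persistent store, and `load_from_db` is a hypothetical callback. Writing a value back to the faster layers after a miss is an assumption of this sketch; the question does not spell that step out.

```python
class TieredCache:
    """Memcached first, then local persistent storage, then the database."""

    def __init__(self, load_from_db):
        self.memcached = {}    # stand-in for a Memcached client
        self.local_store = {}  # stand-in for the local persistent store
        self.load_from_db = load_from_db  # hypothetical database loader

    def get(self, key):
        # 1. Check Memcached.
        if key in self.memcached:
            return self.memcached[key]
        # 2. Check local persistent storage.
        if key in self.local_store:
            value = self.local_store[key]
            self.memcached[key] = value  # assumed: repopulate the faster layer
            return value
        # 3. Fall back to the database and fill both cache layers.
        value = self.load_from_db(key)
        self.local_store[key] = value
        self.memcached[key] = value
        return value

    def invalidate(self, key):
        """When data changes, drop the key from both caching layers."""
        self.memcached.pop(key, None)
        self.local_store.pop(key, None)
```

With this shape, losing the Memcached layer only sends requests to each web server's local store, so the database sees traffic only for keys that are missing from both layers.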
We've been looking at a bunch of potential solutions, but none of them seem to match exactly what we need:
CouchDB
This is pretty close; the data model we'd like to cache is very document-oriented. However, its replication model isn't exactly what we're looking for. It seems to me as though replication is an action you have to perform rather than a permanent relationship among nodes. You can set up continuous replication, but this doesn't persist between restarts.
Cassandra
This solution seems to be mostly geared toward those with large storage requirements. We have a large number of users, but small amounts of data. Cassandra looks to be able to support any number of fail-over nodes, but 100% replication among nodes doesn't seem to be what it's intended for; instead, it seems more geared toward distribution only.
SAN
One attractive idea is that we can store a bunch of files on a SAN or similar type of appliance. I haven't worked with these before, but it seems like this would still be a single point of failure; if the SAN goes down, we'd suddenly be going to the database for all cache misses.
DFS Replication
A simple Google search revealed this. It seems to do what we want; it synchronizes files across all nodes in a replication cluster. But the marketing text makes it look like it's more of a system for ensuring documents are copied to different office locations. Also, it has limits, like a file count maximum, that wouldn't work well for us.
Have any of you had similar requirements to ours and found a good solution that meets your needs?
We've been using Riak successfully in production for several months now for a problem that's somewhat similar to what you describe. We also evaluated CouchDB and Cassandra beforehand.
The advantage of Riak for this sort of problem, in my opinion, is that distribution and data replication are at the core of the system. You define how many replicas of the data you want across the cluster and it takes care of the rest (it's a bit more complicated than that of course, but that's the essence). We went through adding nodes, removing nodes, and having nodes crash, and it has proven surprisingly resilient.
It's a lot like Couch in other matters - document oriented, REST interface, Erlang.
You can check out Hazelcast.
It does not persist the data but provides a fail-over system. Each node can have a number of other nodes back up its data in case it fails.

Resources