In MongoDB, if you want, say, 3 shards with replica sets, you end up needing a minimum of 7 servers for production use (2 per shard for high availability, plus at least 1 arbiter).
For RethinkDB, I can't find an equivalent rule of thumb or any good suggestions in terms of cluster architecture & design.
Any help is welcome.
You need at least 3 servers for automatic failover to work. You generally want one shard per server. I'd recommend starting with that unless you're doing out-of-date reads or the write load on your servers is too high, in which case I'd switch to having one shard or replica per server (so number of servers = number of shards * replication setting).
RethinkDB doesn't have separate arbiters, so those don't need to enter into your calculation.
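As a concrete starting point, here's a minimal sketch with the official RethinkDB Python driver, assuming a 3-server cluster that is already joined; the database and table names are placeholders:

    # Minimal sketch, RethinkDB Python driver (2.4+ import style assumed).
    from rethinkdb import RethinkDB

    r = RethinkDB()
    conn = r.connect(host="localhost", port=28015)

    # One shard per server, replicated to all 3 servers, so the table keeps
    # accepting writes (with automatic failover) if a single server dies.
    r.db("test").table_create("events", shards=3, replicas=3).run(conn)

With 3 servers and replicas=3, every server holds a copy of every shard, which is the simplest layout that satisfies the 3-server requirement for automatic failover.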
Related
I would like some guidance on configuring replicas and shards for a RethinkDB cluster.
Let's say my cluster consists of 4 instances, all in the same region but in different AZs. Should I set the number of replicas to 4? What are the trade-offs between 3 and 4 replicas in this configuration? How do I determine how many shards to create, and does the number of shards affect disk usage or performance depending on the number of replicas I have chosen?
For best performance you usually want one shard per server. (The only exception to this is if your servers have a lot of processors, in which case you might get better performance from having 2-3 shards per server.) The number of replicas should be at least 3 if you want auto-failover to work; making it higher will allow your cluster to survive a higher number of concurrent server failures.
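For the 4-instance cluster above, one way to apply this is to reconfigure an existing table with the Python driver (a sketch only; the table name is a placeholder and the shard/replica counts are one reasonable choice, not the only one):

    # Hedged sketch: 4 shards -> one shard per server; 3 replicas of each
    # shard so a single server can fail with automatic failover intact.
    from rethinkdb import RethinkDB

    r = RethinkDB()
    conn = r.connect(host="localhost", port=28015)

    r.table("events").reconfigure(shards=4, replicas=3).run(conn)

    # Inspect how the shards and replicas were placed across the servers.
    print(r.table("events").config().run(conn))

Going from 3 to 4 replicas trades extra disk usage and write amplification for surviving one more concurrent server failure.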
I am now testing clustering with Elasticsearch and have a question about the replicas between the nodes.
As you can see in the screenshot from the Head plugin, I have 2 indexes.
movies has 5 shards and 2 replicas
students has 5 shards and 1 replica
Which one is better and which one is faster with 3 active nodes and why?
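For reference, the two indexes were created with settings along these lines (a sketch with the Python Elasticsearch client; host, client version, and the classic body-style API are assumptions, and mappings are omitted):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # movies: 5 primary shards, each with 2 replica copies (15 shard copies total)
    es.indices.create(index="movies", body={
        "settings": {"number_of_shards": 5, "number_of_replicas": 2}
    })

    # students: 5 primary shards, each with 1 replica copy (10 shard copies total)
    es.indices.create(index="students", body={
        "settings": {"number_of_shards": 5, "number_of_replicas": 1}
    })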
The costs of having a higher number of replicas would be:
more storage space required (obviously)
lower indexing performance
while the advantages would be:
better search performance
better resiliency
Note that even though you have 2 replicas, it does not mean that your cluster can endure 2 nodes going down, since all indexing requests would fail if only one out of the 3 copies of a shard is available (because of the indexing quorum).
For a detailed explanation, please refer to the official documentation.
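To make the quorum point concrete, here is a quick back-of-the-envelope check in plain Python (just arithmetic, following the documented int(copies / 2) + 1 rule per replication group; the single-replica special case is ignored here):

    # 1 primary + number_of_replicas = total copies of each shard.
    def write_quorum(number_of_replicas: int) -> int:
        copies = 1 + number_of_replicas
        return copies // 2 + 1

    # With 2 replicas there are 3 copies of each shard, so the quorum is 2.
    # If 2 of the 3 nodes go down, only 1 copy is reachable, 1 < 2, and
    # indexing requests start failing even though searches may still work.
    print(write_quorum(2))  # -> 2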
"Better" is subjective.
With two replicas, you can handle two of the three machines in your cluster going down, though at the price of writing all the data to every machine. Read performance should also be higher as the cluster has more nodes from which to request the data.
With one replica, you can only survive the outage of one machine in your cluster, but you'll get a performance boost because only 2 copies of the data are written across the 3 servers (less I/O on each server).
So it comes down to risk and performance. Hope that helps.
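One practical note on that trade-off: the replica count (unlike the shard count) can be changed on a live index, so you can start low and raise it later. A sketch with the Python client (host, client version, and the body-style API are assumptions; the index name comes from the question above):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Raise students from 1 replica to 2 (or lower it) without reindexing.
    es.indices.put_settings(index="students", body={
        "index": {"number_of_replicas": 2}
    })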
By default, Elasticsearch is configured to create new indexes with 5 shards.
Is there a reason to use 5 shards locally (on my development machine) and on the continuous integration server (for integration tests)? Is it better to use 1?
Obviously I don't care about scalability in those cases, I just want the simplest setup.
The simplest setup is 1 primary shard, 0 replicas.
If you only have one node and the replica count is >0, the cluster health will always be yellow. Not a problem per se, but those replicas will never be allocated and so aren't needed.
If you want to test search response time with that one shard, for example, whether 1 is enough or you need more depends on several factors. The simplest rule of thumb is to keep shards no larger than 30-50 GB, but that number, too, depends on your data and workload.
So, I'd say if you have one node, start with 1 primary, 0 replicas. If that primary gets too "large", think about having more primaries (each shard will do part of the work, and each will use one CPU core for searching).
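A sketch of that simplest setup with the Python client (index name, host, and the body-style API are assumptions), so a single-node dev or CI cluster reports green instead of yellow:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # One primary shard, no replicas: the simplest layout for local
    # development and integration tests on a single node.
    es.indices.create(index="it-tests", body={
        "settings": {"number_of_shards": 1, "number_of_replicas": 0}
    })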
Once you've pushed some data with a specific shard configuration, you cannot change the number of shards without re-indexing your data. So my guess is that the default configuration of Elasticsearch is chosen so that you can scale your cluster to 5 nodes (each node then gets one shard) without headaches.
From the Elasticsearch documentation:
A new index in Elasticsearch is allotted five primary shards by default. That means that we can spread that index out over a maximum of five nodes, with one shard on each node. That’s a lot of capacity, and it happens without you having to think about it at all!
We have a Java application with embedded Elasticsearch in a cluster of 14 nodes. All the data resides in a central database and is indexed in Elasticsearch for querying. A full reindex can be done at any time.
The system is very query-heavy and the amount of writes is small. The number of documents will not be higher than, say, 300,000.
The size of each document varies greatly, from just a couple of IDs to text extracted from, e.g., Word documents of several pages.
I want to make sure that, in case of a serious breakdown, one or two available nodes are sufficient for the system to keep working.
Write consistency should not be a problem, since the master copy of the data is in the database, and it seems that ES is capable of resolving conflicting data by using the newest version (which should be all right in our case).
My first thought is to use 1 shard and 13 replicas. This naturally ensures that all nodes have access to all the data. The same could be accomplished with 2 shards / 13 replicas, so it follows that to make all data available on every node, the number of replicas should be the number of nodes - 1, regardless of the number of shards (which could be anything).
If the requirement is relaxed to "2 nodes should be up at any time", then a distribution of x shards / (number of nodes - 2) replicas should be sufficient.
So, for the question:
Assuming the above setup and that my reasoning is correct, does a setup with 1 shard / 13 replicas make sense, or is there anything to gain by adding more shards and running, e.g., a 4 shards / 13 replicas setup?
After a good bit of research and talking to ES gurus:
As long as the shard size is small enough, the most efficient way of setting up this cluster is indeed 1 shard only, with 13 replicas. I have not been able to pinpoint the threshold shard size at which this starts to perform worse.
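A sketch of that accepted setup with the Python client (index name, host, and the body-style API are assumptions): a single primary shard replicated to every other node, so each of the 14 nodes holds a full copy of the data.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # 1 primary shard + 13 replicas = 14 copies, one per node in the cluster.
    es.indices.create(index="documents", body={
        "settings": {"number_of_shards": 1, "number_of_replicas": 13}
    })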
If the index is big, you will need more than one shard (if you want performance). Do you really need 13 replicas? If you configure only 2 replicas, ES manages them to keep it that way: if the node holding the primary fails, ES will promote a replica and create a new one. Maybe you will need a balancer node too.
According to the Elasticsearch documentation, the rule for the write_consistency level quorum is:
quorum (>replicas/2+1)
Using ES 0.19.10, on a setup with 16 shards / 3 replicas we will get
16 primary shards
48 replicas
Running 2 nodes, we will have 16 (primary) + 16 (replica) = 32 active shards.
For the quorum rule to be met, quorum > 48/2 + 1 = 25 active shards.
Now, testing this proves otherwise: the write_consistency level is not met (write operations time out) until we have 3 nodes running. This kind of makes sense, since we could get a split-brain between two groups of 2 nodes each in this setup, but I don't quite understand how this rule is supposed to work. Am I using the wrong numbers here?
Primary shard count doesn't actually matter, so I'm going to replace it with N.
If you have an index with N shards and 2 replicas, there are three shards in each replication group. This means the quorum is two: the primary plus one of the replicas. You need two active shards, which usually means two active machines, to satisfy the write consistency parameter.
An index with N shards and 3 replicas has four shards in the replication group (primary + 3 replicas), so a quorum is three.
An index with N shards and 1 replica is a special case, since you can't really have a quorum with only two shards. With only one replica, Elasticsearch only requires a single active shard (e.g. the primary), so the quorum setting is identical to the one setting for this particular arrangement.
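Put as arithmetic (plain Python, nothing Elasticsearch-specific), the quorum is computed per replication group of 1 primary plus its replicas, which is why the primary shard count N never enters into it:

    def required_active_copies(number_of_replicas: int) -> int:
        # Replication group = 1 primary + its replicas.
        group_size = 1 + number_of_replicas
        if number_of_replicas <= 1:
            # Special case: with 0 or 1 replicas, only the primary is required.
            return 1
        return group_size // 2 + 1

    for replicas in (1, 2, 3):
        print(replicas, "replicas ->", required_active_copies(replicas), "active copies needed")
    # 1 replicas -> 1 active copies needed
    # 2 replicas -> 2 active copies needed
    # 3 replicas -> 3 active copies needed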
A few notes:
0.19 is really old, you should definitely, absolutely, positively upgrade. I can't even count how many bugfixes and performance improvements have been added since that release :)
Write consistency is merely a gateway check. Before executing the indexing request, the node will do a straw-poll to see if write_consistency is met. If it is, it tries to execute the index and push the replication. This doesn't guarantee that the replicas will succeed...they could easily fail and you'll see it in the response. It is simply a mechanism to halt the indexing process if the consistency setting is not satisfied.
A "fully replicated" setup with two nodes is 1 primary shard + 1 replica. Each node has a complete set of data. There is no reason to have more replicas, since ES refuses to put the copies of the same data on the same machine (doesn't make sense, doesn't help HA). The inability to index is just a side effect of write consistency, but it's pointing out a bigger problem with your setup :)