Do all the nodes in a Cassandra cluster know the "partition key ranges" of each other? - cassandra-2.0

Let's say I have a Cassandra cluster with the following scheme:
(76-100) Node1 --- Node2 (0-25)
           |         |
(51-75)  Node4 --- Node3 (26-50)
Each node is primarily responsible for a range of partition keys. For a total range of 0-100, I have indicated above which range each node is responsible for.
Now, let's say Node 1 is the coordinator handling requests, and a read request for partition key 28 reaches Node 1.
How does Node 1 know that Node 2 is the primary node for partition key 28? Does each node have a mapping of node IDs to the partition key ranges they are responsible for?
For instance,
{Node1:76-100, Node2: 0-25, Node3: 26-50, Node4: 51-75}
Is this mapping present as a global configuration in all the nodes, since any node can act as coordinator when requests are forwarded in round-robin fashion?
Thanks

The mapping is not present as a global configuration. Rather, each node maintains its own copy of the state of the other nodes in the cluster. Typically the cluster uses the gossip protocol to frequently exchange information about the other nodes with a few nearby nodes. In this way the mapping information rapidly propagates to all the nodes in the cluster, even if there are thousands of nodes.
It is necessary for every node to know how to map partition keys to token values, and to know which node is responsible for that token. This is so that every node can act as a coordinator to handle any request by sending it to the exact nodes that are handling that key.
Taking it a step further, if you use, for example, the current Java driver, you can have the client use a token-aware routing policy. This works by the client driver also getting a copy of the information about how to map keys to nodes. Then when you issue a request, it will be sent directly to a node that is handling that key. This gives a nice performance boost.
Generally you do not need to worry about how the keys are mapped, since if you use vnodes and the Murmur3Partitioner, the cluster will take care of creating the mapping of keys to balance the load across the cluster as nodes are added and removed.
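As an illustration of that token-aware routing, here is a minimal sketch assuming the DataStax Java driver 3.x; the contact point, keyspace and table names are placeholders, not taken from the question.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareExample {
    public static void main(String[] args) {
        // Wrap the default DC-aware policy in a token-aware policy so that
        // requests go directly to a replica that owns the partition key.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")  // placeholder contact point
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                .build();
        Session session = cluster.connect();

        // With a prepared statement the driver knows the partition key, so this
        // read for key 28 is routed straight to a node that owns that token range.
        PreparedStatement ps = session.prepare(
                "SELECT * FROM my_keyspace.my_table WHERE partition_key = ?");
        session.execute(ps.bind(28));

        cluster.close();
    }
}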

Related

Number of nodes AWS Elasticsearch

I read the documentation, but unfortunately I still don't understand one thing. While creating an AWS Elasticsearch domain, I need to choose "Number of nodes" in the "Data nodes" section.
If I specify 3 data nodes and 3 AZs, what does that actually mean?
My guesses are:
I'll get 3 nodes with their own storage (EBS). One of the nodes is the master and the other 2 are replicas in different AZs: just copies of the master, so that data is not lost if the master node breaks.
I'll get 3 nodes with their own storage (EBS). All of them work independently and hold different data on their storage, so data can be processed by different nodes and stored on different volumes at the same time.
It looks like the other AZs should hold replicas, but then I don't understand why I see different amounts of free space on different nodes.
Please explain how this works.
Many thanks for any info or links.
I haven't used AWS Elasticsearch, but I've used the Cloud Elasticsearch service.
Using 3 AZs (availability zones) means that your cluster will be spread across 3 zones in order to make it resilient. If one zone has problems, then the nodes in that zone will have problems as well.
As the description section mentions, you need to specify multiples of 3 if you choose 3 AZs. If you have 3 nodes, then every AZ will have one node. If one zone has problems, then that node is out, and the two remaining nodes will have to pick up from there.
Now, in order to answer your question of what you actually get with this configuration: you can check for yourself. Run this via Kibana or any HTTP client:
GET _nodes
Check for the sections:
nodes.roles
nodes.attributes
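If you only want those two sections, you can trim the response with the generic filter_path parameter (standard Elasticsearch response filtering, nothing specific to AWS):
GET _nodes?filter_path=nodes.*.roles,nodes.*.attributes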
In the various documentation, blog posts, etc. you will see that for production usage, 3 nodes and 3 AZs are a good starting point for a resilient production cluster.
So let's take it step by step:
You need an odd number of master-eligible nodes in order to avoid the split-brain problem.
You need more than one node in your cluster in order to make it resilient (in case a node becomes unavailable).
By combining these two you get the minimum requirement of 3 nodes (no mention of zones yet).
But having one master and two data nodes will not cut it. You need to have 3 master-eligible nodes. Then if one node is out, the other two can still form a quorum and elect a new master, so your cluster stays operational with two nodes. But in order for this to work, you need to set your primary shards and replica shards in a way that any two of your nodes can hold your entire data.
Examples (for simplicity we have only one index):
1 primary, 2 replicas. Every node holds one shard which is 100% of the data
3 primaries, 1 replica. Every node will hold one primary and one replica (33% primary, 33% replica). Two nodes combined (which is the minimum to form a quorum as well) will hold all your data (and some more)
You can have more combinations but you get the idea.
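For instance, the "3 primaries, 1 replica" layout above could be created like this (the index name my-index is just a placeholder):
PUT my-index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}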
As you can see, the shard configuration needs to go along with your number and type of nodes (master-eligible, data only etc).
Now, if you add the availability zones, you take care of the problem of one zone being problematic. If your cluster were entirely in one zone (3 nodes in one zone), then if that zone had problems your whole cluster would be out.
If you set up one master node and two data nodes (which are not master-eligible), having 3 AZs (or even 3 nodes) doesn't do much for resiliency: if the master goes out, your cluster cannot elect a new one and it will be down until a master node is up again. For the same setup, if a data node goes out and your shards are configured with redundancy (meaning the two remaining nodes hold all the data between them), things will keep working fine.
Your questions should be covered by the following points.
If i specify 3 data nodes and 3-AZ, what it actually means?
This means that your data and replicas will be spread across 3 AZs, with no replica in the same AZ as its data node. Check this link. For example, when you ask for 2 data nodes in 2 AZs, DN1 will be placed in (let's say) AZ1 and its replica will be stored in AZ2; DN2 will be in AZ2 and its replica in AZ1.
It looks like the other AZs should hold replicas, but then I don't understand why I see different amounts of free space on different nodes.
It is because when you give your AWS Elasticsearch domain some amount of storage, the cluster divides the specified storage space among all data nodes. If you specify 100G of storage on a cluster with 2 data nodes, it divides the storage space equally across the data nodes, i.e. two data nodes with 50G of available storage space each.
Sometimes you will see more nodes than you specified for the cluster. It took me a while to understand this behaviour. The reason is that when you update these configs on AWS ES, it takes some time for the cluster to stabilize. So if you see more data or master nodes than expected, hold on for a while and wait for it to stabilize.
Thanks everyone for the help. To see how much space is available/allocated, run the following queries:
GET /_cat/allocation?v
GET /_cat/indices?v
GET /_cat/shards?v
So, if I create 3 nodes, I get 3 different nodes with separate storage; they are not replicas. Some data is stored on one node, some on another.

Does the cluster reject queries during a rolling restart?

Here is the post on rolling restarts:
https://www.elastic.co/guide/en/elasticsearch/guide/master/_rolling_restarts.html
Does it affect any queries running during this process? The process does not explicitly let the cluster know a node will be killed; it only stops syncing and rebalancing. Are existing queries rejected or retried?
There is another option,
"transient": {
  "cluster.routing.allocation.exclude._ip": "<ip-of-the-node-to-restart>"
}
(with the IP of the node you are about to restart as the value), which can 'disable' a node before restarting it and rebalance the data away from it. Is this approach better than the one described in the link?
It depends on your cluster configuration; you can avoid it.
If you have replicas and you are not querying the restarting node directly, you should be fine.
First of all, take a look at the note on the coordinating node:
Requests like search requests or bulk-indexing
requests may involve data held on different data nodes. A search
request, for example, is executed in two phases which are coordinated
by the node which receives the client request — the coordinating node.
In the scatter phase, the coordinating node forwards the request to
the data nodes which hold the data. Each data node executes the
request locally and returns its results to the coordinating node. In
the gather phase, the coordinating node reduces each data node’s
results into a single global resultset.
Every node is implicitly a coordinating node. This means that a node
that has all three node.master, node.data and node.ingest set to false
will only act as a coordinating node, which cannot be disabled. As a
result, such a node needs to have enough memory and CPU in order to
deal with the gather phase.
There could be different edge cases:
you have only one node in the cluster: requests will fail -- add more nodes
you have several nodes in the cluster and 0 replicas for your shards: if your query needs data from the restarting node, the request will partially fail -- have replicas
you have several nodes with replicas and you query the restarting node directly: the request will fail -- exclude restarting nodes in your application, or always query a dedicated coordinating node, which will take care of nodes leaving the cluster
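For reference, the procedure in the linked guide essentially boils down to temporarily disabling shard allocation around each node restart. A minimal sketch of those settings calls (the setting name is the standard allocation setting, not something specific to this guide):
PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}
(restart the node and wait for it to rejoin the cluster)
PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}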

How many masters in a three node cluster

I stumbled over the question of how many masters there can be in a three node cluster. I came across a point in an article on the internet saying that search and index requests are not to be sent to the elected master. Is that correct? So, if I have three nodes acting as master (out of which one is the elected master), should I point incoming logs to be indexed and searched at the other master nodes apart from the elected master? Please clarify. Thanks in advance.
In a three node cluster, all nodes most likely hold data and are master-eligible. That is the most simple situation in which you don't have to worry about anything else.
If you have a larger cluster, you can have a couple of nodes which are configured as dedicated master nodes. That is, they are master-eligible and they don't hold any data. For example you would have 3 dedicated master nodes and 7 data nodes (not master-eligible). Exactly one of the dedicated master nodes will always be the elected master.
The point is that since the dedicated master nodes don't hold data, they will not directly service index and search requests. If you send an index or search request to them, there is nothing they can do but delegate it to one of the 7 data nodes.
From the Elasticsearch Reference for Modules - Node:
dedicated master nodes are nodes with the settings node.data: false
and node.master: true. We actively promote the use of dedicated master
nodes in critical clusters to make sure that there are 3 dedicated
nodes whose only role is to be master, a lightweight operational
(cluster management) responsibility. By reducing the amount of
resource intensive work that these nodes do (in other words, do not
send index or search requests to these dedicated master nodes), we
greatly reduce the chance of cluster instability.
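In elasticsearch.yml terms, a dedicated master node as described in that quote would be configured roughly like this (the node name is a placeholder, and these are the pre-7.x setting names used in the quote):
node.name: master-1
node.master: true
node.data: false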
A related question is how many master-eligible nodes there should be in a cluster. The answer essentially is at least 3, in order to prevent split-brain (a situation in which, due to a network error, two masters are elected simultaneously).
The Elasticsearch Guide has a section on Minimum Master Nodes, an excerpt:
When you have a split brain, your cluster is at danger of losing data.
Because the master is considered the supreme ruler of the cluster, it
decides when new indices can be created, how shards are moved, and so
forth. If you have two masters, data integrity becomes perilous, since
you have two nodes that think they are in charge.
This setting tells Elasticsearch to not elect a master unless there
are enough master-eligible nodes available. Only then will an election
take place.
This setting should always be configured to a quorum (majority) of
your master-eligible nodes. A quorum is (number of master-eligible
nodes / 2) + 1. Here are some examples:
If you have ten regular nodes (can hold data, can become master), a
quorum is 6.
If you have three dedicated master nodes and a hundred data nodes, the quorum is 2, since you need to count only nodes that are master eligible.
If you have two regular nodes, you are in a conundrum. A quorum would be 2, but this means a loss of one node will
make your cluster inoperable. A setting of 1 will allow your cluster
to function, but doesn’t protect against split brain. It is best to
have a minimum of three nodes in situations like this.
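For the three node cluster in the question, the quorum is (3 / 2) + 1 = 2. In Elasticsearch versions before 7.x (where this setting still exists), you would apply it, for example, as a dynamic cluster setting:
PUT _cluster/settings
{
  "persistent": { "discovery.zen.minimum_master_nodes": 2 }
}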

Replica placement logic in Cassandra with multiple datacenters

When a write is performed with consistency EACH_QUORUM and a replication factor of 4 across 2 data centers DC1 and DC2 (3 replicas in DC1 and 1 in DC2), which class picks the nodes where the secondary and tertiary copies should reside? The snitch is GossipingPropertyFileSnitch and the strategy is NetworkTopologyStrategy. The client creates a new file using FileSystem.create and performs a write to it. The first copy will go to a node based on the token and the row key hash. Where do the second and third copies go in DC1 and in DC2?
The consistency level does not have anything to do with the placement strategy. It is simply how many nodes must report back to the coordinator before success or failure is reported back to the client.
Each DC places copies according to its replication factor independently. So in DC2, the only copy will be stored according to the partitioning function. In DC1, replica placement is done according to this document: http://www.datastax.com/docs/1.0/cluster_architecture/replication#networktopologystrategy
The NetworkTopologyStrategy determines replica placement independently
within each data center as follows:
The first replica is placed according to the partitioner (same as with
SimpleStrategy). Additional replicas are placed by walking the ring
clockwise until a node in a different rack is found. If no such node
exists, additional replicas are placed in different nodes in the same
rack. NetworkTopologyStrategy attempts to place replicas on distinct
racks because nodes in the same rack (or similar physical grouping)
can fail at the same time due to power, cooling, or network issues.
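For reference, the DC1:3 / DC2:1 layout from the question corresponds to a keyspace definition along these lines (the keyspace name is a placeholder, and the data center names must match what the snitch reports):
CREATE KEYSPACE my_keyspace
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3,
    'DC2': 1
  };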

What happens if an ElasticSearch node/index/shard gets corrupted

I'm new to ES. We've recently set up a 3 node Elasticsearch cluster for our Prod App. I just want to understand what would happen if an Elasticsearch node, index, or shard gets corrupted.
Thanks!
What would happen actually depends on how you have set up your ES cluster.
With respect to DATA
If you have a single-node cluster, a corruption would render your ES setup useless. You would, pretty much, need to set up everything from scratch.
If you have multiple nodes in your cluster, there can be the following scenarios:
If you configure a single node as the data node and that node goes down, you would still have the cluster running, but queries would not return any results. You would then need to re-configure a node to behave as a data node and restart the cluster.
If you have multiple nodes designated as data nodes, then a corruption/failure of a node will only affect that node. The rest of the nodes, and ES itself, will in essence perform as usual. The only effect is that the data stored on the corrupted node will obviously not be available. The shards on the corrupted node will become unassigned and have to be reassigned to some other data node.
If you have replicas enabled, then there will be no effect in terms of data loss. It would simply require the unassigned shards to be re-assigned to a new data node (if and when one is added).
It's best to have a multi-node cluster with at least 2 data nodes and replicas enabled to mitigate shard/data node corruption.
This Stackoverflow post explains shards and replicas in an excellent way.
Edit 1:
This is in response to your comment.
Default settings dictate that each node is master-eligible and also stores data; hence, each of your nodes can become master and will also store data.
Let's call the nodes A, B and C.
Initially, one of them will be elected as the master node, e.g. Node A.
Now if Node A goes down, one of the remaining nodes (B or C) will become the master. Queries will now only return results from data stored on Nodes B and C.
Check out this page for more insight into how the cluster works.
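To see which node is currently the elected master and whether the cluster is healthy after such a failover, you can, for example, run:
GET _cat/master?v
GET _cluster/health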
One way is to take incremental snapshots of your indices and restore from a snapshot when needed.
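A minimal sketch of that snapshot/restore flow, assuming a shared-filesystem repository; the repository name my_backup, the location and the snapshot name are placeholders:
PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": { "location": "/mnt/backups/my_backup" }
}
PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true
POST _snapshot/my_backup/snapshot_1/_restore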
