Number of nodes AWS Elasticsearch - elasticsearch

I read documentation, but unfortunately I still don't understand one thing. While creating AWS Elasticsearch domain, I need to choose "Number of nodes" in "Data nodes" section.
If i specify 3 data nodes and 3-AZ, what it actually means?
I have suggestions:
I'll get 3 nodes with their own storages (EBS). One of node is master and other 2 are replicas in different AZ. Just copy of master, not to lose data if master node become broken.
I'll get 3 nodes with their own storages (EBS). All of them will work independent and on their storadges are different data. So at the same time data can be processed by different nodes and store on different storages.
It looks like in other AZ's should be replicas. but then I don't understand why I have different values of free space on different nodes
Please, explain how it works.
Many thanks for any info or links.

I haven't used AWS Elasticsearch, but I've used the Cloud Elasticsearch service.
When you use 3 AZ (availability zones), means that your cluster will use 3 zones in order to make it resilient. If one zone has problems, then the nodes in that zone will have problems as well.
As the description section mentions, you need to specify multiples of 3 if you choose 3 AZ. If you have 3 nodes, then every AZ will have one zone. If one zone has problems, then that node is out, the two remaining will have to pick up from there.
Now in order to answer your question. What do you get with these configurations. You can check so yourself. Use this via kibana or any HTTP client
GET _nodes
Check for the sections:
nodes.roles
nodes.attributes
In the various documentations, blog posts etc you will see that for production usage, 3 nodes and 3 AZ is a good starting point in order to have a resilient production cluster.
So let's take it step by step:
You need an even number of master nodes in order to avoid the split brain problem.
You need more than one node in your cluster in order to make it resilient (if the node is unavailable).
By combining these two you have the minimum requirement of 3 nodes (no mention of zones yet).
But having one master and two data nodes, will not cut it. You need to have 3 master-eligible nodes. So if you have one node that is out, the other two can still form a quorum and vote a new master, so your cluster will be operational with two nodes. But in order for this to work, you need to set your primary shards and replica shards in a way that any two of your nodes can hold your entire data.
Examples (for simplicity we have only one index):
1 primary, 2 replicas. Every node holds one shard which is 100% of the data
3 primaries, 1 replica. Every node will hold one primary and one replica (33% primary, 33% replica). Two nodes combined (which is the minimum to form a quorum as well) will hold all your data (and some more)
You can have more combinations but you get the idea.
As you can see, the shard configuration needs to go along with your number and type of nodes (master-eligible, data only etc).
Now, if you add the availability zones, you take care of the problem of one zone being problematic. If your cluster was as a whole in one zone (3 nodes in one node), then if that zone was problematic then your whole cluster is out.
If you set up one master node and two data nodes (which are not master eligible), having 3 AZ (or 3 nodes even) doesn't do much for resiliency, since if the master goes out, your cluster cannot elect a new one and it will be out until a master node is up again. Now for the same setup if a data node goes out, then if you have your shards configured in a way that there is redundancy (meaning that the two nodes remaining have all the data if combined), then it will work fine.

Your answers should be covered in following three points.
If i specify 3 data nodes and 3-AZ, what it actually means?
This means that your data and replica's will be available in 3AZs with none of the replica in same AZ as the data node. Check this link. For example, When you say you want 2 data nodes in 2 AZ. DN1 will be saved in (let's say) AZ1 and it's replica will be stored in AZ2. DN2 will be in AZ2 and it's replica will be in AZ1.
It looks like in other AZ's should be replicas. but then I don't understand why I have different values of free space on different nodes
It is because when you give your AWS Elasticsearch some amount of storage, the cluster divides the specified storage space in all data nodes. If you specify 100G of storage on the cluster with 2 data nodes, it divides the storage space equally on all data nodes i.e. two data nodes with 50G of available storage space on each.
Sometime you will see more nodes than you specified on the cluster. It took me a while to understand this behaviour. The reason behind this is when you update these configs on AWS ES, it takes some time to stabilize the cluster. So if you see more data or master nodes as expected hold on for a while and wait for it to stabilize.

Thanks everyone for help. To understand how much space available/allocated, run next queries:
GET /_cat/allocation?v
GET /_cat/indices?v
GET /_cat/shards?v
So, if i create 3 nodes, than I create 3 different nodes with separated storages, they are not replicas. Some data is stored in one node, some data in another.

Related

ElasticSearch Cluster Design Help - Data Nodes

I have been reading up on ES Cluster design and have started to design the cluster we need. Please can someone clarify some of the things that are still not clear to me?
So we want to start off with 3 servers.
At the beginning we will have all three as Master, Data and Ingest with minimum two master. This basically means, we are sticking to defaults.
Question 1 is - What are data nodes exactly? Is full index replicated across other data nodes? So if one goes down, in our case the third one should be promoted to master server and the cluster should function.
Found this link Shards and replicas in Elasticsearch and it explains what data nodes are. So basically if our index has 12 shards, it might be that ES will store 4 primary shards on each data node and 8 replicas. Is this correct?
Question 2: With this as starting point, can we add more servers to function as data nodes, ingest nodes etc.
Question 3: We have setup a load balancer in front of the ES nodes, is this the recommended way of accessing ES Clusters over 9200. When ingesting, should this address be used and it will randomly be routed to an ingest node. When querying it should route to a random ES node that can handle searches.
What are data nodes exactly?
Disks for the shards.
Is full index replicated across other data nodes?
Yes, replica means availability as well, getting the concept of shards is key to understand this and don't get confused.
in our case the third one should be promoted to master server and the cluster should function.
Yes, read about the green, yellow and red statuses, in this case, it will turn from green to yellow, it means is still functioning but actions required, but read about "master eligibility" and also, avoid split brain, very important. https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html#master-node
With this as starting point, can we add more servers to function as data nodes, ingest nodes etc.
as many as you want, what is the app requirement? high read low write? vice-versa? equals? define how do you want to grow the cluster depending on the use case.
Question 3: We have setup a load balancer in front of the ES nodes, is this the recommended way of accessing ES Clusters over 9200. When ingesting, should this address be used and it will randomly be routed to an ingest node. When querying it should route to a random ES node that can handle searches.
If it is, for instance, a nginx, it works because I have done it, have a clear understanding on the concept of the nodes roles, for example, the "coordinating node" would handle some process flow that some requests might require and nginx is not aware of.
IMO now that you have the instances, it is a great opportunity for you to learn-by-doing and experiment with them, so move the configs, try to reproduce the problems your app might have and see what happens, aha!moments will happen and full grasp is gotten here.

Multiple datacenter replication and local quorum?

I created a cluster from 6 nodes.
3 nodes in Eu west1 and 3 nodes in EU west2
I set the locality for every group of nodes like : --locality=region=europe,datacenter=west1
I also set the replica to 6 to have all ranges and all data on every node.
What will happen if the connection between data centers is lost the whole cluster goes down ?
I tried to kill 3 nodes in one of the datacenters and cluster is not operational because the majority of the nodes are down and quorum is less that 4.
Is it possible to make the 2 datacentes to work with their local quorum 2/3
I also played a bit with replications settings and sometimes cluster is healthy if I kill 3 nodes from 6 and was I was able to write to the cluster. Sometimes I can only read from the cluster. Cluster is working with replica of 5 and 3 nodes killed from 6. Still paying with this but if someone can give me more information will be very helpful.
To be able to replicate across datacentes is very cool feature but if I lost the whole cluster when one of the datacenters is down ruin the whole good idea at least for me.
CockroachDB requires a majority of replicas to be fully operational, which means > half, not >= half. In order to survive the loss of a full datacenter or region, you must have three DCs/regions, not two. Try running two nodes in each of three regions instead of three nodes in two regions.
Is it possible to make the 2 datacenters to work with their local quorum 2/3
Not for a single table (because it would be impossible to guarantee consistency if each datacenter were able to act in isolation from the other). You've configured the data to be replicated across all six replicas, which means four replicas are required to make a quorum. If you want each datacenter to be able to operate independently of the other, you would need two separate tables, with each one configured to be located within one of the datacenters.
Thanks for the answer just to clear few thing. But looks like you got my point and what I want to accomplish.
But as far as I understand if I have 2x3 node in 2 different DC's if one DC goes down. I have 3 live nodes for the quorum I need at least 4 . N/2 +1.
So if I have 3x3 I can lost one DC because if I have 2 DC's live I will have a quorum .
And one last question if I don't set replication to 9 if I loose 3 nodes some in one DC some ranges will be not available right ?

From how many nodes do you need dedicated master nodes

A question. Is there any recommandation from how many nodes that you need to use dedicated master nodes in a elasticsearch cluster?
My setup:
4 nodes: for non critical data (32GB ram) each. Can be the master node 3
3 nodes: for critial data (16GB ram) each.
Does the master nodes need the same memory requirement as the data nodes?
At a time you can have only one master node, but for availability you should have more than one master elegible by setting node.master
The master node is the only node in a cluster that can make changes to the cluster state. This mean that if your master node is rebooted or down then you will not be able to make any changes to your cluster.
Well at some point it is a bit hard to what is right or best practice because it always depends on many parameters.
With your setup i would better go with 3 nodes and up to 64 GB of memory per each node, other wise you are loosing some hits on communication between your 7 servers while they are not utilizing 100% of resources. Then all 3 nodes must be able to become master and set
discovery.zen.minimum_master_nodes: 2
This parameter is a bit important to avoid brain split when each node could become a master.
For you critical data you must use 1 replica to prevent lose of data.
Other option would be to make master only nodes and data only nodes.
So at some point minimum master nodes should be always 3 this will allow you to upgrade without downtime and make sure that you have always on setup.

How many master in three node cluster

I was stumbled at this question that how many masters can be there in a three node cluster. I came across this point in one of a article on internet that search and index requests are not to be sent to elected master. Is that correct? So , if i have three nodes acting as master(out of which one node is elected master) should i point out incoming logs to be indexed and searched onto other master nodes apart from elected master?Please clarify.Thanks in advance
In a three node cluster, all nodes most likely hold data and are master-eligible. That is the most simple situation in which you don't have to worry about anything else.
If you have a larger cluster, you can have a couple of nodes which are configured as dedicated master nodes. That is, they are master-eligible and they don't hold any data. For example you would have 3 dedicated master nodes and 7 data nodes (not master-eligible). Exactly one of the dedicated master nodes will always be the elected master.
The point is that since the dedicated master nodes don't hold data, they will not directly service index and search request. If you send an index or search request to them there's no other way for them than to delegate to one of the 7 data nodes.
From the Elasticsearch Reference for Modules - Node:
dedicated master nodes are nodes with the settings node.data: false
and node.master: true. We actively promote the use of dedicated master
nodes in critical clusters to make sure that there are 3 dedicated
nodes whose only role is to be master, a lightweight operational
(cluster management) responsibility. By reducing the amount of
resource intensive work that these nodes do (in other words, do not
send index or search requests to these dedicated master nodes), we
greatly reduce the chance of cluster instability.
A related question is how many master nodes there should be in a cluster. The answer essentially is at least 3 in order to prevent split-brain (a situation when due to a network error, two masters are elected simultaneously).
The Elasticsearch Guide has a section on Minimum Master Nodes, an excerpt:
When you have a split brain, your cluster is at danger of losing data.
Because the master is considered the supreme ruler of the cluster, it
decides when new indices can be created, how shards are moved, and so
forth. If you have two masters, data integrity becomes perilous, since
you have two nodes that think they are in charge.
This setting tells Elasticsearch to not elect a master unless there
are enough master-eligible nodes available. Only then will an election
take place.
This setting should always be configured to a quorum (majority) of
your master-eligible nodes. A quorum is (number of master-eligible
nodes / 2) + 1. Here are some examples:
If you have ten regular nodes (can hold data, can become master), a
quorum is 6.
If you have three dedicated master nodes and a hundred data nodes, the quorum is 2, since you need to count only nodes that are master eligible.
If you have two regular nodes, you are in a conundrum. A quorum would be 2, but this means a loss of one node will
make your cluster inoperable. A setting of 1 will allow your cluster
to function, but doesn’t protect against split brain. It is best to
have a minimum of three nodes in situations like this.

elasticsearch cluster setup information

I'm newbie to search and elasticsearch. I have gone some online docs and developed some app using elasticsearch setup in our test environment. So far, its smooth in developing and testing, Now do create in production and setup the cluster, i need some expert advise on,
Number of shards
Number of replicas
Should i need to separate out master and data nodes
can all the nodes be data node
i dont have any advanced search use case, but atleast need plural match (phone) should match all docs with phones and vice versa, any special stemming need in this case ?
My usecase and traffic patterns are,
Upto 100M read per day
Upto 1M write/update per day
Initial data size 10GB, grow rate 1 GB every 6 months
Cluster info
1. Initial cluster size 14 machines, 28 GB RAM / 120 GB spin hard disk / 12 cores
2. load balancer with dns, would distribute the traffic to any 14 machines.
I have used unicast and i have bootstrap.mlockall: true and index.routing.allocation.disable_allocation: false
Please advise.
Thanks
1. Number of shards
The number of shards in Elasticsearch is a one-time setting, once your shard size is set you cannot change it. So you need to plan how many shards are required for your cluster taking into consideration your current dataset size plus any index growth. To do this set up one Elasticsearch node with one shard and zero replicas on a box that has the same specifications as your production boxes.
The capacity of a single shard will depend on a number of factors:
The size of your documents
The size of your fields
The amount a RAM you assign the JVM that runs Elasticsearch. If you have lots of aggregations,
sorting and parent/child documents, you will need to make sure that you have assigned enough RAM
to Elasticsearch so it can cache the results.
Your number of queries per second requirement.
The maximum search request response time allowed.
Index documents into your single shard node at iterations of x million (or less), at each iteration perform benchmarks by executing x queries per second using a testing tool like JMeter. When the queries in your tests are returning response times that are reaching your maximum search request time you have the amount of documents a single shard can index. Once you have this value you can calculate the number of shards that is required for your full dataset and calculate how many shards you will need for index growth.
2. Number of replicas
Start with 1 replica, a replica shard will be placed on a different node from its primary shard so if one node goes down you still have the full dataset available. One replica is usually sufficient, if you find you need more you can always add them in later on.
3.Should i need to separate out master and data nodes
It depends on the size of your cluster, if you have more than 5 nodes in your cluster it is advisable to have master only nodes to maintain cluster state only.
4. can all the nodes be data node
There must always be at least one master node in your cluster, the master node maintains the cluster state. If you have a small cluster (< 5 nodes), you can make every node in your cluster both a data node and a master node. One of of the nodes will be elected as the master, if the master node goes down another node in the cluster will be elected as the master. If you have master only nodes as described in point 3, the rest of the of the nodes in the cluster can be data only nodes.
5. i dont have any advanced search use case, but atleast need plural match (phone) should match all docs with phones and vice versa, any special stemming need in this case ?
Yes, stemming will handle your use case.
Also, Elasticsearch comes with very good configurations OOTB, you should start out by only changing the configurations listed in the link below.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_important_configuration_changes.html

Resources