I have been reading up on ES Cluster design and have started to design the cluster we need. Please can someone clarify some of the things that are still not clear to me?
So we want to start off with 3 servers.
At the beginning, all three will be master, data and ingest nodes, with a minimum of two master nodes. This basically means we are sticking to the defaults.
Question 1: What are data nodes exactly? Is the full index replicated across the other data nodes? So if one goes down, in our case the third one should be promoted to master server and the cluster should function.
Found this link Shards and replicas in Elasticsearch and it explains what data nodes are. So basically if our index has 12 shards, it might be that ES will store 4 primary shards on each data node and 8 replicas. Is this correct?
Question 2: With this as starting point, can we add more servers to function as data nodes, ingest nodes etc.
Question 3: We have set up a load balancer in front of the ES nodes. Is this the recommended way of accessing an ES cluster over port 9200? When ingesting, should this address be used, so that requests are randomly routed to an ingest node? When querying, it should route to a random ES node that can handle searches.
What are data nodes exactly?
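Data nodes hold the shards on disk and do the indexing, search and aggregation work against them.
Is the full index replicated across the other data nodes?
Yes, as long as the index has replicas configured. Each replica shard is a copy of a primary shard and is placed on a different node, so replicas give you availability as well. Getting the concept of shards is key to understanding this without getting confused.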
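To make the shard arithmetic from the question concrete, here is a minimal sketch. It only counts shard copies and assumes Elasticsearch balances them evenly across data nodes, which it aims for but does not strictly guarantee. The function name `shard_counts` is made up for illustration.

```python
def shard_counts(primaries: int, replicas: int, data_nodes: int) -> dict:
    """Count shard copies and the average per data node, assuming even balancing."""
    # A replica is never allocated on the same node as its primary, so at most
    # (data_nodes - 1) replica copies of each primary can actually be assigned.
    assignable_replicas = min(replicas, data_nodes - 1)
    total_assigned = primaries * (1 + assignable_replicas)
    return {
        "primary_shards": primaries,
        "replica_shards": primaries * assignable_replicas,
        "unassigned_replicas": primaries * (replicas - assignable_replicas),
        "avg_shards_per_node": total_assigned / data_nodes,
    }


# 12 primaries on 3 data nodes, with one and with two replicas per primary.
one_replica = shard_counts(primaries=12, replicas=1, data_nodes=3)
two_replicas = shard_counts(primaries=12, replicas=2, data_nodes=3)
print(one_replica)
print(two_replicas)
```

Under these assumptions, the "4 primary shards and 8 replicas per node" layout from the question corresponds to two replicas per primary. With a single replica, each node would hold about 4 primaries and 4 replicas.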
in our case the third one should be promoted to master server and the cluster should function.
Yes. Read about the green, yellow and red statuses: in this case the cluster will turn from green to yellow, which means it is still functioning but needs attention. Also read about "master eligibility" and how to avoid split brain, which is very important. https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html#master-node
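As a rough illustration of what the three statuses mean, the sketch below derives a status from a list of shard copies. It is a simplified model, not how Elasticsearch computes health internally, and `cluster_status` is a hypothetical helper.

```python
def cluster_status(shards):
    """Derive a health colour from (is_primary, is_assigned) tuples.

    red    -> at least one primary shard is unassigned (some data unavailable)
    yellow -> all primaries assigned, but at least one replica is not
    green  -> every primary and every replica is assigned
    """
    shards = list(shards)
    if any(is_primary and not assigned for is_primary, assigned in shards):
        return "red"
    if any(not assigned for _, assigned in shards):
        return "yellow"
    return "green"


# Healthy cluster: one primary and one replica, both assigned.
healthy = [(True, True), (False, True)]
# After losing a node: the primary is still served, the replica has no home yet.
node_lost = [(True, True), (False, False)]

print(cluster_status(healthy), cluster_status(node_lost))
```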
With this as starting point, can we add more servers to function as data nodes, ingest nodes etc.
As many as you want. What is the app requirement? High read and low write? Vice versa? Equal? Decide how you want to grow the cluster based on the use case.
Question 3: We have set up a load balancer in front of the ES nodes. Is this the recommended way of accessing an ES cluster over port 9200? When ingesting, should this address be used, so that requests are randomly routed to an ingest node? When querying, it should route to a random ES node that can handle searches.
If it is, for instance, nginx, it works; I have done it. Make sure you have a clear understanding of the node roles, though: for example, the "coordinating node" handles parts of the process flow that some requests require and that nginx is not aware of.
IMO, now that you have the instances, it is a great opportunity to learn by doing and experiment with them: change the configs, try to reproduce the problems your app might have and see what happens. The aha moments, and a full grasp of the concepts, come from that.
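For intuition, this is roughly what a round-robin load balancer does when it spreads requests over the nodes. It is a toy sketch, not nginx configuration, and the node URLs are placeholders. Whichever node receives a request acts as the coordinating node for it.

```python
from itertools import cycle

# Hypothetical node addresses; replace with your own hosts.
NODES = [
    "http://es-node-1:9200",
    "http://es-node-2:9200",
    "http://es-node-3:9200",
]


def round_robin(nodes):
    """Return an endless iterator that hands out node URLs in rotation."""
    return cycle(nodes)


picker = round_robin(NODES)
# Four consecutive requests: the fourth wraps around to the first node.
first_four = [next(picker) for _ in range(4)]
print(first_four)
```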
Related
I'm currently running a single-node ES instance. As there are some limitations with a single-server setup in ES, and the queries are becoming pretty slow sometimes, I want to upgrade to a full cluster.
The ES-Instance currently only stores data, and is not doing any fancy stuff (Transformations, Ingest Pipelines, ...). All I currently need is a place to store my data at, and to retrieve it (Search + Aggregations). There are more reads than writes.
In a lot of forums and blog posts I read about the "Split-Brain" issue. To circumvent this, the minimum node count should be 3.
The idea is to keep the number of machines low, because this is a private project and I do not want to also manage a lot of OS installations in my spare time.
The structure I thought about was:
- 1 Coordinator + Voting-only Node
- 2 Master-eligible + Data Nodes
minimum_master_nodes: 2 to circumvent Split-Brains
Send all ES-Queries to the Coordinator, which will then issue the requests on the data nodes and reduce the final results.
My question is: Does this make sense? Or is it better to use 3 master-eligible + Data nodes?
Online I found no guidance for ES-Newbies to get an idea of the structure of a simple cluster.
You are heading in the right direction, and I can see most of your thinking is also right, so don't consider yourself an ES newbie :).
Since you are going to have 3 nodes in your cluster anyway, why not make all three master-eligible? And why a dedicated coordinating node, when by default every ES node works as a coordinating node? In a small project you won't need a dedicated one. This way you will have a simple configuration: just don't assign any explicit role to any node, as by default all ES nodes are master, data and coordinating nodes.
Also, you should invest some time in identifying the slow queries from the slow logs and their causes, to make searches more performant, rather than adding more resources, especially in a personal project. Please refer to my short tips on improving the search performance.
We are new to Elasticsearch and are beginning to set up a coordinating node for our UI client to query the index. We didn't really understand the difference between a master node and a coordinating node. Does the coordinating node have to be scaled up separately based on the site traffic? Will other nodes share the load?
The master node is responsible for managing the cluster state and topology. A dedicated master node neither indexes data nor participates in search tasks.
The data nodes are the real workhorses of your ES cluster and are responsible for indexing data and running searches/aggregations.
Coordinating nodes (formerly called "client nodes") act as a kind of load balancer within your ES cluster. They are optional, and if you don't have any dedicated coordinating nodes, your data nodes will act as the coordinating nodes. They don't index data; their main job is to distribute search tasks to the relevant data nodes (which they know where to find thanks to the master node) and gather all the results before aggregating them and returning them to the client application.
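The gather step described above can be sketched as follows. This is a simplified model of the idea, not Elasticsearch's actual implementation: each shard returns its own top hits and the coordinating node merges them into one global top list. The hit format and the function name `coordinate_search` are made up for illustration.

```python
import heapq


def coordinate_search(shard_results, size):
    """Merge per-shard top hits into one global top-`size` list, best score first."""
    # Flatten the hits returned by every shard, then keep the best `size` overall.
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(size, all_hits, key=lambda hit: hit["score"])


# Hypothetical responses from three shards living on different data nodes.
shard_results = [
    [{"id": "a", "score": 2.1}, {"id": "b", "score": 0.7}],
    [{"id": "c", "score": 3.4}],
    [{"id": "d", "score": 1.5}, {"id": "e", "score": 0.2}],
]

top3 = coordinate_search(shard_results, size=3)
print([hit["id"] for hit in top3])
```

The merge work grows with the number of shards and the requested result size, which is one reason busy clusters sometimes move it onto dedicated coordinating nodes.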
So depending on your cluster size, amount of data and SLA requirements, you might need to spawn one or more coordinating nodes in order to properly serve your clients. Without any real numbers, it is hard to advise anything at this point, but the above describes how each kind of node works.
If you're just beginning and don't have much data, you don't need any dedicated coordinating node, a simple data node is perfectly fine.
I need to provide many Elasticsearch instances for different clients, but hosted in my infrastructure.
For the moment these are only some small instances.
I am wondering whether it would be better to build one big Elasticsearch cluster with 3-5 servers to handle all instances, where each client gets a different index in this cluster and each instance is distributed over the servers.
Or maybe another idea?
And another question is about quorum, what is the quorum for ES please?
thanks,
You don’t have to assign each client to a different index. An Elasticsearch cluster will automatically share the load among all nodes that hold the shards.
If you are not sure how many nodes are needed, start with a small cluster and keep monitoring the health status of the cluster. Add more nodes to the cluster if the server load is high; remove nodes if the server load is low.
As the cluster keeps growing, you may need to assign a dedicated role to each node. This way you will have more control over the cluster, and it will be easier to diagnose problems and plan resources. For example, add more master nodes to stabilize the cluster, more data nodes to increase search and indexing performance, and more coordinating nodes to handle client requests.
A quorum is defined as a majority of the master-eligible nodes in the cluster, computed as follows (using integer division):
(master_eligible_nodes / 2) + 1
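A small sketch of the formula, with the division taken as integer division. The helper `tolerated_failures` is an addition for illustration: it shows how many master-eligible nodes can be lost while a majority can still be formed.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """How many master-eligible nodes can fail while a quorum remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


for n in (2, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

With two master-eligible nodes the quorum is 2, so losing either one leaves no majority. With three, one node can be lost, which is why three is the usual minimum.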
Is there a way to sync multiple ES clusters with each other? The ES docs discourage having a cluster that spans multiple data centers. So to avoid that I'd have distinct ES clusters in each datacenter. I also need to have the same data indexed in each cluster.
One way to achieve that would be to send each document to each cluster. But issuing 'n' write requests seems unnecessary. Additionally, if some write requests fail, the clusters could potentially go out of sync.
Is there a way for a cluster to "subscribe" to changes in another cluster? Or send the writes to a master cluster (whichever one is the closest to the data source) and let it eventually replicate to the other ones?
edit: I've read about tribe nodes. The docs say that it works just for reads and has some limitations. Is that something that would let me do this?
You can set up a custom shard allocation strategy based on a datacenter id [1]. This will ensure that one copy of each shard goes into each data center. Example:
cluster.routing.allocation.awareness.force.dc.values: dc1,dc2
cluster.routing.allocation.awareness.attributes: dc
[1] https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-cluster.html
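The two settings above can also be expressed as a request body for the cluster update settings API (`PUT _cluster/settings`). The sketch below only builds and prints that JSON body and sends nothing. Whether these settings can be changed at runtime on your version is an assumption to check against the docs, and `awareness_settings` is a hypothetical helper. Each node also has to be tagged with its own `dc` value; the name of that node-level setting depends on the ES version.

```python
import json


def awareness_settings(attribute, values):
    """Build a cluster-settings body enabling forced allocation awareness."""
    return {
        "persistent": {
            # Attribute that Elasticsearch uses to spread shard copies.
            "cluster.routing.allocation.awareness.attributes": attribute,
            # Forced awareness: the full set of expected attribute values.
            f"cluster.routing.allocation.awareness.force.{attribute}.values": ",".join(values),
        }
    }


body = awareness_settings("dc", ["dc1", "dc2"])
print(json.dumps(body, indent=2))
```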
I've read a number of articles / forums on the placing of indexes/shards but have not yet found a solution to my requirement.
Fundamentally, I want to use Logstash (+ Elasticsearch/Kibana) to build a globally distributed cluster, but I want to limit the placement of primary and replica shards to be local to the region they were created in to reduce WAN traffic, but I also want to be able to query all data as a single dataset.
Example
Let's say I have two ES nodes in UK (uknode1/uknode2), and two in US (usnode1/usnode2).
If Logstash sends some data to usnode1, I want it to place the replica on usnode2, and not send this across the WAN to the uknode* nodes.
I've tried playing around with index and routing allocation settings, but cannot stop the shards being distributed across all 4 nodes. It's slightly complicated by the fact that index names are dynamically built based on the "type", but that's another challenge for a later date. Even with one index, I can't work this out.
I could split this into two separate clusters but I want to be able to query all nodes as a single dataset (via Kibana) so I don't think that is a valid option at this stage as Kibana can only query one cluster.
Is this even possible to achieve?
The reason I ask if this is possible is what would happen if I write to an index called "myTest" on UK node, and the same index on a US node.....as this is ultimately the same index and I'm not sure how ES would handle this.
So if anyone has any suggestions, or just to say "not possible", that would be very helpful.
It's possible, but not recommended. Elasticsearch needs a reliable data connection between the nodes in the cluster to function, which is difficult to ensure for a geographically distributed cluster. A better solution would be to have two clusters, one in the UK and another one in the US. If you need to search both of them at the same time, you can use a tribe node.
Thanks. I looked into this a bit more and have the solution, which is indeed using tribe nodes.
For anyone who isn't familiar with them, this is a new feature in ES 1.0.0+
What you do is allocate a new ES node as a tribe node, and configure it to connect to all your other clusters, and when you run a query against it, it queries all clusters and returns a consolidated set of results from all of them.
So in my scenario, I have two distinct clusters, one in each region, something like this.
US Region
cluster.name: us-region
Two nodes in this region called usnode1 and usnode2
Both nodes are master/data nodes
UK Region
cluster.name: uk-region
Two nodes in this region called uknode1 and uknode2
Both nodes are master/data nodes
Then you create another ES node and add some configuration to make it a tribe node.
Edit elasticsearch.yml with something like this:
node.data: false
node.master: false
tribe.blocks.write: false
tribe.blocks.metadata: false
tribe.t1.cluster.name: us-region
tribe.t1.discovery.zen.ping.unicast.hosts: ["usnode1","usnode2"]
tribe.t2.cluster.name: uk-region
tribe.t2.discovery.zen.ping.unicast.hosts: ["uknode1","uknode2"]
You then point Kibana to the tribe node and it worked brilliantly - excellent feature.
Kibana dashboards still save, although I'm not sure yet how it picks which cluster to save to. This seems to address my question, so with a bit more playing I think I'll have it sorted.