Elasticsearch architecture

Is there a way to sync multiple ES clusters with each other? The ES docs discourage having a single cluster span multiple data centers, so to avoid that I'd run a distinct ES cluster in each data center. I also need the same data indexed in each cluster.
One way to achieve that would be to send each document to every cluster, but issuing 'n' write requests per document seems wasteful. Additionally, if some write requests fail, the clusters could drift out of sync.
Is there a way for a cluster to "subscribe" to changes in another cluster? Or send the writes to a master cluster (whichever one is the closest to the data source) and let it eventually replicate to the other ones?
Edit: I've read about tribe nodes. The docs say they only work for reads and have some limitations. Is that something that would let me do this?

You can set up a custom shard allocation awareness strategy based on a data center attribute [1]. This ensures that a copy of each shard goes into each data center. Example:
cluster.routing.allocation.awareness.force.dc.values: dc1,dc2
cluster.routing.allocation.awareness.attributes: dc
[1] https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-cluster.html
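For this to work, each node also has to declare which data center it is in. A minimal sketch of the node-side setting, assuming ES 1.x-style configuration (the attribute name just has to match the awareness setting above):
# elasticsearch.yml on every node in the first data center
node.dc: dc1
# elasticsearch.yml on every node in the second data center
node.dc: dc2
Note that this keeps one cluster spanning both data centers rather than syncing two separate clusters.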

Related

ElasticSearch Cluster Design Help - Data Nodes

I have been reading up on ES cluster design and have started to design the cluster we need. Can someone please clarify some of the things that are still not clear to me?
So we want to start off with 3 servers.
At the beginning we will have all three acting as master, data and ingest nodes, with a minimum of two master-eligible nodes. This basically means we are sticking to the defaults.
Question 1: What are data nodes exactly? Is the full index replicated across the other data nodes? So if one goes down, in our case the third one should be promoted to master and the cluster should keep functioning.
I found this link, Shards and replicas in Elasticsearch, and it explains what data nodes are. So basically, if our index has 12 shards, ES might store 4 primary shards on each data node and 8 replicas. Is this correct?
Question 2: With this as a starting point, can we add more servers to function as data nodes, ingest nodes, etc.?
Question 3: We have set up a load balancer in front of the ES nodes; is this the recommended way of accessing ES clusters over port 9200? When ingesting, should this address be used, so that requests are routed to a random ingest node? When querying, should it route to a random ES node that can handle searches?
What are data nodes exactly?
They are the nodes that hold the shards, i.e. the actual data, on disk.
Is the full index replicated across the other data nodes?
Yes. Replicas give you availability as well; getting the concept of shards right is key to understanding this, so don't get confused.
in our case the third one should be promoted to master and the cluster should keep functioning.
Yes. Read about the green, yellow and red cluster statuses; in this case the cluster will turn from green to yellow, which means it is still functioning but action is required. Also read about "master eligibility" and, very important, how to avoid split brain: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html#master-node
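On the split-brain point, the classic guard on a three-node cluster is the minimum master nodes setting; a minimal sketch, assuming all three nodes are master-eligible and a pre-7.0 Elasticsearch (7.x elects masters automatically and removed this setting):
# elasticsearch.yml on every node: require 2 of the 3 master-eligible
# nodes to be visible before a master can be elected
discovery.zen.minimum_master_nodes: 2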
With this as a starting point, can we add more servers to function as data nodes, ingest nodes, etc.?
As many as you want. What is the app requirement? High read, low write? Vice versa? Equal? Define how you want to grow the cluster depending on the use case.
Question 3: We have set up a load balancer in front of the ES nodes; is this the recommended way of accessing ES clusters over port 9200? When ingesting, should this address be used, so that requests are routed to a random ingest node? When querying, should it route to a random ES node that can handle searches?
If it is, for instance, nginx, it works; I have done it. But have a clear understanding of the node roles: for example, a "coordinating node" handles parts of the request flow that some requests need and that nginx is not aware of.
IMO, now that you have the instances, it is a great opportunity to learn by doing and experiment with them: move the configs around, try to reproduce the problems your app might have, and see what happens. The aha! moments and the full grasp come from that.
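If you do split out roles later, they are just flags in elasticsearch.yml. A sketch, assuming ES 5.x/6.x-style settings (newer releases replace these with a single node.roles list):
# dedicated data node
node.master: false
node.data: true
node.ingest: false
# dedicated ingest node
node.master: false
node.data: false
node.ingest: true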

Elasticsearch coordinating node

We are new to Elasticsearch and beginning to set up a coordinating node for our UI client to query the index. We didn't really understand the difference between a master node and a coordinating node. Does the coordinating node have to be scaled up separately based on the site traffic? Will other nodes share the load?
The master node is responsible for managing the cluster topology. It neither indexes data nor participates in search tasks.
The data nodes are the real work horses of your ES cluster and are responsible for indexing data and running searches/aggregations.
Coordinating nodes (formerly called "client nodes") act as a kind of load balancer within your ES cluster. They are optional, and if you don't have any dedicated coordinating nodes, your data nodes take on the coordinating role. They don't index data; their main job is to distribute search tasks to the relevant data nodes (which they know where to find thanks to the master node) and gather all the results before aggregating them and returning them to the client application.
So depending on your cluster size, amount of data and SLA requirements, you might need to spawn one or more coordinating nodes in order to properly serve your clients. Without any real numbers, it is hard to advise anything at this point, but the above describes how each kind of node works.
If you're just beginning and don't have much data, you don't need any dedicated coordinating node, a simple data node is perfectly fine.
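For reference, a dedicated coordinating-only node is simply a node with every other role switched off; a minimal sketch, again assuming ES 5.x/6.x-style settings:
# coordinating-only node: no master, data or ingest duties;
# it only routes requests and merges the results
node.master: false
node.data: false
node.ingest: false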

About elasticsearch cluster

I need to provide many Elasticsearch instances for different clients, but hosted in my infrastructure.
For the moment these are only small instances.
I am wondering whether it would be better to build one big Elasticsearch cluster with 3-5 servers to handle all instances, where each client gets a different index in this cluster and each index is distributed over the servers.
Or maybe another idea?
Another question is about quorum: what is the quorum for ES, please?
thanks,
You don't have to assign each client to a different index; the Elasticsearch cluster will automatically share the load among all nodes that hold shards.
If you are not sure how many nodes are needed, start with a small cluster and keep monitoring the cluster's health status. Add more nodes if the server load is high; remove nodes if it is low.
As the cluster keeps growing, you may need to assign a dedicated role to each node. This gives you more control over the cluster and makes it easier to diagnose problems and plan resources: for example, add more master nodes to stabilize the cluster, more data nodes to increase search and indexing performance, and more coordinating nodes to handle client requests.
A quorum is defined as a majority of the master-eligible nodes in the cluster, as follows:
(master_eligible_nodes / 2) + 1
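For example, with 3 master-eligible nodes the quorum is (3 / 2) + 1 = 2 (integer division), so the cluster can keep electing a master as long as 2 master-eligible nodes can see each other.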

Elasticsearch one big cluster VS tribe node?

Problem descriptions:
- Multiple machines producing logs.
- On each machine we have logstash which filters the log files and sends them to a local elasticsearch
- We would like to keep the machines as separate as possible and avoid intercommunication
- But we would also like to be able to visualize all of these logs with a single Kibana instance
Approaches:
Make each machine a single-node ES cluster, and have one of the machines act as a tribe node with Kibana installed on it (of course avoiding index name conflicts)
Make all machines (nodes) part of a single cluster, with each node writing to a unique one-shard index and each shard statically mapped to its node, and of course one Kibana instance for the cluster
Question:
Which approach is more appropriate for the described scenario in terms of limiting inter-machine communication, cluster management, and maybe other aspects that I haven't thought about?
The tribe node exists precisely for this kind of requirement, so my advice is to use the tribe node setup.
With the second option:
There will be a cluster, but you will not use its benefits (replica shards, shard relocation, query performance, etc.)
The benefits mentioned above will instead become pain points that generate configuration complexity and troubleshooting hell.
Besides shard allocation and node communication, there will be other cluster-related things to configure on the nodes once they are in a single cluster.
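With the tribe setup, each machine just runs as its own self-contained single-node cluster. A minimal sketch of the per-machine elasticsearch.yml, assuming the ES 1.x line that tribe nodes belong to (the names are illustrative):
# machine 1, a self-contained single-node cluster
cluster.name: machine1-logs
node.name: machine1
# don't discover or join the other machines
discovery.zen.ping.multicast.enabled: false
# no second node to hold replicas, so keep them at zero
index.number_of_replicas: 0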

Elasticsearch - Limiting allocation of shards

I've read a number of articles / forums on the placing of indexes/shards but have not yet found a solution to my requirement.
Fundamentally, I want to use Logstash (+ Elasticsearch/Kibana) to build a globally distributed cluster, but I want to limit the placement of primary and replica shards to be local to the region they were created in to reduce WAN traffic, but I also want to be able to query all data as a single dataset.
Example
Let's say I have two ES nodes in UK (uknode1/uknode2), and two in US (usnode1/usnode2).
If Logstash sends some data to usnode1, I want it to place the replica on usnode2, and not send this across the WAN to the uknode* nodes.
I've tried playing around with index and routing allocation settings, but cannot stop the shards being distributed across all 4 nodes. It's slightly complicated by the fact that index names are dynamically built based on the "type", but that's another challenge for a later date. Even with one index, I can't work this out.
I could split this into two separate clusters but I want to be able to query all nodes as a single dataset (via Kibana) so I don't think that is a valid option at this stage as Kibana can only query one cluster.
Is this even possible to achieve?
The reason I ask whether this is possible is what would happen if I write to an index called "myTest" on a UK node and to the same index on a US node, as this is ultimately the same index and I'm not sure how ES would handle this.
So if anyone has any suggestions, or just to say "not possible", that would be very helpful.
It's possible, but not recommended. Elasticsearch needs a reliable data connection between the nodes in a cluster to function, which is difficult to ensure for a geographically distributed cluster. A better solution would be to have two clusters, one in the UK and another one in the US. If you need to search both of them at the same time you can use a tribe node.
Thanks. I looked into this a bit more and have the solution, which is indeed to use tribe nodes.
For anyone who isn't familiar with them, this is a new feature in ES 1.0.0+
What you do is set up a new ES node as a tribe node and configure it to connect to all your other clusters; when you run a query against it, it queries all the clusters and returns a consolidated set of results from all of them.
So in my scenario, I have two distinct clusters, one in each region, something like this:
US Region
cluster.name: us-region
Two nodes in this region called usnode1 and usnode2
Both nodes are master/data nodes
UK Region
cluster.name: uk-region
Two nodes in this region called uknode1 and uknode2
Both nodes are master/data nodes
Then you create another ES node and add some configuration to make it a tribe node.
Edit elasticsearch.yml with something like this:
# the tribe node itself holds no data and is not master-eligible
node.data: false
node.master: false
# allow write and metadata operations to pass through the tribe node
tribe.blocks.write: false
tribe.blocks.metadata: false
# t1/t2 are arbitrary local names for the two remote clusters to join
tribe.t1.cluster.name: us-region
tribe.t1.discovery.zen.ping.unicast.hosts: ["usnode1","usnode2"]
tribe.t2.cluster.name: uk-region
tribe.t2.discovery.zen.ping.unicast.hosts: ["uknode1","uknode2"]
You then point Kibana at the tribe node, and it works brilliantly - excellent feature.
Kibana dashboards still save, although I'm not sure yet how it picks which cluster to save them to, but this seems to address my question, so with a bit more playing I think I'll have it sorted.
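For completeness, pointing Kibana at the tribe node is just the usual Elasticsearch URL setting in kibana.yml; the exact key depends on the Kibana version (elasticsearch_url in 4.x, elasticsearch.url in 5.x/6.x), and the host name here is illustrative:
# kibana.yml
elasticsearch_url: "http://tribe-node:9200"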
