Elasticsearch - Limiting allocation of shards

Elasticsearch - Limiting allocation of shards - elasticsearch

I've read a number of articles / forums on the placing of indexes/shards but have not yet found a solution to my requirement.
Fundamentally, I want to use Logstash (+ Elasticsearch/Kibana) to build a globally distributed cluster, but I want to limit the placement of primary and replica shards to be local to the region they were created in to reduce WAN traffic, but I also want to be able to query all data as a single dataset.
Example
Let's say I have two ES nodes in UK (uknode1/uknode2), and two in US (usnode1/usnode2).
If Logstash sends some data to usnode1, I want it to place the replica on usnode2, and not send this across the WAN to the uknode* nodes.
I've tried playing around with index and routing allocation settings, but cannot stop the shards being distributed across all 4 nodes. It's slightly complicated by the fact that index names are dynamically built based on the "type" but that's another challenge for a later date. Even with one index, I can't work this it.
I could split this into two separate clusters but I want to be able to query all nodes as a single dataset (via Kibana) so I don't think that is a valid option at this stage as Kibana can only query one cluster.
Is this even possible to achieve?
The reason I ask if this is possible is what would happen if I write to an index called "myTest" on UK node, and the same index on a US node.....as this is ultimately the same index and I'm not sure how ES would handle this.
So if anyone has any suggestions, or just to say "not possible", that would be very helpful.

It's possible, but not recommended. Elasticsearch needs reliable data connection between nodes in the cluster to function, which is difficult to ensure for geographically distributed cluster. A better solution would be to have two clusters, one in UK and another one in US. If you need to search both of them at the same time you can use tribal node.

Thanks. I looked into this a bit more and have the solution which is indeed using tribal nodes.
For anyone who isn't familiar with them, this is a new feature in ES 1.0.0+
What you do is allocate a new ES node as a tribe node, and configure it to connect to all your other clusters, and when you run a query against it, it queries all clusters and returns a consolidated set of results from all of them.
So in my scenario, I have two distinct clusters, one in each region something this.
US Region
cluster.name: us-region
Two nodes in this region called usnode1 and usnode2
Both nodes are master/data nodes
UK Region
cluster.name: uk-region
Two nodes in this region called uknode1 and uknode2
Both nodes are master/data nodes
The you create another ES node and add some configuration to make it a Tribe node.
Edit elasticsearch.yml with something like this :
node.data: false
node.master: false
tribe.blocks.write: false
tribe.blocks.metadata: false
tribe.t1.cluster.name: us-region
tribe.t1.discovery.zen.ping.unicast.hosts: ["usnode1","usnode2"]
tribe.t2.cluster.name: uk-region
tribe.t2.discovery.zen.ping.unicast.hosts: ["uknode1","uknode2"]
You then point Kibana to the tribe node and it worked brilliantly - excellent feature.
Kibana dashboards still save, although I'm not sure how it picks which cluster to save to yet but seems to address my question so a bit more playing and I think it I'll have it sorted.

Related

How Elasticsearch determines which node in cluster to query

I have two ES nodes (in a cluster) in different locations and I would like to determine my application to use the nearest one to avoid network latency.
I have set up Forced Shard Allocation Awareness to make every of these nodes "independent" (containing Primary shard or Replica of another Primary but never both of them) like
cluster.routing.allocation.awareness.force.my_attr.values: my_attr_val1, my_attr_val2
cluster.routing.allocation.awareness.attributes: my_attr
Now I know I can force my query to be run on specific node by adding to the query Preference like:
_only_nodes:my_attr:my_attr_val1
but as far as I understand it would fail in case of node failure - so basically I don't want to do this.
What I can do is to set
_prefer_nodes:my_attr:my_attr_val1
or to just do nothing and let ES do it's job. The question is - is ES choosing "the best" (let say the nearest) node to query or it just takes random one? How Elasticsearch determines the node to ask?
The version of my ES is 5.5.0

The behavior in 5.5, if you don't set the preference, is to route the request to allocated shards in a round-robin fashion. ES doesn't have "nearest node" system.

ElasticSearch Cluster Design Help - Data Nodes

I have been reading up on ES Cluster design and have started to design the cluster we need. Please can someone clarify some of the things that are still not clear to me?
So we want to start off with 3 servers.
At the beginning we will have all three as Master, Data and Ingest with minimum two master. This basically means, we are sticking to defaults.
Question 1 is - What are data nodes exactly? Is full index replicated across other data nodes? So if one goes down, in our case the third one should be promoted to master server and the cluster should function.
Found this link Shards and replicas in Elasticsearch and it explains what data nodes are. So basically if our index has 12 shards, it might be that ES will store 4 primary shards on each data node and 8 replicas. Is this correct?
Question 2: With this as starting point, can we add more servers to function as data nodes, ingest nodes etc.
Question 3: We have setup a load balancer in front of the ES nodes, is this the recommended way of accessing ES Clusters over 9200. When ingesting, should this address be used and it will randomly be routed to an ingest node. When querying it should route to a random ES node that can handle searches.

What are data nodes exactly?
Disks for the shards.
Is full index replicated across other data nodes?
Yes, replica means availability as well, getting the concept of shards is key to understand this and don't get confused.
in our case the third one should be promoted to master server and the cluster should function.
Yes, read about the green, yellow and red statuses, in this case, it will turn from green to yellow, it means is still functioning but actions required, but read about "master eligibility" and also, avoid split brain, very important. https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html#master-node
With this as starting point, can we add more servers to function as data nodes, ingest nodes etc.
as many as you want, what is the app requirement? high read low write? vice-versa? equals? define how do you want to grow the cluster depending on the use case.
Question 3: We have setup a load balancer in front of the ES nodes, is this the recommended way of accessing ES Clusters over 9200. When ingesting, should this address be used and it will randomly be routed to an ingest node. When querying it should route to a random ES node that can handle searches.
If it is, for instance, a nginx, it works because I have done it, have a clear understanding on the concept of the nodes roles, for example, the "coordinating node" would handle some process flow that some requests might require and nginx is not aware of.
IMO now that you have the instances, it is a great opportunity for you to learn-by-doing and experiment with them, so move the configs, try to reproduce the problems your app might have and see what happens, aha!moments will happen and full grasp is gotten here.

Which Elasticsearch node is better configured in Logstash Elasticsearch output plugin and Kibana

I have ELK stack with Elasticsearch, Logstash and kibana installed on 3 different instances.
Now I want to make 3 node cluster of Elasticsearch.
I will make one node as master and 2 data nodes.
I want to know in logstash config
elasticsearch {
hosts => "http://es01:9200"
Which address I need to enter there master node or data node. and also if I have 3 master nodes then which address I need to write there.
similarly in kibana , I use
elasticsearch.url: es01:9200
In cluster env which url I need to use?

In general, the answer depends on your cluster data size and load.
Nevertheless, I'll try to answer your questions assuming the master node is not a data eligible node as well. This means it only takes care for cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. For this purposes, it is very recommended to have your master node as stable and less loaded as possible.
So, in your logstash config I would put the addresses of your two data nodes as follows:
elasticsearch{
hosts => ["http://es01:9200", "http://es02:9200"]
}
This confirmation maximize performance and fault tolerance as your master do not contain data and if one node failes it will continue to work with the other.
Please note that it is very recommended to have at least 3 master eligible nodes configured in Elasticsearch clusters since if you are loosing the (only) master node you loose data. 3 is to avoid split brain
Regarding kibana, since all nodes in the cluster "knows" each other. You basically can put any address in the cluster. But, for the same reasons as above it is recommended to fill one of your data nodes addresses.
For further reading, please refer to this documentation.
Hope I have managed to help!

Elasticsearch architecture

Is there a way to sync multiple ES clusters with each other? The ES docs discourage from having a cluster spanning multiple data centers. So to avoid that I'd be having distinct ES clusters in each datacenter. I also need to have the same data indexed in each cluster.
One way to achieve that would be to send each document to each cluster. But issuing 'n' write requests seems unnecessary. Additionally, if some write requests fail, the clusters could potentially go out of sync.
Is there a way for a cluster to "subscribe" to changes in another cluster? Or send the writes to a master cluster (whichever one is the closest to the data source) and let it eventually replicate to the other ones?
edit: I've read about tribe nodes. The docs say that it works just for reads and has some limitations. Is that something that would let me do this?

You can set up custom routing/allocation strategy on datacenter id [1]. This will ensure that one replica of the shard goes into each data center. Example
cluster.routing.allocation.awareness.force.dc.values: dc1,dc2
cluster.routing.allocation.awareness.attributes: dc
[1] https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-cluster.html

Understanding Elastic Search

Sorry to say this but ES' documentation ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html ) is confusing me.
Thanks to the glossary I understand the terms for database, table and row but I have read substantial sections of the documentation and I cannot find answers to:
Why do I need do to add number_of_shards and number_of_replicas to index creation? I did look here http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html but ironically it leaves those two settings out :/
How can I have 3 shards with 2 replicas? If the glossary is anything to go by shouldn't that be impossible considering that a shard is "is a single Lucene instance"?
If I add more nodes later how can I change these values to span the new nodes?
How does sharding work in ES?
How does replica sets work in ES?
How can I manage sharding? I understand it is auto join ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name ) but how do I define the difference between replicas and shards?
How can I manage replica sets? I.e. how do I add replicas, promote primaries etc?
For reference I read these links first:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html
If that information exists in the documentation then I would be very grateful if you can point me towards it.
Edit:
I am also unsure how auto-discovery works on a distributed network. Short if pinging every public network around how does it connect to the right one that could possibly be on the other side of the world?

Please see below for answers to your points.
Why do I need do to add number_of_shards and number_of_replicas to index creation? I did look here
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html
but ironically it leaves those two settings out :/
You don't "have" to, but you probably should in especially will want to in production. The default is five shards and one replication.
The number of replications defined is just the number of times your entire index is replicated throughout all of the nodes in your elasticsearch cluster. Think of it as being multiple read copies of a RDBMS database (but in this case, we read and write all copies).
A shard is the number of times I split up, or shard, an index. So, I can have an index with a single shard, or I can have an index with multiple shards. This is similar in concept to sharding a RDBMS database by primary key, but not identical.
So, the total number of shards you will have in an index is the product of number_of_shards and number_of_replicas.
When you do a search, elasticsearch will distribute your search to all possible nodes containing the shards in your index and aggregagate the result for you. You can think of this as a map/ reduce where the map is sending the search out to each shard and the reduce is collecting the results.
Also, you can change the replication number_of_replicas at any time, but you can never change the number_of_shards. This must be set at index creation.
How can I have 3 shards with 2 replicas? If the glossary is anything to go by shouldn't that be impossible considering that a
shard is "is a single Lucene instance"?
I think the above mostly answers this, but it's important to remember that elasticsearch is primarily a distributed computing solution to search. We are splitting the work up to multiple shards and possibly machines.
If I add more nodes later how can I change these values to span the new nodes?
Once the cluster is aware of another node in the cluster, no other action is needed by you. The settings propagate throughout the cluster on their own. In your above example of three shards and two replicas, if you had two nodes initially and added a third, each node will have on average two shards per node, this shard movement happens without your intervention (again, provided the cluster is aware of the new node)
How does sharding work in ES?
See above
How does replica sets work in ES?
See above
How can I manage sharding? I understand it is auto join ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name ) but how do I define the difference between replicas and shards?
You don't have to "manage" it actively. As stated earlier, sharding and everything else you define at index creation, is propagated to new nodes within the cluster.
You define replicas and shards like this:
{
"settings": {
"index": {
"number_of_shards": 20,
"number_of_replicas": 1
}
},
"mappings": {
"some_type": {
"properties": {
"some_field": {
"type": "long"
}
}
}
}
}
How can I manage replica sets? I.e. how do I add replicas, promote primaries etc?
You do that through the update indices API, documentation for this specific case is found on there site here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html
I just noticed your edit, please see below:
I am also unsure how auto-discovery works on a distributed network.
In the YML config file you set the unicast like this:
discovery.zen.ping.multicast.enabled: false
#discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.unicast.hosts: ["ip.add.r.ess", "ip.add.r.ess"]
The middle setting is an important setting, but I commented it out here. That number should always be number of (master nodes / 2) +1. This is to avoid split brain situations. Generally I set all nodes to master eligible.
These settings are for unicast, which is what I think you are going for with your question and not multicast.

In short, an index is broken into shards. Shards can be replicated, meaning multiple copies of the same shard can exist in the same cluster. So if an index has 3 shards and 2 replica's, that means you have nine shards in total of which six are replicas of the three master shards.
ES, will try to balance shards and their replica's across the cluster so that if a node goes down it can fail over from the master shards on that node to replicas. This can confuse some people: a master in elastic search refers to shards, not the actual node. So a single node can have a mix of replica's and master shards.
If you come from the lucene world, a lucene index is not the same thing as an elastic search index. An elastic search index is a logical group of indexed documents with types, mappings and documents. More or less the same as a database schema. A lucene index on the other hand is a group of several files that contains indexed data. When Elastic search creates indexes, what it does is create several lucene indexes (one for each field and shard) and when it replicates, it is basically copying the files of these lucene indices around.
You can't change the number of shards for an index but you can change the number of replicas. Typically what you do when you need to have more shards is create a new index and reindex the data.
In terms of shard management beyond deciding on the number of shards, there's not much to manage by default and ES is pretty good coordinating things by itself, There are a ton of options you can fiddle with once you gain a bit better understanding of how it works. Defaults are pretty OK for most. In terms of cluster management, you can do a lot via the API in terms of shutting down nodes in a controlled way, using index aliases, changing number of replica's, etc.
As for autodiscovery, ES uses local network multicast by default. You can switch to unicast and you probably want to change the default clustername to prevent accidents (had some fun in coffeeshops with unintended clusters forming). You probably don't want to cluster globally. I don't see that ending well.

It's a quite incident that about 80% of your questions are answered in the Video Presentation given by Shay Banon (The creater of ElastiSearch). Though this presentation has much more than you can find anywhere else. Hope this helps.
http://www.infoq.com/presentations/ElasticSearch
This video is a bit low-resolution, so if you want code shown in presentation follow this
https://github.com/kimchy/talks/tree/master/2011/wsnparis

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio