How Elasticsearch determines which node in cluster to query - elasticsearch

I have two ES nodes (in a cluster) in different locations and I would like my application to use the nearest one to avoid network latency.
I have set up Forced Shard Allocation Awareness to make each of these nodes "independent" (holding either a primary shard or the replica of another primary, but never both), like:
cluster.routing.allocation.awareness.force.my_attr.values: my_attr_val1, my_attr_val2
cluster.routing.allocation.awareness.attributes: my_attr
Now I know I can force my query to run on a specific node by adding a preference to the query, like:
_only_nodes:my_attr:my_attr_val1
but as far as I understand, this would fail in case of node failure - so basically I don't want to do this.
What I can do instead is set
_prefer_nodes:my_attr:my_attr_val1
or just do nothing and let ES do its job. The question is: does ES choose "the best" (say, the nearest) node to query, or does it just pick a random one? How does Elasticsearch determine which node to ask?
My ES version is 5.5.0.

The behavior in 5.5, if you don't set a preference, is to route the request to allocated shards in a round-robin fashion. ES doesn't have a "nearest node" system.
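For illustration, here is a minimal sketch of how the preference string from the question could be attached to a search request URL. The host and index name are hypothetical; only the query-string mechanics are shown, reusing the `_prefer_nodes` value from the question:

```python
# Sketch: attaching a preference to a _search request so shards on the
# "near" node are preferred, with fallback to other nodes if it is down.
# Host and index name are hypothetical placeholders.

def search_url(host, index, preference):
    """Build a _search URL carrying the preference parameter."""
    return f"http://{host}:9200/{index}/_search?preference={preference}"

url = search_url("localhost", "my_index", "_prefer_nodes:my_attr:my_attr_val1")
# send a GET/POST to this URL with your query body
```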

Related

Querying specific Elasticsearch Node - Do both do the same or not?

I have a 2-node Elasticsearch cluster with IP addresses xx.xx.xx.17 (master) and xx.xx.xx.18 (data). I know this is the documented way of searching on a preferred replica/node.
The question is: if I send my request targeting the xx.xx.xx.18 (data) node (for example, http://xx.xx.xx.18:9200/product/_count), will the request query that specific node?
Or is sending it with the 'preference' parameter, as in the above link, the only way of querying a preferred node?
When you send a query to an Elasticsearch node, it will talk to any and all other nodes that hold data for the indices being queried. If you have replicas assigned to indices, Elasticsearch will randomly pick between the primary and (n) replica shards.
Assuming each of your nodes holds a full copy of every shard, either primary or replica, this means you might get your response entirely from shards on that node, or not - which is what LeBigCat hinted at above.
However, you can use preference here, yes. But it's not clear what problem you are trying to solve with this.

setting up a basic elasticsearch cluster

I'm new to Elasticsearch and would like someone to help me clarify a few concepts.
I'm designing a small cluster with the following requirements:
everything should still work when restarting one of the machines, one at a time (eg: OS updates)
a single disk failure is ok
heavy indexing should not impact query performance
How many master, data, and ingest nodes should I have?
Or do I need 2 clusters?
The indexing workload is purely indexing structured text documents, with no processing/rules... do I even need an ingest node?
Also, does each node have a complete copy of all the data, or does only the cluster as a whole have a complete copy?
Be sure to read the documentation about Elasticsearch terminology at the very least.
With the default of 1 replica (primary shard and one replica shard) you can survive the failure of 1 Elasticsearch node (failed disk, restart, upgrade,...).
"heavy indexing should not impact query performance": You'll need to size your cluster correctly to handle both the indexing and searching. If you want to read current data and you do heavy updates, that will take up resources and you won't be able to fully decouple it.
By default every node is a data, ingest, and master-eligible node. The minimum for an HA setup is 3 nodes. If you don't use ingest that's fine; it won't take up resources when you're not using it.
To understand which node has which data, you need to read up on the concept of shards. Basically every index is broken up into 1 to N shards (current default is 5) and there is one primary and one replica copy of each one of them (by default).
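As a concrete sketch of those per-index settings, the body below shows the shape of a create-index request fixing the shard and replica counts described above. The index name and counts are arbitrary examples:

```python
# Sketch: the settings body sent when creating an index, fixing the
# primary shard count (immutable after creation) and the replica count
# (changeable later). Index name and counts are arbitrary examples.
import json

settings = {
    "settings": {
        "number_of_shards": 5,    # cannot be changed after creation
        "number_of_replicas": 1,  # can be updated dynamically
    }
}
body = json.dumps(settings)
# PUT this body to http://<host>:9200/<index> to create the index
```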

Shards and replicas elastic search

Suppose I didn't set any replicas at index creation time, and then used the update settings API to change the replica count to 1. With 2 nodes, the replica should be created on the second node, because a replica will never be created on the same node as its primary. However, the cluster status is showing yellow: the shards are not allocating to node2 even though we set replicas to 1.
Can anyone tell me why the replica shard is not allocating to node2?
On cluster startup the nodes show that they detect and join each other.
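For reference, the settings change described in the question can be sketched as follows (the index name is hypothetical; only the request body shape is shown):

```python
# Sketch: the update-settings request body that raises the replica count
# to 1 on an existing index. Index name is a hypothetical placeholder.
import json

body = json.dumps({"index": {"number_of_replicas": 1}})
# PUT this body to http://<host>:9200/my_index/_settings
```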
Here are the basic concepts of Elasticsearch:
Basic Concepts
There are a few concepts that are core to Elasticsearch. Understanding these concepts from the outset will tremendously help ease the learning process.
Near Realtime (NRT)
Elasticsearch is a near real time search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.
Cluster
A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.
Make sure that you don’t reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. For instance you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.
Note that it is valid and perfectly fine to have a cluster with only a single node in it. Furthermore, you may also have multiple independent clusters each with its own unique cluster name.
Node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.
A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch which means that if you start up a number of nodes on your network and—assuming they can discover each other—they will all automatically form and join a single cluster named elasticsearch.
In a single cluster, you can have as many nodes as you want. Furthermore, if there are no other Elasticsearch nodes currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.
Index
An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
In a single cluster, you can define as many indexes as you want.
Type
Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
Document
A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation), which is a ubiquitous internet data interchange format.
Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
Shards & Replicas
An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.
Sharding is important for two primary reasons:
It allows you to horizontally split/scale your content volume
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput
The mechanics of how a shard is distributed and also how its documents are aggregated back into search requests are completely managed by Elasticsearch and is transparent to you as the user.
In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.
Replication is important for two primary reasons:
It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.
To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically anytime but you cannot change the number of shards after-the-fact.
By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.
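That arithmetic can be sketched as:

```python
# Sketch: total shard count per index under the 5.x defaults above.
primaries = 5
replicas = 1  # one full replica copy of each primary shard
total_shards = primaries * (1 + replicas)
# 5 primaries + 5 replica shards = 10 shards per index
```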
Note:
Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards api.

elasticsearch cluster setup information

I'm new to search and Elasticsearch. I have gone through some online docs and developed an app using an Elasticsearch setup in our test environment. So far it has been smooth in development and testing. Now, to set up the cluster in production, I need some expert advice on:
Number of shards
Number of replicas
Should I separate out master and data nodes?
Can all the nodes be data nodes?
I don't have any advanced search use case, but I at least need plural matching: (phone) should match all docs with phones and vice versa. Is any special stemming needed in this case?
My use case and traffic patterns are:
Up to 100M reads per day
Up to 1M writes/updates per day
Initial data size 10 GB, growth rate 1 GB every 6 months
Cluster info
1. Initial cluster size: 14 machines, 28 GB RAM / 120 GB spinning hard disk / 12 cores
2. A load balancer with DNS distributes the traffic to any of the 14 machines.
I have used unicast, and I have bootstrap.mlockall: true and index.routing.allocation.disable_allocation: false.
Please advise.
Thanks
1. Number of shards
The number of shards in Elasticsearch is a one-time setting: once your shard count is set, you cannot change it. So you need to plan how many shards are required for your cluster, taking into consideration your current dataset size plus any index growth. To do this, set up one Elasticsearch node with one shard and zero replicas on a box that has the same specifications as your production boxes.
The capacity of a single shard will depend on a number of factors:
The size of your documents
The size of your fields
The amount of RAM you assign to the JVM that runs Elasticsearch. If you have lots of aggregations, sorting, and parent/child documents, you will need to make sure that you have assigned enough RAM to Elasticsearch so it can cache the results.
Your number of queries per second requirement.
The maximum search request response time allowed.
Index documents into your single-shard node in iterations of x million (or less); at each iteration, run benchmarks by executing x queries per second using a testing tool like JMeter. When the queries in your tests return response times that reach your maximum search request time, you have the number of documents a single shard can hold. Once you have this value, you can calculate the number of shards required for your full dataset, plus how many more you will need for index growth.
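That calculation can be sketched as follows; all numbers are hypothetical placeholders, with the per-shard capacity coming from the benchmark described above:

```python
# Sketch: turning the single-shard benchmark result into a shard count.
# All numbers here are hypothetical placeholders.
import math

docs_per_shard = 20_000_000   # capacity found by benchmarking one shard
current_docs   = 50_000_000   # current dataset size in documents
growth_docs    = 10_000_000   # expected growth over the index's lifetime

# round up so the last partial shard's worth of documents still fits
required_shards = math.ceil((current_docs + growth_docs) / docs_per_shard)
```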
2. Number of replicas
Start with 1 replica. A replica shard will be placed on a different node from its primary shard, so if one node goes down you still have the full dataset available. One replica is usually sufficient; if you find you need more, you can always add them later on.
3. Should I separate out master and data nodes?
It depends on the size of your cluster. If you have more than 5 nodes in your cluster, it is advisable to have master-only nodes that maintain the cluster state only.
4. Can all the nodes be data nodes?
There must always be at least one master node in your cluster; the master node maintains the cluster state. If you have a small cluster (< 5 nodes), you can make every node both a data node and a master-eligible node. One of the nodes will be elected as the master; if the master node goes down, another node in the cluster will be elected as master. If you have master-only nodes as described in point 3, the rest of the nodes in the cluster can be data-only nodes.
5. I don't have any advanced search use case, but at least need plural matching: (phone) should match all docs with phones and vice versa. Is any special stemming needed in this case?
Yes, stemming will handle your use case.
Also, Elasticsearch comes with very good configurations out of the box; you should start out by only changing the configurations listed in the link below.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_important_configuration_changes.html

Elasticsearch - Limiting allocation of shards

I've read a number of articles / forums on the placing of indexes/shards but have not yet found a solution to my requirement.
Fundamentally, I want to use Logstash (+ Elasticsearch/Kibana) to build a globally distributed cluster, but I want to limit the placement of primary and replica shards to be local to the region they were created in to reduce WAN traffic, but I also want to be able to query all data as a single dataset.
Example
Let's say I have two ES nodes in UK (uknode1/uknode2), and two in US (usnode1/usnode2).
If Logstash sends some data to usnode1, I want it to place the replica on usnode2, and not send this across the WAN to the uknode* nodes.
I've tried playing around with index and routing allocation settings, but cannot stop the shards being distributed across all 4 nodes. It's slightly complicated by the fact that index names are dynamically built based on the "type", but that's another challenge for a later date. Even with one index, I can't work this out.
I could split this into two separate clusters but I want to be able to query all nodes as a single dataset (via Kibana) so I don't think that is a valid option at this stage as Kibana can only query one cluster.
Is this even possible to achieve?
The reason I ask if this is possible is what would happen if I write to an index called "myTest" on UK node, and the same index on a US node.....as this is ultimately the same index and I'm not sure how ES would handle this.
So if anyone has any suggestions, or just to say "not possible", that would be very helpful.
It's possible, but not recommended. Elasticsearch needs a reliable data connection between nodes in the cluster to function, which is difficult to ensure for a geographically distributed cluster. A better solution would be to have two clusters, one in the UK and another in the US. If you need to search both of them at the same time, you can use a tribe node.
Thanks. I looked into this a bit more and have the solution, which is indeed to use tribe nodes.
For anyone who isn't familiar with them, this is a new feature in ES 1.0.0+
What you do is allocate a new ES node as a tribe node, and configure it to connect to all your other clusters, and when you run a query against it, it queries all clusters and returns a consolidated set of results from all of them.
So in my scenario, I have two distinct clusters, one in each region, something like this.
US Region
cluster.name: us-region
Two nodes in this region called usnode1 and usnode2
Both nodes are master/data nodes
UK Region
cluster.name: uk-region
Two nodes in this region called uknode1 and uknode2
Both nodes are master/data nodes
Then you create another ES node and add some configuration to make it a tribe node.
Edit elasticsearch.yml with something like this :
node.data: false
node.master: false
tribe.blocks.write: false
tribe.blocks.metadata: false
tribe.t1.cluster.name: us-region
tribe.t1.discovery.zen.ping.unicast.hosts: ["usnode1","usnode2"]
tribe.t2.cluster.name: uk-region
tribe.t2.discovery.zen.ping.unicast.hosts: ["uknode1","uknode2"]
You then point Kibana to the tribe node and it worked brilliantly - excellent feature.
Kibana dashboards still save, although I'm not sure yet how it picks which cluster to save to. It seems to address my question, so with a bit more playing I think I'll have it sorted.
