Elasticsearch cluster setup information

I'm new to search and Elasticsearch. I have gone through some online docs and developed an app using an Elasticsearch setup in our test environment. So far, development and testing have gone smoothly. Now, to set up the production cluster, I need some expert advice on:
Number of shards
Number of replicas
Should I separate out master and data nodes?
Can all the nodes be data nodes?
I don't have any advanced search use case, but I at least need plural matching: "phone" should match all docs with "phones" and vice versa. Is any special stemming needed in this case?
My use case and traffic patterns are:
Up to 100M reads per day
Up to 1M writes/updates per day
Initial data size 10 GB, growth rate 1 GB every 6 months
Cluster info
1. Initial cluster size: 14 machines, each with 28 GB RAM / 120 GB spinning hard disk / 12 cores
2. A load balancer with DNS would distribute the traffic to any of the 14 machines.
I have used unicast, and I have set bootstrap.mlockall: true and index.routing.allocation.disable_allocation: false.
Please advise.
Thanks

1. Number of shards
The number of primary shards of an Elasticsearch index is a one-time setting: once the shard count is set, you cannot change it without reindexing. So you need to plan how many shards your cluster requires, taking into consideration your current dataset size plus any index growth. To do this, set up one Elasticsearch node with one shard and zero replicas on a box that has the same specifications as your production boxes.
The capacity of a single shard will depend on a number of factors:
The size of your documents
The size of your fields
The amount of RAM you assign to the JVM that runs Elasticsearch. If you have lots of aggregations, sorting and parent/child documents, you will need to make sure that you have assigned enough RAM to Elasticsearch so it can cache the results.
Your number of queries per second requirement.
The maximum search request response time allowed.
Index documents into your single-shard node in iterations of x million (or fewer), and at each iteration run benchmarks by executing x queries per second with a testing tool like JMeter. When the queries in your tests return response times that approach your maximum search request time, you have the number of documents a single shard can hold. Once you have this value, you can calculate the number of shards required for your full dataset and how many more you will need for index growth.
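The final calculation can be sketched in a few lines of Python. This is only an illustration: docs_per_shard stands for whatever capacity your own benchmark produced, and the document counts and the growth_factor parameter below are made-up example values, not recommendations.

```python
import math


def shards_needed(total_docs: int, docs_per_shard: int, growth_factor: float = 1.0) -> int:
    """Primary shards required to hold total_docs, scaled by expected growth.

    docs_per_shard is the capacity measured in the single-shard benchmark.
    growth_factor is a hypothetical multiplier (1.5 means "plan for 50% more data").
    """
    if docs_per_shard <= 0:
        raise ValueError("docs_per_shard must be positive")
    projected_docs = math.ceil(total_docs * growth_factor)
    return max(1, math.ceil(projected_docs / docs_per_shard))


# Example values only: 50M documents, 20M documents per shard from the benchmark.
print(shards_needed(50_000_000, 20_000_000))       # shards for the current dataset
print(shards_needed(50_000_000, 20_000_000, 1.5))  # shards when planning for growth
```

Rounding up matters here: a dataset that needs 2.5 shards' worth of capacity needs 3 shards, not 2.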
2. Number of replicas
Start with 1 replica. A replica shard is placed on a different node from its primary shard, so if one node goes down you still have the full dataset available. One replica is usually sufficient; if you find you need more, you can always add them later.
3. Should I separate out master and data nodes
It depends on the size of your cluster. If you have more than 5 nodes, it is advisable to have dedicated master-only nodes whose sole job is to maintain the cluster state.
4. Can all the nodes be data nodes
There must always be at least one master node in your cluster; the master node maintains the cluster state. If you have a small cluster (< 5 nodes), you can make every node both a data node and a master-eligible node. One of the nodes will be elected as the master, and if it goes down another node in the cluster will be elected in its place. If you have master-only nodes as described in point 3, the rest of the nodes in the cluster can be data-only nodes.
5. I don't have any advanced search use case, but I at least need plural matching: "phone" should match all docs with "phones" and vice versa. Is any special stemming needed in this case?
Yes, stemming will handle your use case.
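As a rough sketch of what that could look like, the snippet below builds index settings with a custom analyzer that applies Elasticsearch's built-in stemmer token filter. The names plural_stemmer and plural_text are made up for this example, and the choice of the minimal_english stemmer (which mostly strips plural endings) is an assumption; a more aggressive stemmer may suit other data better.

```python
import json

# Hypothetical analyzer and filter names. "stemmer" is a built-in token filter;
# "minimal_english" is one of its language options and mainly removes plurals.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "plural_stemmer": {"type": "stemmer", "language": "minimal_english"}
            },
            "analyzer": {
                "plural_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "plural_stemmer"],
                }
            },
        }
    }
}

# This JSON would be sent as the body of the create-index request.
print(json.dumps(index_body, indent=2))
```

The analyzer then has to be assigned to the relevant text fields in the mapping so that both indexed text and queries are stemmed the same way.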
Also, Elasticsearch comes with very good configurations out of the box; you should start out by changing only the configurations listed in the link below.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_important_configuration_changes.html

Related

Configuring Elastic Search cluster with machines of different capacity(CPU, RAM) for rolling upgrades

Due to cost restrictions, I only have the following types of machines at disposal for setting up an ES cluster.
Node A: Lean(w.r.t. CPU, RAM) Instance
Node B: Beefy(w.r.t. CPU,RAM) Instance
Node M: "Leaner than A"(w.r.t. CPU, RAM) Instance
Disk-wise, both A and B have the same size.
My plan is to set up Node A and Node B acting as Master Eligible, Data node and Node M as Master-Eligible Only node(no data storing).
Because the two data nodes are NOT identical, what would be the implications?
I am going to make it a cluster of 3 machines only for the possibility of rolling upgrades (the current volume of data and the expected growth for a few years can be managed with vertical scaling, and leaving the default number of shards and replicas would let me scale horizontally if there is a need).
There is absolutely no need for your machines to have the same specs. You will need 3 master-eligible nodes not just for rolling-upgrades, but for high availability in general.
If you want to scale horizontally, you can do so either by creating more indices to hold your data or by configuring your index to have multiple primary and/or replica shards. Since version 7, new indices are created by default with 1 primary and 1 replica shard. A single index like this does not really allow you to scale horizontally.
Update:
With respect to load and shard allocation (where to put data), Elasticsearch by default will simply consider the amount of storage available. When you start up an instance of Elasticsearch, it introspects the hardware and configures its thread pools (number of threads and size of queue) for various tasks accordingly. So the number of available threads to process tasks can vary. If I'm not mistaken, the coordinating node (the node receiving the external request) will distribute indexing/write requests in a round-robin fashion, not taking load into consideration. Depending on your version of Elasticsearch, this is different for search/read requests, where the coordinating node will leverage adaptive replica selection, taking into account the load/response time of the various replicas when distributing requests.
Besides this, sizing and scaling is too complex a topic to be answered comprehensively in a simple response. It typically also involves testing to figure out the limits/boundaries of a single node.
BTW: the number of default primary shards got changed in v7.x of Elasticsearch, as too much oversharding was one of the most common issues Elasticsearch users were facing. A “reasonable” shard size is in the tens of Gigabytes.

How many shards should I use with Elasticsearch on a dev & CI environment?

By default Elasticsearch is configured to start with 5 shards.
Is there a reason to use 5 shards locally (on my development machine) and on the continuous integration server (for integration tests)? Is it better to use 1?
Obviously I don't care about scalability in those cases, I just want the simplest setup.
The simplest setup is 1 primary shard, 0 replicas.
If you only have one node and the replica count is > 0, the cluster will always be yellow. That is not a problem per se, but the replicas are not needed.
If you want to test search response time with that one shard, for example, whether 1 shard is enough depends on several factors. The simplest rule of thumb is to keep shards no larger than 30-50 GB, but that number varies as well.
So, I'd say if you have one node, start with 1 primary, 0 replicas. If that primary is too "large", think about having more primaries (each shard will do part of the work and each will use one CPU core for searching).
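A minimal sketch of that setup, written as a small Python helper that builds the settings body for a create-index request. The helper name index_settings is made up for this example; number_of_shards and number_of_replicas are the index settings being discussed.

```python
import json


def index_settings(primaries: int = 1, replicas: int = 0) -> dict:
    """Build a create-index body; defaults to the simplest single-node setup."""
    if primaries < 1 or replicas < 0:
        raise ValueError("need at least 1 primary shard and a non-negative replica count")
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,    # fixed once the index exists
                "number_of_replicas": replicas,   # can be changed later
            }
        }
    }


# 1 primary, 0 replicas: a single node can allocate everything, so no yellow state.
print(json.dumps(index_settings(), indent=2))
```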
Once you've pushed some data with a specific shard configuration, you cannot set a different number of shards without re-indexing your data. So my guess is that the default configuration of Elasticsearch is made so that you can scale your cluster to 5 nodes (each node then gets one shard) without headaches.
From the Elasticsearch documentation:
A new index in Elasticsearch is allotted five primary shards by default. That means that we can spread that index out over a maximum of five nodes, with one shard on each node. That’s a lot of capacity, and it happens without you having to think about it at all!

How to configure number of shards per cluster in elasticsearch

I don't understand the configuration of shards in ES.
I have a few questions about sharding in ES:
The number of primary shards is configured through index.number_of_shards parameter, right?
So, it means that the number of shards is configured per index.
If so, if I have 2 indexes, then I will have 10 shards on the node?
Assume I have one node (Node 1) that is configured with 3 shards and 1 replica.
Then I create a new node (Node 2) in the same cluster, with 2 shards.
So, I assume I will have replicas for only two shards, right?
In addition, what happens if Node 1 goes down? How does the cluster "know" that it should have 3 shards instead of 2? Since I have only 2 shards on Node 2, does it mean that I lost the data of one of the shards on Node 1?
So first off I'd start reading about indexes, primary shards, replica shards and nodes to understand the differences:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html
This is a pretty good description:
2.3 Index Basics
The largest single unit of data in elasticsearch is an index. Indexes
are logical and physical partitions of documents within elasticsearch.
Documents and document types are unique per-index. Indexes have no
knowledge of data contained in other indexes. From an operational
standpoint, many performance and durability related options are set
only at the per-index level. From a query perspective, while
elasticsearch supports cross-index searches, in practice it usually
makes more organizational sense to design for searches against
individual indexes.
Elasticsearch indexes are most similar to the ‘database’ abstraction
in the relational world. An elasticsearch index is a fully partitioned
universe within a single running server instance. Documents and type
mappings are scoped per index, making it safe to re-use names and ids
across indexes. Indexes also have their own settings for cluster
replication, sharding, custom text analysis, and many other concerns.
Indexes in elasticsearch are not 1:1 mappings to Lucene indexes, they
are in fact sharded across a configurable number of Lucene indexes, 5
by default, with 1 replica per shard. A single machine may have a
greater or lesser number of shards for a given index than other
machines in the cluster. Elasticsearch tries to keep the total data
across all indexes about equal on all machines, even if that means
that certain indexes may be disproportionately represented on a given
machine. Each shard has a configurable number of full replicas, which
are always stored on unique instances. If the cluster is not big
enough to support the specified number of replicas the cluster’s
health will be reported as a degraded ‘yellow’ state. The basic dev
setup for elasticsearch, consequently, always thinks that it’s
operating in a degraded state given that by default indexes, a single
running instance has no peers to replicate its data to. Note that this
has no practical effect on its operation for development purposes. It
is, however, recommended that elasticsearch always run on multiple
servers in production environments. As a clustered database, many of
data guarantees hinge on multiple nodes being available.
From here: http://exploringelasticsearch.com/modeling_data.html#sec-modeling-index-basics
When you create an index, you tell it how many primary and replica shards to use: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-create-index.html. ES defaults to 5 primary shards and 1 replica shard per primary, for a total of 10 shards.
These shards will be balanced over however many nodes you have in the cluster, with the constraint that a primary and its replica(s) cannot reside on the same node. So if you start with 2 nodes and the default 5 primary shards and 1 replica per primary, you will get 5 shards per node. Add more nodes and the number of shards per node drops. Add more indexes and the number of shards per node increases.
In all cases the number of nodes must be at least 1 greater than the configured number of replicas. So if you configure 1 replica, you should have 2 nodes so that the primary can be on one and the single replica on the other; otherwise the replicas will not be assigned and your cluster status will be yellow. If you configure 2 replicas, which means 1 primary shard and 2 replica shards, you need at least 3 nodes to keep them all separate. And so on.
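The arithmetic in the last few paragraphs can be written down as a quick sanity check. The function names here are made up for illustration; the numbers are the ones used in the text.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Every primary shard plus `replicas` copies of each primary."""
    return primaries * (1 + replicas)


def min_nodes_for_all_replicas(replicas: int) -> int:
    """A primary and its replicas never share a node, so one node per copy."""
    return replicas + 1


print(total_shard_copies(5, 1))        # default 5 primaries with 1 replica each
print(total_shard_copies(5, 1) // 2)   # shards per node on a 2-node cluster
print(min_nodes_for_all_replicas(1))   # nodes needed before 1 replica can be assigned
print(min_nodes_for_all_replicas(2))   # nodes needed before 2 replicas can be assigned
```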
Your questions can't be answered directly because they are based on incorrect assumptions about how ES works. You don't add a node with shards - you add a node and then ES will re-balance the existing shards across the entire cluster. Yes, you do have some control over this if you want but I would not do so until you are much more familiar with ES operations. I'd read up on it here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html and here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-reroute.html and here: http://exploringelasticsearch.com/advanced_techniques.html#advanced-routing
From the last link:
8.1.1 How Elasticsearch Routing Works
Understanding routing is important in large elasticsearch clusters. By
exercising fine-grained control over routing the quantity of cluster
resources used can be severely reduced, often by orders of magnitude.
The primary mechanism through which elasticsearch scales is sharding.
Sharding is a common technique for splitting data and computation
across multiple servers, where a property of a document has a function
returning a consistent value applied to it in order to determine which
server it will be stored on. The value used for this in elasticsearch
is the document’s _id field by default. The algorithm used to convert
a value to a shard id is what’s known as a consistent hashing
algorithm.
Maintaining good cluster performance is contingent upon even shard
balancing. If data is unevenly distributed across a cluster some
machines will be over-utilized while others will remain mostly idle.
To avoid this, we want as even a distribution of numbers coming out of
our consistent hashing algorithm as possible. Document ids hash well
generally because they are evenly distributed if they are either UUIDs
or monotonically increasing ids (1,2,3,4 …).
This is the default approach, and it generally works well as it solves
the problem of evening out data across the cluster. It also means that
fetches for a single document only need to be routed to the shard that
document hashes to. But what about routing queries? If, for instance,
we are storing user history in elasticsearch, and are using UUIDs for
each piece of user history data, user data will be stored evenly
across the cluster. There’s some waste here, however, in that this
means that our searches for that user’s data have poor data locality.
Queries must be run on all shards within the index, and run against
all possible data. Assuming that we have many users we can likely
improve query performance by consistently routing all of a given
user’s data to a single shard. Once the user’s data has been
so-segmented, we’ll only need to execute across a single shard when
performing operations on that user’s data.
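To make the routing idea concrete, here is a toy Python model of it. It is a sketch under loud assumptions: zlib.crc32 is used only as a stand-in hash (Elasticsearch uses its own hash function internally), shard selection is modelled as a plain hash modulo the number of primary shards, and the user id "user-42" is invented.

```python
import uuid
import zlib


def shard_for(routing_value: str, num_primary_shards: int) -> int:
    """Toy model: hash the routing value and take it modulo the shard count."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


NUM_SHARDS = 5

# 1000 history entries for one user, each with its own UUID as document id.
doc_ids = [str(uuid.uuid4()) for _ in range(1000)]

# Default behaviour: route by document id, so the entries scatter over the shards.
by_id = {shard_for(doc_id, NUM_SHARDS) for doc_id in doc_ids}

# Custom routing: route every entry by the (hypothetical) user id instead.
by_user = {shard_for("user-42", NUM_SHARDS) for _ in doc_ids}

print("shards touched when routing by document id:", sorted(by_id))
print("shards touched when routing by user id:", sorted(by_user))
```

With routing by user id, a search for that user's history only has to visit the single shard in by_user instead of every shard in the index.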
Yes, the number of shards is per index. So if you had 2 indexes, each with 5 shards, then yes, you would have a total of 10 shards distributed across all your nodes.
The number of shards is unrelated to the number of nodes in the cluster. If you have 3 shards and one node, obviously all 3 shards will reside on that one node. However, if you then add an additional node, more shards are not magically created and you can't specify that a certain number of shards should reside on that new node. Rather, the existing shards are distributed as evenly as possible across all nodes resulting in one node with 2 shards and one node with 1 shard, for a total of 3. If you added a third node, then each node would house 1 shard for a total of 3. In other words, the number of shards is fixed and doesn't scale as you add more nodes.
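The even spreading described above can be modelled in a couple of lines. This is a deliberate simplification: the real allocator also weighs things like disk usage, so treat it as an illustration of the counts only.

```python
def spread(shards: int, nodes: int) -> list[int]:
    """Number of shards each node ends up with when spread as evenly as possible."""
    base, extra = divmod(shards, nodes)
    return [base + (1 if i < extra else 0) for i in range(nodes)]


print(spread(3, 1))  # one node holds all three shards
print(spread(3, 2))  # adding a node moves shards, it does not create new ones
print(spread(3, 3))  # one shard per node
print(spread(3, 4))  # a fourth node would hold nothing for this index
```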
As to your final question, it's based on a false premise, so it's difficult to answer. Rather, let's take the example above of three shards and two nodes. In that setup, one node will house 2 shards and one node will house 1 shard. If either of those nodes goes down, your cluster goes down, because neither has a complete set of shards. The first node is missing 1 shard and the second node is missing 2. This is where replicas come in. By adding replicas, you can ensure that each node ends up with a full set of shards. For example, if you added 1 replica in the above scenario, then the first node would have 2 active shards and 1 replica of the third that lives on the other node. The second node would have 1 active shard and 1 replica each of the two that live on the first. As a result, if either node went down, the cluster can merely activate the replicas and still continue working.
1) Yes, the number of shards is configured per index. It is a static setting and should be chosen when creating the index. If you want to change the number of shards at a later point in time, you have to reindex the documents, which takes time.
2) The number of shards doesn't depend on the number of nodes in your cluster. Let's say you run a bookseller website. You have 100 books to sell. Your website has an Elasticsearch cluster with 3 nodes. You create a book index with 5 shards. Each shard contains 20 books. 2 shards will reside on node 1, 2 shards on node 2 and 1 shard on node 3. Now let's say node 2 has gone down. We still have 2 shards on node 1 and 1 shard on node 3. Querying Elasticsearch will still return the data on node 1 and node 3, i.e. the data for 60 books will still be available. The data for 40 books is lost.
But the overall cluster status will be red, meaning the cluster is partially functioning but some data is not available.
To make the system fault tolerant you can configure replicas. By default Elasticsearch makes one replica of each shard. So in this case, if the default configuration is not overwritten, copies of the 2 shards on node 2 will also exist on node 1 or node 3, and they become the primary shards when node 2 is not available. So all the data is available even when node 2 is down.
In this case the overall cluster health will be yellow, meaning the cluster is still functional, but some nodes are lost.
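The book example can be replayed with a small Python model. Everything here is illustrative: the shard-to-node layouts are one possible allocation consistent with the description, and health_after_failure is a made-up helper that mimics the red/yellow/green rule rather than calling Elasticsearch.

```python
def health_after_failure(copies: dict, failed_node: str, copies_wanted: int) -> str:
    """copies maps a shard id to the set of nodes holding a copy of that shard."""
    surviving = {shard: nodes - {failed_node} for shard, nodes in copies.items()}
    if any(not nodes for nodes in surviving.values()):
        return "red"     # at least one shard has no copy left: data is unavailable
    if any(len(nodes) < copies_wanted for nodes in surviving.values()):
        return "yellow"  # all data is there, but some copies are missing
    return "green"


# 5 shards of 20 books each, no replicas: 2 shards on node1, 2 on node2, 1 on node3.
no_replicas = {0: {"node1"}, 1: {"node1"}, 2: {"node2"}, 3: {"node2"}, 4: {"node3"}}

# Same index with 1 replica per shard (one possible layout).
one_replica = {
    0: {"node1", "node2"},
    1: {"node1", "node3"},
    2: {"node2", "node3"},
    3: {"node2", "node1"},
    4: {"node3", "node1"},
}

print(health_after_failure(no_replicas, "node2", copies_wanted=1))
print(health_after_failure(one_replica, "node2", copies_wanted=2))

# Books still searchable without replicas once node2 is gone.
books_left = 20 * sum(1 for nodes in no_replicas.values() if nodes - {"node2"})
print(books_left)
```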
Answer 1) Yes, you will have 10 shards for 2 indexes with 5 shards each.
Answer 2) I think you are confusing shards with indexes.
Shards are the pieces an index is split into, not a node.
If you create an index with 3 shards and 1 replica,
you will get 3 primary shards and 3 replica shards.
If you start a new node, the shards will be balanced with the new node. So you will have 3 shards on the old node and 3 shards on the new node.
If the old node fails, you will survive on the new node's data. It will have an exact copy of the old node.
To understand basic concepts of elasticsearch refer
Hope it helps!

ElasticSearch - Optimal number of Shards per node

I would appreciate if someone could suggest the optimal number of shards per ES node for optimal performance or provide any recommended way to arrive at the number of shards one should use, given the number of cores and memory foot print.
I'm late to the party, but I just wanted to point out a couple of things:
The optimal number of shards per index is always 1. However, that provides no possibility of horizontal scale.
The optimal number of shards per node is always 1. However, then you cannot scale horizontally more than your current number of nodes.
The main point is that shards have an inherent cost to both indexing and querying. Each shard is actually a separate Lucene index. When you run a query, Elasticsearch must run that query against each shard, and then compile the individual shard results together to come up with a final result to send back. The benefit to sharding is that the index can be distributed across the nodes in a cluster for higher availability. In other words, it's a trade-off.
Finally, it should be noted that any more than 1 shard per node will introduce I/O considerations. Since each shard must be indexed and queried individually, a node with 2 or more shards would require 2 or more separate I/O operations, which can't be run at the same time. If you have SSDs on your nodes then the actual cost of this can be reduced, since all the I/O happens much quicker. Still, it's something to be aware of.
That, then, begs the question of why would you want to have more than one shard per node? The answer to that is planned scalability. The number of shards in an index is fixed. The only way to add more shards later is to recreate the index and reindex all the data. Depending on the size of your index that may or may not be a big deal. At the time of writing, Stack Overflow's index is 203GB (see: https://stackexchange.com/performance). That's kind of a big deal to recreate all that data, so resharding would be a nightmare. If you have 3 nodes and a total of 6 shards, that means that you can scale out to up to 6 nodes at a later point easily without resharding.
There are three situations to consider before sharding.
Situation 1) You want to use Elasticsearch with failover and high availability. Then you go for sharding.
In this case, you need to select the number of shards according to the number of nodes (ES instances) you want to use in production.
Suppose you want to run 3 nodes in production. Then you need to choose 1 primary shard and 2 replicas for every index. If you choose more shards than you need.
Situation 2) Your current server can hold the current data, but because the data keeps growing you may end up with no space on disk, or your server may not be able to handle that much data. Then you need to configure more shards, like 2 or 3 (it's up to your requirements), for each index. But there shouldn't be any replicas.
Situation 3) This is the combination of situations 1 and 2, so you need to combine both configurations. Suppose your data grows dynamically and you also need high availability and failover. Then you configure an index with 2 shards and 1 replica. That way you can share data among nodes and get optimal performance.
Note: A query is processed in each shard, and then a map-reduce step is performed on the results from all shards before the result is returned to us. That map-reduce step is expensive, so a minimal number of shards gives optimal performance.
If you are using only one node in production, then one primary shard is the optimal number of shards for each index.
Hope it helps!
Just got back from configuring some log storage for 10 TB so let's talk sharding :D
Node limitations
Main source: The definitive guide to elasticsearch
HEAP: 32 GB at most:
If the heap is less than 32 GB, the JVM can use compressed pointers, which saves a lot of memory: 4 bytes per pointer instead of 8 bytes.
HEAP: 50% of the server memory at most. The rest is left to filesystem caches (thus 64 GB servers are a common sweet spot):
Lucene makes good use of the filesystem caches, which are managed by the kernel. Without enough filesystem cache space, performance will suffer. Furthermore, the more memory dedicated to the heap means less available for all your other fields using doc values.
[An index split in] N shards can spread the load over N servers:
1 shard can use all the processing power from 1 node (it's like an independent index). Operations on sharded indices are run concurrently on all shards and the result is aggregated.
Fewer shards is better (the ideal is 1 shard):
The overhead of sharding is significant. See this benchmark for numbers https://blog.trifork.com/2014/01/07/elasticsearch-how-many-shards/
Fewer servers is better (the ideal is 1 server (with 1 shard)):
The load on an index can only be split across nodes by sharding (a shard is enough to use all resources on a node). More shards allow the use of more servers, but more servers bring more overhead for data aggregation... There is no free lunch.
Configuration
Usage: A single big index
We put everything in a single big index and let elasticsearch do all the hard work relating to sharding data. There is no logic whatsoever in the application so it's easier to dev and maintain.
Let's suppose that we plan for the index to be at most 111 GB in the future and we've got 50 GB servers (25 GB heap) from our cloud provider.
That means we should have 5 shards.
Note: Most people tend to overestimate their growth; try to be realistic. For instance, this 111 GB example is already a BIG index. For comparison, the Stack Overflow index is 430 GB (2016), and it's a top 50 site worldwide, made entirely of text written by millions of people.
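The 5-shard figure follows from the node limits listed earlier, and can be reproduced as below. This is a sketch of the reasoning as I read it: heap is taken as half the server memory, capped just under 32 GB (the 31 GB cap is my assumption), and each shard is kept no larger than the heap.

```python
import math


def heap_gb(server_ram_gb: float, heap_cap_gb: float = 31.0) -> float:
    """Heap = at most 50% of RAM, and below 32 GB to keep compressed pointers.

    The 31 GB default cap is an assumed safe value, not an official number.
    """
    return min(server_ram_gb / 2, heap_cap_gb)


def shards_for_index(index_size_gb: float, server_ram_gb: float) -> int:
    """Shards needed so that each shard is no larger than one node's heap."""
    return math.ceil(index_size_gb / heap_gb(server_ram_gb))


print(heap_gb(50))                # 25.0 GB heap on a 50 GB server
print(shards_for_index(111, 50))  # 5 shards for the planned 111 GB index
```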
Usage: Index by time
When there is too much data for a single index, or it's getting too annoying to manage, the next step is to split the index by time period.
The most extreme example is logging applications (Logstash and Graylog), which use a new index every day.
The ideal configuration of one single shard per index makes perfect sense in this scenario. The index rotation period can be adjusted, if necessary, to keep the index smaller than the heap.
Special case: Let's imagine a popular internet forum with monthly indices. 99% of requests are hitting the last index. We have to set multiple shards (e.g. 3) to spread the load over multiple nodes. (Note: It's probably unnecessary optimization. A 99% hitrate is unlikely in the real world and the shard replica could distribute part of the read-only load anyway).
Usage: Going Exascale (just for the record)
ElasticSearch is magic. It's the easiest database to set up in a cluster and it's one of the very few able to scale to many nodes (excluding Spanner).
It's possible to go exascale with hundreds of elasticsearch nodes. There must be many indices and shards to spread the load on that many machines and that takes an appropriate sharding configuration (eventually adjusted per index).
The final bit of magic is to tune elasticsearch routing to target specific nodes for specific operations.
It might also be a good idea to have more than one primary shard per node, depending on the use case. I found that bulk indexing was pretty slow and only one CPU core was used, so we had idle CPU power and very low I/O; hardware was definitely not the bottleneck. Thread pool stats showed that only one bulk thread was active during indexing. We have a lot of analyzers and a complex tokenizer (decomposed analysis of German words). Increasing the number of shards per node resulted in more bulk threads being active (one per shard on the node), and it dramatically improved indexing speed.
The number of primary shards and replicas depends on the following parameters:
Number of data nodes: The replica shards for a given primary shard are meant to live on different data nodes. If there are 3 data nodes (DN1, DN2, DN3) and the primary shard is on DN1, then its replica shards should be on DN2 and/or DN3. Hence the number of replicas should be less than the total number of data nodes.
Capacity of each data node: A shard cannot be larger than the data node's hard disk, so the number of primary shards should be defined according to the expected size of the given index.
Recovery mechanism in case of failure: If the data in the given index can be recovered quickly, then 1 replica should be enough.
Performance requirement of the given index: Sharding helps direct the client node to the appropriate shard, which improves performance, so the query parameter and the size of the data belonging to that query parameter should be considered when defining the number of primary shards.
These are basic guidelines; they should be tuned to the actual use case.
I have not tested this yet, but AWS has a good article about ES best practices. Look at the Choosing Instance Types and Testing part.
Elastic.co recommends:
[…] keep the number of shards per node below 20 per GB heap it has configured
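As a quick worked example of that guideline, assuming it is read as a simple upper bound of 20 shards per GB of configured heap (the helper name is made up):

```python
def max_shards_per_node(heap_gb: float, shards_per_gb: int = 20) -> int:
    """Upper bound on shards per node under the "20 per GB of heap" guideline."""
    return int(heap_gb * shards_per_gb)


print(max_shards_per_node(25))  # a node with a 25 GB heap
print(max_shards_per_node(4))   # a small node with a 4 GB heap
```

This is a ceiling, not a target: the earlier answers still argue for as few shards as the data and the cluster size allow.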

Shards / Replicas settings for high availability

We have a Java application with embedded Elasticsearch in a cluster of 14 nodes. All the data resides in a central database and is indexed in Elasticsearch for querying. A full reindex can be done at any time.
The system is very query-heavy; the number of writes is small. The number of documents will not be higher than, say, 300,000.
The size of each document varies greatly, from just a couple of ids to extracted text from e.g. Word documents of several pages.
I want to make sure that, in case of a total breakdown, it is sufficient for one or two nodes to be available for the system to work.
Write consistency should not be a problem, since the master copy of the data is in the database, and it seems that ES is capable of resolving conflicting data by using the newest version (which should be all right in our case).
My first thought is to use 1 shard and 13 replicas. This will naturally ensure that all nodes have access to all data. This could also be accomplished with 2 shards / 13 replicas, which suggests that to ensure all data is available, the number of replicas should be the number of nodes - 1, regardless of the number of shards (which could be anything).
If the requirement is relaxed to "2 nodes should be up at any time", then a shards / replicas distribution of "x / number of nodes - 2" should be sufficient.
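Put as a formula, the reasoning above amounts to "replicas = number of nodes - number of nodes that must be enough on their own". A small sketch, with a made-up helper name:

```python
def replicas_needed(total_nodes: int, min_nodes_up: int = 1) -> int:
    """Replicas per shard so that any `min_nodes_up` surviving nodes hold all data.

    With this many replicas, each shard has total_nodes - min_nodes_up + 1 copies
    on distinct nodes, so any group of min_nodes_up nodes contains at least one.
    """
    if not 1 <= min_nodes_up <= total_nodes:
        raise ValueError("min_nodes_up must be between 1 and total_nodes")
    return total_nodes - min_nodes_up


print(replicas_needed(14, 1))  # any single node must be able to serve everything
print(replicas_needed(14, 2))  # any two nodes together must hold everything
```

The result does not depend on the number of primary shards, which matches the observation in the question.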
So, for the question:
Assuming the above setup and that my reasoning is correct, would a setup with 1 shard / 13 replicas make sense, or would there be anything to gain by adding more shards and running e.g. a 4 shards / 13 replicas setup?
After a good bit of research and talking to ES gurus:
As long as the shard size is small enough, the most efficient way of setting up this cluster would indeed be 1 shard only, with 13 replicas. I have not been able to pinpoint the threshold size of the shard for this starting to perform worse.
If the index is big, you will need more than one shard (if you want performance). Do you really need 13 replicas? If you configure only 2 replicas, ES works to keep it that way: if the node holding the primary fails, ES will create a new replica. You may need a balancer node too.
