Elasticsearch - how to add new shards and split index content

So my index grew too fast and now has 60 million docs in 3 shards (single node).
I want to buy more machines and split the content into more shards. How can I do this?
Is it just a matter of connecting new nodes to the cluster and updating the shard count on the master?

Afaik Elasticsearch cannot yet redistribute indexed documents automatically (see here). You would have to reindex all content. The problem behind it is that documents are routed to shards by hash value modulo the number of shards; if you just added shards and kept indexing, new documents would keep landing on the old shards too.
Elasticsearch does let you distribute documents according to a custom function (the routing parameter). You could route all new content to the new shards, but this makes deletions difficult, because you would then have to know whether a document is "old" or "new". It also ruins your uniform index statistics, which may bias ranking in nonobvious ways.
Bottom line: adding shards to an existing index requires reindexing all content or some heavy hacking.
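To make the routing arithmetic concrete (a simplified illustration; recent Elasticsearch versions actually use a murmur3 hash of the routing value, which defaults to the document id): suppose hash(_id) = 1234567. With 3 shards the document lands on shard 1234567 % 3 = 1, but with 5 shards the very same document would belong on shard 1234567 % 5 = 2. Under the new formula nearly every existing document sits on the "wrong" shard, which is why a full reindex is needed.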

You already have 3 shards, so if you add 2 nodes, Elasticsearch will automatically relocate 2 of the shards to the other 2 nodes, giving each shard roughly 3 times the resources.
If you want to add more shards, you need to reindex your data. This can be done by creating a new index with the desired number of shards and copying your data over to that index (see https://www.elastic.co/guide/en/elasticsearch/guide/current/reindex.html).
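A minimal sketch of that flow (index names are hypothetical; the _reindex API exists from Elasticsearch 2.3 on, older versions need a scan/scroll-based copy):

# 1) Create the new index with the desired number of primary shards
curl -XPUT 'localhost:9200/my_index_v2' -H 'Content-Type: application/json' -d '
{
  "settings": { "number_of_shards": 6, "number_of_replicas": 1 }
}'

# 2) Copy all documents from the old index into the new one
curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{
  "source": { "index": "my_index" },
  "dest": { "index": "my_index_v2" }
}'

Once the copy finishes, pointing an index alias at my_index_v2 lets clients keep using the old name.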

Related

In an Elasticsearch cluster, is there a way to allocate shards to a particular node at creation time?

I have a multi-node Elasticsearch cluster. On that cluster, I want to spread the shards of the same index across different nodes.
Suppose a document with various key-value pairs is to be ingested into the index. Based on a key-value, I want my master node to pick a specific data node that holds the documents sharing that key-value.
My approach is to have a single index across the nodes available in the cluster, with the shards of this index distributed so that documents with the same key-value pair sit on the same node. Is there a way to do this?
Also, I want to increase the number of shards in an index but am getting the error "index <index_name> must be read-only to resize index". How do I increase the number of shards?
There is the _routing field, which can group documents in a particular shard, but you cannot automatically pin the shard for a given value to a specific node. The closest you could get would be to handle it manually via the cluster reroute API.
However, why you would want to do that is not clear, and it is definitely not recommended, as it means a lot of manual control over something that Elasticsearch is pretty good at handling.
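Regarding the "must be read-only to resize index" error in the question above: that message comes from the _split API (available from Elasticsearch 6.1), which requires a write block on the source index and a target shard count that is a multiple of the source's. A hedged sketch with hypothetical index names:

# 1) Block writes on the source index (this is what "read-only" refers to)
curl -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '
{ "index.blocks.write": true }'

# 2) Split, e.g., 3 primary shards into 6 (must be a multiple of the source count)
curl -XPOST 'localhost:9200/my_index/_split/my_index_split' -H 'Content-Type: application/json' -d '
{ "settings": { "index.number_of_shards": 6 } }'

# 3) Re-enable writes on the new index
curl -XPUT 'localhost:9200/my_index_split/_settings' -H 'Content-Type: application/json' -d '
{ "index.blocks.write": null }'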

What exactly is a shard?

I am going through the documentation,
An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.
What exactly constitutes an Elasticsearch shard? Is it a Lucene thread configured with memory? Is it possible to adjust settings for an individual shard?
In addition to this answer, which should help, I can add that a shard actually wraps a full-fledged Lucene search engine.
You cannot change settings for individual shards; instead, you change settings at the index level and Elasticsearch applies them to the index's shards.
So Elasticsearch gives you the ability to split the workload on an index among all the shards (i.e., Lucene engines) of that index, which are located on different nodes.
Very simply put: Elasticsearch = distributed Lucene!
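For illustration, a hedged sketch (hypothetical index name) of the index-level settings model: number_of_shards is fixed at creation, and whatever you declare applies uniformly to every shard of the index rather than to any single shard:

# Settings are declared per index; Elasticsearch applies them to all of its shards
curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '
{
  "settings": { "number_of_shards": 3, "number_of_replicas": 1 }
}'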

ElasticSearch - How does sharding affect indexing performance?

I'm doing some benchmarks on a single-node cluster of ElasticSearch.
I'm facing the situation that more shards reduce indexing performance, at least on a single node (both in latency and throughput).
These are some of my numbers:
An index with 1 shard indexed +6K documents per minute.
An index with 5 shards indexed +3K documents per minute.
An index with 20 shards indexed +1K documents per minute.
I got the same results with the bulk API. So I'm wondering what the relationship is and why this happens.
Note: I don't have a resource problem! Resources (CPU and memory) are free.
Just to get you on the same page:
Your data is organized in indices, each made of shards and distributed across multiple nodes. If a new document needs to be indexed, an id is generated and the destination shard is calculated from this id. After that, the write is delegated to the node holding the calculated destination shard. This distributes your documents pretty evenly across all of your shards.
Finding documents by id is then easy, as the shard containing the wanted document can be calculated from the id alone; there is no need to search all shards. BTW, that's the reason why you can't change the number of shards afterwards: a changed shard count results in a different document distribution across your shards.
Now, just to make it clear: each shard is a separate Lucene index, made of segment files located on your disk. When writing, new segments are created. When a certain number of segment files is reached, the segments are merged.
So just introducing more shards without distributing them to other nodes will just introduce higher I/O and memory consumption for your single node.
While searching, the query is executed against each shard. Afterwards, the results of all shards need to be merged into one result: more shards, more CPU work to do...
Coming back to your question:
For your write heavy indexing case, with just one node, the optimal number of indices and shards is 1!
But for the search case (not accessing by id), the optimal number of shards per node is the number of CPUs available. That way, searching can be done in multiple threads, resulting in better search performance. Correction: searching and indexing are multithreaded; a single shard can fully utilize all CPU cores of a node.
But what are the benefits of sharding?
Availability: by replicating the shards to other nodes, you can still serve requests if some of your nodes can't be reached anymore!
Performance: distributing the primary shards to different nodes distributes the workload too.
So if your scenario is write heavy, keep the number of shards per index low. If you need better search performance, increase the number of shards, but keep the "physics" in mind. If you need reliability, take the number of nodes/replicas into account.
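As a concrete starting point for the write-heavy single-node case, a sketch under the assumptions above (the index name and values are illustrative; the indexing-speed guide below discusses the trade-offs):

# Write-optimized index on a single node:
# - 1 primary shard avoids per-shard overhead
# - 0 replicas during the initial load (add them later)
# - a longer refresh interval yields fewer, larger segments
curl -XPUT 'localhost:9200/writes_index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "refresh_interval": "30s"
  }
}'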
Further readings:
https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html
https://www.elastic.co/de/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
https://thoughts.t37.net/designing-the-perfect-elasticsearch-cluster-the-almost-definitive-guide-e614eabc1a87
I'm facing the situation that more shards reduce indexing performance, at least on a single node (both in latency and throughput).
For reference: Elasticsearch is a distributed database. Data is stored in an "index", the index is split into "shards". Each "shard" is allocated on a node (a different node if possible).
Having more shards allows you to use more machines. This is very much how the "distributed" in "distributed database" actually works. Elasticsearch will automatically allocate and move shards in the background to balance disk usage across all machines.
With 1 shard, the data sits on one node; this gives you a baseline of N reads and M writes per second.
With 3 shards, the data is split across three nodes, giving you 3 times the throughput.
Of course, this assumes that there are 3 machines available. With a single machine, that machine does all the processing either way, and having more shards has no effect.
There is a bit of overhead with sharding (queries have to be distributed and the results merged back), hence doubling the number of shards will not exactly double performance; expect on the order of +90%.
Your cluster has a single machine. You lose performance when you increase the number of shards because it only increases the overhead.
P.S. Shards have a replica by default; the replica takes over if the primary is gone (machine failure). This is how resiliency works. An index with 5 shards and 5 replicas (one per shard) can fully utilize 10 nodes, meaning it takes only a few shards to use many nodes.
P.P.S. In my experience, 5 shards is a practical maximum. You should never set more than that unless you are working with large clusters (10+ machines) or terabyte-scale indexes.

How to configure number of shards per cluster in elasticsearch

I don't understand the configuration of shards in ES.
I have a few questions about sharding in ES:
The number of primary shards is configured through the index.number_of_shards parameter, right?
So it means that the number of shards is configured per index.
If so, if I have 2 indexes, will I have 10 shards on the node?
Assuming I have one node (Node 1) configured with 3 shards and 1 replica,
then I create a new node (Node 2) in the same cluster, with 2 shards.
So, I assume I will have replicas of only two of the shards, right?
In addition, what happens if Node 1 goes down? How does the cluster "know" that it should have 3 shards instead of 2? Since I have only 2 shards on Node 2, does it mean that I lost the data of one of the shards on Node 1?
So first off I'd start reading about indexes, primary shards, replica shards and nodes to understand the differences:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html
This is a pretty good description:
2.3 Index Basics
The largest single unit of data in elasticsearch is an index. Indexes are logical and physical partitions of documents within elasticsearch. Documents and document types are unique per-index. Indexes have no knowledge of data contained in other indexes. From an operational standpoint, many performance and durability related options are set only at the per-index level. From a query perspective, while elasticsearch supports cross-index searches, in practice it usually makes more organizational sense to design for searches against individual indexes.

Elasticsearch indexes are most similar to the 'database' abstraction in the relational world. An elasticsearch index is a fully partitioned universe within a single running server instance. Documents and type mappings are scoped per index, making it safe to re-use names and ids across indexes. Indexes also have their own settings for cluster replication, sharding, custom text analysis, and many other concerns.

Indexes in elasticsearch are not 1:1 mappings to Lucene indexes; they are in fact sharded across a configurable number of Lucene indexes, 5 by default, with 1 replica per shard. A single machine may have a greater or lesser number of shards for a given index than other machines in the cluster. Elasticsearch tries to keep the total data across all indexes about equal on all machines, even if that means that certain indexes may be disproportionately represented on a given machine. Each shard has a configurable number of full replicas, which are always stored on unique instances. If the cluster is not big enough to support the specified number of replicas, the cluster's health will be reported as a degraded 'yellow' state. The basic dev setup for elasticsearch, consequently, always thinks that it's operating in a degraded state, given that by default indexes have a replica and a single running instance has no peers to replicate its data to. Note that this has no practical effect on its operation for development purposes. It is, however, recommended that elasticsearch always run on multiple servers in production environments. As a clustered database, many of its data guarantees hinge on multiple nodes being available.
From here: http://exploringelasticsearch.com/modeling_data.html#sec-modeling-index-basics
When you create an index, you tell it how many primary and replica shards to use: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-create-index.html. ES defaults to 5 primary shards and 1 replica per primary, for a total of 10 shards.
These shards will be balanced over however many nodes you have in the cluster, with the constraint that a primary and its replica(s) cannot reside on the same node. So if you start with 2 nodes, the default 5 primary shards, and 1 replica per primary, you will get 5 shards per node. Add more nodes and the number of shards per node drops. Add more indexes and the number of shards per node increases.
In all cases the number of nodes must be at least 1 greater than the configured number of replicas. So if you configure 1 replica, you should have 2 nodes so that the primary can be on one and the single replica on the other; otherwise the replicas will not be assigned and your cluster status will be yellow. If you configure 2 replicas, which means 1 primary shard and 2 replica shards, you need at least 3 nodes to keep them all separate. And so on.
Your questions can't be answered directly because they are based on incorrect assumptions about how ES works. You don't add a node with shards; you add a node and then ES will rebalance the existing shards across the entire cluster. Yes, you do have some control over this if you want, but I would not do so until you are much more familiar with ES operations. I'd read up on it here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html and here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-reroute.html and here: http://exploringelasticsearch.com/advanced_techniques.html#advanced-routing
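For reference, the manual control mentioned above goes through the cluster reroute API from the second link; a hedged sketch (index, shard number, and node names are hypothetical, and as said, you normally shouldn't need this):

# Manually move shard 0 of my_index between two named nodes
curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands": [
    { "move": { "index": "my_index", "shard": 0, "from_node": "node1", "to_node": "node2" } }
  ]
}'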
From the last of those three links:
8.1.1 How Elasticsearch Routing Works
Understanding routing is important in large elasticsearch clusters. By exercising fine-grained control over routing, the quantity of cluster resources used can be severely reduced, often by orders of magnitude.

The primary mechanism through which elasticsearch scales is sharding. Sharding is a common technique for splitting data and computation across multiple servers, where a property of a document has a function returning a consistent value applied to it in order to determine which server it will be stored on. The value used for this in elasticsearch is the document's _id field by default. The algorithm used to convert a value to a shard id is what's known as a consistent hashing algorithm.

Maintaining good cluster performance is contingent upon even shard balancing. If data is unevenly distributed across a cluster, some machines will be over-utilized while others will remain mostly idle. To avoid this, we want as even a distribution of numbers coming out of our consistent hashing algorithm as possible. Document ids generally hash well because they are evenly distributed if they are either UUIDs or monotonically increasing ids (1, 2, 3, 4, ...).

This is the default approach, and it generally works well as it solves the problem of evening out data across the cluster. It also means that fetches for a single document only need to be routed to the shard that document hashes to. But what about routing queries? If, for instance, we are storing user history in elasticsearch, and are using UUIDs for each piece of user history data, user data will be stored evenly across the cluster. There's some waste here, however, in that this means that our searches for that user's data have poor data locality. Queries must be run on all shards within the index, and run against all possible data. Assuming that we have many users, we can likely improve query performance by consistently routing all of a given user's data to a single shard. Once the user's data has been so-segmented, we'll only need to execute across a single shard when performing operations on that user's data.
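A hedged sketch of that user-based routing (the index name, ids, and the _doc endpoint of recent versions are assumptions; older versions use a custom type name in the URL): every document indexed with the same routing value lands on the same shard, and passing the value at search time queries only that shard:

# Index two history entries for the same user; both hash to one shard
curl -XPUT 'localhost:9200/user_history/_doc/1?routing=user42' -H 'Content-Type: application/json' -d '{ "action": "login" }'
curl -XPUT 'localhost:9200/user_history/_doc/2?routing=user42' -H 'Content-Type: application/json' -d '{ "action": "purchase" }'

# Search only the shard holding user42's data instead of all shards
curl -XGET 'localhost:9200/user_history/_search?routing=user42' -H 'Content-Type: application/json' -d '
{ "query": { "match": { "action": "login" } } }'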
Yes, the number of shards is per index. So if you had 2 indexes, each with 5 shards, then yes, you would have a total of 10 shards distributed across all your nodes.
The number of shards is unrelated to the number of nodes in the cluster. If you have 3 shards and one node, obviously all 3 shards will reside on that one node. However, if you then add an additional node, more shards are not magically created and you can't specify that a certain number of shards should reside on that new node. Rather, the existing shards are distributed as evenly as possible across all nodes resulting in one node with 2 shards and one node with 1 shard, for a total of 3. If you added a third node, then each node would house 1 shard for a total of 3. In other words, the number of shards is fixed and doesn't scale as you add more nodes.
As to your final question, it's based on a false premise, so it's difficult to answer. Rather, let's take the example above of three shards and two nodes. In that setup, one node houses 2 shards and the other houses 1. If either of those nodes goes down, your cluster goes down, because neither has a complete set of shards: the first node is missing 1 shard and the second is missing 2. This is where replicas come in. By adding replicas, you can ensure that each node ends up with a full set of shards. For example, if you added 1 replica in the above scenario, the first node would have 2 active shards and 1 replica of the third that lives on the other node. The second node would have 1 active shard and 1 replica each of the two that live on the first. As a result, if either node went down, the cluster can merely activate the replicas and continue working.
1) Yes, the number of shards is configured per index. It is a static setting and should be set while creating the index. If you want to change the number of shards at a later point in time, you have to reindex the documents again, which takes time.
2) The number of shards doesn't depend on the number of nodes in your cluster. Let's say you are a book-seller website. You have 100 books to sell. Your website has an Elasticsearch cluster with 3 nodes. You create a book index with 5 shards, each holding 20 books. 2 shards reside on node 1, 2 shards on node 2, and 1 shard on node 3. Now let's say node 2 goes down. We still have 2 shards on node 1 and 1 shard on node 3. Querying Elasticsearch will still return the data on nodes 1 and 3, i.e., data for 60 books is still available and data for 40 books is lost.
But the overall cluster status will be red, meaning the cluster is partially functional and some data is not available.
To make the system fault tolerant, you can configure replicas. By default, Elasticsearch makes one replica of each shard. So in this case, if the default configuration is not overridden, copies of the 2 shards on node 2 exist on node 1 or node 3 and become the primary shards when node 2 is unavailable. So all the data is available even when node 2 is down.
In this case the overall cluster health will be yellow, meaning the cluster is still functional but some nodes are lost.
Answer 1) Yes, you will have 10 shards for 2 indexes with 5 shards each.
Answer 2) I think you have confused shards with indexes.
Shards are the pieces an index is split into; they are not configured per node.
If you create an index with 3 shards and 1 replica,
you will get 3 primary shards and 3 replica shards.
If you start a new node, the shards will be rebalanced onto it, so you will have 3 shards on the old node and 3 shards on the new node.
If the old node fails, you will survive with the new node's data; it holds an exact copy of the old node.
To understand the basic concepts of Elasticsearch, refer to the documentation linked above.
Hope it helps!

Understanding Elastic Search

Sorry to say this but ES' documentation ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html ) is confusing me.
Thanks to the glossary I understand the terms for database, table and row but I have read substantial sections of the documentation and I cannot find answers to:
Why do I need to add number_of_shards and number_of_replicas at index creation? I did look here http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html but ironically it leaves those two settings out :/
How can I have 3 shards with 2 replicas? If the glossary is anything to go by, shouldn't that be impossible, considering that a shard "is a single Lucene instance"?
If I add more nodes later, how can I change these values to span the new nodes?
How does sharding work in ES?
How does replica sets work in ES?
How can I manage sharding? I understand it is auto join ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name ) but how do I define the difference between replicas and shards?
How can I manage replica sets? I.e. how do I add replicas, promote primaries etc?
For reference I read these links first:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html
If that information exists in the documentation then I would be very grateful if you can point me towards it.
Edit:
I am also unsure how auto-discovery works on a distributed network. Short of pinging every public network around, how does it connect to the right one, which could possibly be on the other side of the world?
Please see below for answers to your points.
Why do I need to add number_of_shards and number_of_replicas at index creation? I did look here http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html but ironically it leaves those two settings out :/
You don't "have" to, but you probably should in especially will want to in production. The default is five shards and one replication.
The number of replications defined is just the number of times your entire index is replicated throughout all of the nodes in your elasticsearch cluster. Think of it as being multiple read copies of a RDBMS database (but in this case, we read and write all copies).
A shard is the number of times I split up, or shard, an index. So, I can have an index with a single shard, or I can have an index with multiple shards. This is similar in concept to sharding a RDBMS database by primary key, but not identical.
So, the total number of shards you will have in an index is the product of number_of_shards and number_of_replicas.
When you do a search, elasticsearch will distribute your search to all possible nodes containing the shards in your index and aggregagate the result for you. You can think of this as a map/ reduce where the map is sending the search out to each shard and the reduce is collecting the results.
Also, you can change the replication number_of_replicas at any time, but you can never change the number_of_shards. This must be set at index creation.
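For example, raising the replica count on a live index is a single settings call (hypothetical index name); there is no equivalent call for number_of_shards:

curl -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 2 } }'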
How can I have 3 shards with 2 replicas? If the glossary is anything to go by, shouldn't that be impossible, considering that a shard "is a single Lucene instance"?
I think the above mostly answers this, but it's important to remember that Elasticsearch is primarily a distributed computing solution for search. We split the work across multiple shards and possibly machines.
If I add more nodes later, how can I change these values to span the new nodes?
Once the cluster is aware of another node, no other action is needed by you. The settings propagate throughout the cluster on their own. In your example of three shards and two replicas, if you had two nodes initially and added a third, the nine shards (three primaries plus six replicas) will be rebalanced across the three nodes. This shard movement happens without your intervention (again, provided the cluster is aware of the new node).
How does sharding work in ES?
See above
How does replica sets work in ES?
See above
How can I manage sharding? I understand it is auto join ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name ) but how do I define the difference between replicas and shards?
You don't have to "manage" it actively. As stated earlier, sharding and everything else you define at index creation, is propagated to new nodes within the cluster.
You define replicas and shards like this:
{
  "settings": {
    "index": {
      "number_of_shards": 20,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "some_type": {
      "properties": {
        "some_field": {
          "type": "long"
        }
      }
    }
  }
}
How can I manage replica sets? I.e. how do I add replicas, promote primaries etc?
You do that through the update indices settings API; documentation for this specific case is on their site here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html
I just noticed your edit, please see below:
I am also unsure how auto-discovery works on a distributed network.
In the YML config file you set the unicast like this:
discovery.zen.ping.multicast.enabled: false
#discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.unicast.hosts: ["ip.add.r.ess", "ip.add.r.ess"]
The middle setting is an important one, but I commented it out here. That number should always be (number of master-eligible nodes / 2) + 1. This is to avoid split-brain situations. Generally I set all nodes to be master-eligible.
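For example, with 3 master-eligible nodes: 3 / 2 + 1 = 2 (integer division), so the cluster only elects a master while at least 2 master-eligible nodes can see each other, and a lone partitioned node cannot form its own cluster.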
These settings are for unicast, which is what I think you are going for with your question and not multicast.
In short, an index is broken into shards. Shards can be replicated, meaning multiple copies of the same shard can exist in the same cluster. So if an index has 3 shards and 2 replicas, you have nine shards in total, of which six are replicas of the three primary shards.
ES will try to balance shards and their replicas across the cluster so that if a node goes down, it can fail over from the primary shards on that node to their replicas. This can confuse some people: "primary" refers to shards, not to the actual node (the master role applies to nodes). So a single node can hold a mix of replica and primary shards.
If you come from the Lucene world, a Lucene index is not the same thing as an Elasticsearch index. An Elasticsearch index is a logical group of indexed documents with types, mappings, and documents, more or less the same as a database schema. A Lucene index, on the other hand, is a group of several files that contain indexed data. When Elasticsearch creates an index, what it does is create several Lucene indexes (one per shard), and when it replicates, it is basically copying the files of these Lucene indexes around.
You can't change the number of shards of an index, but you can change the number of replicas. Typically, when you need more shards, you create a new index and reindex the data.
In terms of shard management, beyond deciding on the number of shards there's not much to manage by default; ES is pretty good at coordinating things by itself. There are a ton of options you can fiddle with once you gain a better understanding of how it works, but the defaults are pretty OK for most. In terms of cluster management, you can do a lot via the API: shutting down nodes in a controlled way, using index aliases, changing the number of replicas, etc.
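As one example of that API-driven management, an alias can be swapped atomically from an old index to a new one (names hypothetical), which pairs nicely with the reindex approach above:

curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{
  "actions": [
    { "remove": { "index": "my_index_v1", "alias": "my_index" } },
    { "add": { "index": "my_index_v2", "alias": "my_index" } }
  ]
}'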
As for autodiscovery, ES uses local-network multicast by default. You can switch to unicast, and you probably want to change the default cluster name to prevent accidents (I had some fun in coffee shops with unintended clusters forming). You probably don't want to cluster globally; I don't see that ending well.
It's quite a coincidence that about 80% of your questions are answered in the video presentation given by Shay Banon (the creator of Elasticsearch), and the presentation has much more than you can find anywhere else. Hope this helps.
http://www.infoq.com/presentations/ElasticSearch
The video is a bit low-resolution, so if you want the code shown in the presentation, follow this:
https://github.com/kimchy/talks/tree/master/2011/wsnparis
