what is the impact of shard number in elasticsearch? - elasticsearch

I am curious about the impact of #shards in Elasticsearch. Particularly, I am looking for the pros and cons for having big #shards and small #shards.
For example, I have a two-node cluster. Assuming replica is one, for one index, should I create two shards which are spread across these two nodes? Or should I use the default i.e., 5 shards? My thinking is the first one.
It seems to me there is no reason to have more than one share per node per index as one Lucene instance can have better caching than a few Lucene instances.
Edit:
Let's say I have only one node and want to create one index. How does the shard number affect the performance in this case? My thinking is I should have only one shard in such a case. Is that right?

You can find the answer in the elasticsearch documentation here

Related

How many indexes can I create in elastic search?

I am very new to elastic search and its applications, I found that elastic search saves data(indexes) onto disk. Then I wondered: Are there any limitations on number of indexes that can be created or can I create as many as I can since I have a very large disk space?
Currently I have elastic search deployed using a single node cluster with Docker. I have read something about shards and its limitation etc., but I was not able to understand it properly.
Is there anyone on SO, who can shed some light onto these questions for a newbie in layman terms?
What is a single node cluster and how does my data get saved onto disk? Also what are shards and how is it related to elastic search?
I guess the best answer is "it depends ". Generally there is no limitation for having many indexes , Every index has its own mapping and irrelevant to other indexes by default, Actually indexes are instance of Elasticsearch servers and please note that they are not data rather you may think about as entire database alone. There are many variables for answering this question for example if are planning to have replication of your shards in one index then you may found limitation due to the size of document you are planning to ingest inside the index.
As an other note you may need to ask first why I need many indexes ? for enhancing search operation or queries throughput? if it is the case then perhaps its better to use replica shards beside your primary shards in the single index because the queries are executed parallel to each other in replica shards and you can think of shards as an stand alone index inside of your main index so in conclusion I can say there is no limitation as long as you have enough free space to save new data (expanding inverted indexes table created for on field) but regarding to you needs it may be better to have primary and replica shards inside an index .

Elasticsearch maximum index count limit

Is there any limit on how many indexes we can create in elastic search?
Can 100 000 indexes be created in Elasticsearch?
I have read that, maximum of 600-1000 indices can be created. Can it be scaled?
eg: I have a number of stores, and the store has items. Each store will have its own index where its items will be indexed.
There is no limit as such, but obviously, you don't want to create too many indices(too many depends on your cluster, nodes, size of indices etc), but in general, it's not advisable as it can have a server impact on cluster functioning and performance.
Please check loggly's blog and their first point is about proper provisioning and below is important relevant text from the same blog.
ES makes it very easy to create a lot of indices and lots and lots of
shards, but it’s important to understand that each index and shard
comes at a cost. If you have too many indices or shards, the
management load alone can degrade your ES cluster performance,
potentially to the point of making it unusable. We’re focusing on
management load here, but running too many indices/shards can also
have pretty significant impacts on your indexing and search
performance.
The biggest factor we’ve found to impact management overhead is the
size of the Cluster State, which contains all of the mappings for
every index in the cluster. At one point, we had a single cluster with
a Cluster State size of over 900MB! The cluster was alive but not
usable.
Edit: Thanks #Silas, who pointed that from ES 2.X, cluster state updates are not that much costly(As the only diff is sent in update call). More info on this change can be found on this ES issue

What are the ideal number of shards count for 3 node elasticsearch setup

I wanted to know how many primary shards and replicas are ideal to a three node cluster and wanted to know the rule of thumb to set the Primary shard and replicas depending on the servers. How can we change the number of shards safely in elasticsearch cluster. (Elasticsearch has great documentation but it was hard for me to get a answer for this, my bad). Thanks in advance
The ideal number of shards and replicas will all depend on how much data is in each index and how you want to query it. My recommendation would be to leave all the shards and replicas at their default values until such time as you need to tune them. This is what ES production support is ideal for.

Understanding Elastic Search

Sorry to say this but ES' documentation ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html ) is confusing me.
Thanks to the glossary I understand the terms for database, table and row but I have read substantial sections of the documentation and I cannot find answers to:
Why do I need do to add number_of_shards and number_of_replicas to index creation? I did look here http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html but ironically it leaves those two settings out :/
How can I have 3 shards with 2 replicas? If the glossary is anything to go by shouldn't that be impossible considering that a shard is "is a single Lucene instance"?
If I add more nodes later how can I change these values to span the new nodes?
How does sharding work in ES?
How does replica sets work in ES?
How can I manage sharding? I understand it is auto join ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name ) but how do I define the difference between replicas and shards?
How can I manage replica sets? I.e. how do I add replicas, promote primaries etc?
For reference I read these links first:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/glossary.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html
If that information exists in the documentation then I would be very grateful if you can point me towards it.
Edit:
I am also unsure how auto-discovery works on a distributed network. Short if pinging every public network around how does it connect to the right one that could possibly be on the other side of the world?
Please see below for answers to your points.
Why do I need do to add number_of_shards and number_of_replicas to index creation? I did look here
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html
but ironically it leaves those two settings out :/
You don't "have" to, but you probably should in especially will want to in production. The default is five shards and one replication.
The number of replications defined is just the number of times your entire index is replicated throughout all of the nodes in your elasticsearch cluster. Think of it as being multiple read copies of a RDBMS database (but in this case, we read and write all copies).
A shard is the number of times I split up, or shard, an index. So, I can have an index with a single shard, or I can have an index with multiple shards. This is similar in concept to sharding a RDBMS database by primary key, but not identical.
So, the total number of shards you will have in an index is the product of number_of_shards and number_of_replicas.
When you do a search, elasticsearch will distribute your search to all possible nodes containing the shards in your index and aggregagate the result for you. You can think of this as a map/ reduce where the map is sending the search out to each shard and the reduce is collecting the results.
Also, you can change the replication number_of_replicas at any time, but you can never change the number_of_shards. This must be set at index creation.
How can I have 3 shards with 2 replicas? If the glossary is anything to go by shouldn't that be impossible considering that a
shard is "is a single Lucene instance"?
I think the above mostly answers this, but it's important to remember that elasticsearch is primarily a distributed computing solution to search. We are splitting the work up to multiple shards and possibly machines.
If I add more nodes later how can I change these values to span the new nodes?
Once the cluster is aware of another node in the cluster, no other action is needed by you. The settings propagate throughout the cluster on their own. In your above example of three shards and two replicas, if you had two nodes initially and added a third, each node will have on average two shards per node, this shard movement happens without your intervention (again, provided the cluster is aware of the new node)
How does sharding work in ES?
See above
How does replica sets work in ES?
See above
How can I manage sharding? I understand it is auto join ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html#cluster-name ) but how do I define the difference between replicas and shards?
You don't have to "manage" it actively. As stated earlier, sharding and everything else you define at index creation, is propagated to new nodes within the cluster.
You define replicas and shards like this:
{
"settings": {
"index": {
"number_of_shards": 20,
"number_of_replicas": 1
}
},
"mappings": {
"some_type": {
"properties": {
"some_field": {
"type": "long"
}
}
}
}
}
How can I manage replica sets? I.e. how do I add replicas, promote primaries etc?
You do that through the update indices API, documentation for this specific case is found on there site here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html
I just noticed your edit, please see below:
I am also unsure how auto-discovery works on a distributed network.
In the YML config file you set the unicast like this:
discovery.zen.ping.multicast.enabled: false
#discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.unicast.hosts: ["ip.add.r.ess", "ip.add.r.ess"]
The middle setting is an important setting, but I commented it out here. That number should always be number of (master nodes / 2) +1. This is to avoid split brain situations. Generally I set all nodes to master eligible.
These settings are for unicast, which is what I think you are going for with your question and not multicast.
In short, an index is broken into shards. Shards can be replicated, meaning multiple copies of the same shard can exist in the same cluster. So if an index has 3 shards and 2 replica's, that means you have nine shards in total of which six are replicas of the three master shards.
ES, will try to balance shards and their replica's across the cluster so that if a node goes down it can fail over from the master shards on that node to replicas. This can confuse some people: a master in elastic search refers to shards, not the actual node. So a single node can have a mix of replica's and master shards.
If you come from the lucene world, a lucene index is not the same thing as an elastic search index. An elastic search index is a logical group of indexed documents with types, mappings and documents. More or less the same as a database schema. A lucene index on the other hand is a group of several files that contains indexed data. When Elastic search creates indexes, what it does is create several lucene indexes (one for each field and shard) and when it replicates, it is basically copying the files of these lucene indices around.
You can't change the number of shards for an index but you can change the number of replicas. Typically what you do when you need to have more shards is create a new index and reindex the data.
In terms of shard management beyond deciding on the number of shards, there's not much to manage by default and ES is pretty good coordinating things by itself, There are a ton of options you can fiddle with once you gain a bit better understanding of how it works. Defaults are pretty OK for most. In terms of cluster management, you can do a lot via the API in terms of shutting down nodes in a controlled way, using index aliases, changing number of replica's, etc.
As for autodiscovery, ES uses local network multicast by default. You can switch to unicast and you probably want to change the default clustername to prevent accidents (had some fun in coffeeshops with unintended clusters forming). You probably don't want to cluster globally. I don't see that ending well.
It's a quite incident that about 80% of your questions are answered in the Video Presentation given by Shay Banon (The creater of ElastiSearch). Though this presentation has much more than you can find anywhere else. Hope this helps.
http://www.infoq.com/presentations/ElasticSearch
This video is a bit low-resolution, so if you want code shown in presentation follow this
https://github.com/kimchy/talks/tree/master/2011/wsnparis

When do you start additional Elasticsearch nodes? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I'm in the middle of attempting to replace a Solr setup with Elasticsearch. This is a new setup, which has not yet seen production, so I have lots of room to fiddle with things and get them working well.
I have very, very large amounts of data. I'm indexing some live data and holding onto it for 7 days (by using the _ttl field). I do not store any data in the index (and disabled the _source field). I expect my index to stabilize around 20 billion rows. I will be putting this data into 2-3 named indexes. Search performance so far with up to a few billion rows is totally acceptable, but indexing performance is an issue.
I am a bit confused about how ES uses shards internally. I have created two ES nodes, each with a separate data directory, each with 8 indexes and 1 replica. When I look at the cluster status, I only see one shard and one replica for each node. Doesn't each node keep multiple indexes running internally? (Checking the on-disk storage location shows that there is definitely only one Lucene index present). -- Resolved, as my index setting was not picked up properly from the config. Creating the index using the API and specifying the number of shards and replicas has now produced exactly what I would've expected to see.
Also, I tried running multiple copies of the same ES node (from the same configuration), and it recognizes that there is already a copy running and creates its own working area. These new instances of nodes also seem to only have one index on-disk. -- Now that each node is actually using multiple indices, a single node with many indices is more than sufficient to throttle the entire system, so this is a non-issue.
When do you start additional Elasticsearch nodes, for maximum indexing performance? Should I have many nodes each running with 1 index 1 replica, or fewer nodes with tons of indexes? Is there something I'm missing with my configuration in order to have single nodes doing more work?
Also: Is there any metric for knowing when an HTTP-only node is overloaded? Right now I have one node devoted to HTTP only, but aside from CPU usage, I can't tell if it's doing OK or not. When is it time to start additional HTTP nodes and split up your indexing software to point to the various nodes?
Let's clarify the terminology a little first:
Node: an Elasticsearch instance running (a java process). Usually every node runs on its own machine.
Cluster: one or more nodes with the same cluster name.
Index: more or less like a database.
Type: more or less like a database table.
Shard: effectively a lucene index. Every index is composed of one or more shards. A shard can be a primary shard (or simply shard) or a replica.
When you create an index you can specify the number of shards and number of replicas per shard. The default is 5 primary shards and 1 replica per shard. The shards are automatically evenly distributed over the cluster. A replica shard will never be allocated on the same machine where the related primary shard is.
What you see in the cluster status is weird, I'd suggest to check your index settings using the using the get settings API. Looks like you configured only one shard, but anyway you should see more shards if you have more than one index. If you need more help you can post the output that you get from elasticsearch.
How many shards and replicas you use really depends on your data, the way you access them and the number of available nodes/servers. It's best practice to overallocate shards a little in order to redistribute them in case you add more nodes to your cluster, since you can't (for now) change the number of shards once you created the index. Otherwise you can always change the number of shards if you are willing to do a complete reindex of your data.
Every additional shard comes with a cost since each shard is effectively a Lucene instance. The maximum number of shards that you can have per machine really depends on the hardware available and your data as well. Good to know that having 100 indexes with each one shard or one index with 100 shards is really the same since you'd have 100 lucene instances in both cases.
Of course at query time if you want to query a single elasticsearch index composed of 100 shards elasticsearch would need to query them all in order to get proper results (unless you used a specific routing for your documents to then query only a specific shard). This would have a performance cost.
You can easily check the state of your cluster and nodes using the Cluster Nodes Info API through which you can check a lot of useful information, all you need in order to know whether your nodes are running smoothly or not. Even easier, there are a couple of plugins to check those information through a nice user interface (which internally uses the elasticsearch APIs anyway): paramedic and bigdesk.

Resources