One day we will find that one shard in our shared index is doing a lot more work than the other shards
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/one-big-user.html
How can we know which particular shard is overloaded?
/_cat/thread_pool?
/_stats?
/_status?
/_segments?
explain?
In that particular example, I guess the author means that the shard got a lot bigger than the other shards (since a more popular forum would naturally have more content). You can see shard size under _cat/shards.
Or you can look at your Analytics data and deduce that a certain shard gets more searches than other shards. As far as I know, there is no way to directly measure the load of a particular Elasticsearch shard.
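The size check mentioned above can be sketched in a few lines of Python. The `_cat/shards` output below is made up for illustration (index name, shard count, and byte sizes are not from a real cluster); the column layout matches `GET /_cat/shards?h=index,shard,prirep,store&bytes=b`.

```python
# Sample output of GET /_cat/shards?h=index,shard,prirep,store&bytes=b
# (illustrative values only, not from a real cluster)
sample = """\
forums 0 p 21474836480
forums 1 p 3221225472
forums 2 p 2147483648
"""

shards = []
for line in sample.splitlines():
    index, shard, prirep, store = line.split()
    shards.append((index, int(shard), prirep, int(store)))

# Sort by store size, descending, to spot a skewed ("hot") shard.
shards.sort(key=lambda s: s[3], reverse=True)
biggest = shards[0]
print(f"largest shard: {biggest[0]}/{biggest[1]} at {biggest[3] / 2**30:.1f} GiB")
```

In this made-up sample, shard 0 is an order of magnitude larger than its siblings, which is exactly the "one big user" skew the linked chapter describes.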
Related
I am very new to Elasticsearch and its applications. I found that Elasticsearch saves data (indexes) onto disk, and then I wondered: are there any limitations on the number of indexes that can be created, or can I create as many as I like, since I have very large disk space?
Currently I have Elasticsearch deployed as a single-node cluster with Docker. I have read something about shards and their limitations, but I was not able to understand it properly.
Is there anyone on SO who can shed some light on these questions for a newbie, in layman's terms?
What is a single-node cluster, and how does my data get saved onto disk? Also, what are shards and how do they relate to Elasticsearch?
I guess the best answer is "it depends". Generally there is no hard limit on the number of indexes: every index has its own mapping and is independent of the other indexes by default. Note that an index is not the data itself; you can think of each index as an entire database on its own. There are many variables behind this question. For example, if you are planning to have replicas of the shards in an index, you may run into limits depending on the size of the documents you plan to ingest into the index.
As another note, you may want to ask first: why do I need many indexes? To speed up search operations or increase query throughput? If that is the case, then it is probably better to use replica shards beside your primary shards in a single index, because queries are executed in parallel across replica shards, and you can think of a shard as a standalone index inside your main index. In conclusion, there is no limit as long as you have enough free space to save new data (the inverted index built per field keeps growing), but depending on your needs it may be better to have primary and replica shards inside one index.
I have designed Solr/Elasticsearch for searching, and I have a particular question. Suppose I have 10K search requests/second: will my searches hit the primary shards or the replicas? I know a replica is a backup of a shard.
If they hit the primary shards, how/why? And if they hit the replicas, how/why?
The primary shard holds the original copy of the data, while a replica shard is a copy of that data.
Indexing always happens on the original copy, i.e. the primary shard, and is then copied to the replica shards, but a search can happen on any of the copies, regardless of whether it is the original or a copy.
Hence replicas are created not only for fault tolerance, where if you lose one copy it can be recovered from another copy, but also to improve search performance: if one shard (primary or replica) is overloaded, the search happens on the least loaded copy, i.e. another replica.
Please refer to Adaptive replica selection in ES on how/why replicas improve the search latency.
Feel free to let me know if you need more information.
EDIT based on OP comment:
Since ES 7, adaptive replica selection is on by default, so a search request is sent to the least loaded replica. Even when all shards are underutilized, ES still won't send all search requests to the primary shards, to avoid overloading them. Also, before ARS (adaptive replica selection), ES used to send search requests in a round-robin fashion to avoid overloading any one shard.
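The two selection strategies contrasted above can be sketched in a few lines of Python. The copy names and load figures below are made up; this is only a toy illustration of "least loaded" vs. round-robin, not Elasticsearch's actual ARS formula (which ranks copies by response time, queue size, and service time).

```python
import itertools

# Hypothetical load figures for three copies of the same shard
# (1 primary + 2 replicas); names and numbers are invented.
copies = {"primary": 0.9, "replica-1": 0.2, "replica-2": 0.5}

# ARS-style choice: route the search to the least loaded copy.
least_loaded = min(copies, key=copies.get)

# Pre-ARS behaviour: simple round-robin over all copies,
# ignoring how busy each one currently is.
rr = itertools.cycle(copies)
first_three = [next(rr) for _ in range(3)]

print(least_loaded)   # the copy with the lowest load, here replica-1
print(first_three)    # round-robin cycles through every copy in turn
```

Note that in both schemes the primary still serves searches; the point is that it is never the *only* copy serving them.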
I tried searching for an answer to my question but I couldn't find any. This is my first time dealing with big data and Elasticsearch, and I'm trying to learn how Elasticsearch works by going through the online tutorial. While reading, I came across the topic of shrinking indices and how that can be done. OK, now I know how to do it, but unfortunately I don't know why I would need to do it.
Why do I need to shrink my index and decrease my shards? Is it a space-related change, or what?
Every Elasticsearch index consists of multiple shards (default 5), each of which is a Lucene index. Each one of these has an overhead (in terms of memory, file handles, ...) but allows more parallelization. In case you don't need that much parallelization any more at some point (think of a daily index for logs: after a few days there won't be any more writes and only a few reads), you might want to reduce the number of shards to cut down on their overhead.
The number of shards is tied to query performance in the following way:
How does shard size affect performance?
In Elasticsearch, each query is executed in a single thread per shard. Multiple shards can however be processed in parallel, as can multiple queries and aggregations against the same shard.
This means that the minimum query latency, when no caching is involved, will depend on the data, the type of query, as well as the size of the shard. Querying lots of small shards will make the processing per shard faster, but as many more tasks need to be queued up and processed in sequence, it is not necessarily going to be faster than querying a smaller number of larger shards. Having lots of small shards can also reduce the query throughput if there are multiple concurrent queries.
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
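For reference, the shrink operation itself is a two-step flow against real Elasticsearch settings and the `_shrink` API; the sketch below shows only the JSON request bodies (index name `logs-old`, target `logs-old-shrunk`, and node name `shrink-node` are placeholders I made up).

```python
import json

# Step 1: PUT /logs-old/_settings — block writes and require a copy of
# every shard on one node, both of which the _shrink API needs.
prepare = {
    "settings": {
        "index.blocks.write": True,
        "index.routing.allocation.require._name": "shrink-node",
    }
}

# Step 2: POST /logs-old/_shrink/logs-old-shrunk — the target shard
# count must divide evenly into the source's count (e.g. 5 -> 1).
shrink = {
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1,
    }
}

print(json.dumps(prepare, indent=2))
print(json.dumps(shrink, indent=2))
```

This matches the "cut down on per-shard overhead" motivation above: the old daily index keeps all its data, but afterwards it costs the cluster one shard's overhead instead of five.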
I am curious about the impact of the number of shards in Elasticsearch. In particular, I am looking for the pros and cons of having a large vs. a small number of shards.
For example, I have a two-node cluster. Assuming one replica, for one index, should I create two shards which are spread across these two nodes? Or should I use the default, i.e. 5 shards? My thinking is the first.
It seems to me there is no reason to have more than one shard per node per index, as one Lucene instance can cache better than several Lucene instances.
Edit:
Let's say I have only one node and want to create one index. How does the shard number affect performance in this case? My thinking is that I should have only one shard in such a case. Is that right?
You can find the answer in the elasticsearch documentation here
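To get a feel for the trade-off the previous answers describe, here is a deliberately crude latency model. It is my own simplification, not an Elasticsearch formula: one thread per shard, shards processed in parallel up to a fixed search-thread-pool size, plus a small fixed coordination overhead per shard; all numbers are invented.

```python
def query_latency(total_data_gb, num_shards, search_threads=4,
                  gb_per_second=1.0, per_shard_overhead_s=0.02):
    """Toy model: each shard is scanned single-threaded; shards run in
    parallel waves limited by the thread pool; every shard adds a
    fixed overhead. All parameters are illustrative, not measured."""
    per_shard_time = (total_data_gb / num_shards) / gb_per_second + per_shard_overhead_s
    waves = -(-num_shards // search_threads)  # ceiling division
    return waves * per_shard_time

for shards in (1, 4, 16, 64):
    print(shards, round(query_latency(100, shards), 3))
```

Even in this toy model, going from 1 shard to 4 (the thread-pool size) helps a lot, while going from 4 to 64 barely helps and the per-shard overhead starts to accumulate, which is consistent with the single-node intuition in the question: very many shards on one node mostly queue behind each other.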
I'm new to ElasticSearch.
Let's suppose I have 10,000 documents. The relevant field in the documents is such that, after getting indexed, most of them would end up in a single shard.
Would Elasticsearch rebalance this "skewed" distribution for better load balancing?
If I got your question right, the short answer is no, the documents will not be relocated. Shard selection is based on a modulo-like distribution, and it is used for indexing as well as for retrieval.
So, if (theoretically) ES were to rebalance such docs, you would be unable to retrieve them with your routing key, as it would lead to the original shard (which would be empty in this theoretical case).
The "distribution" part of the docs is a nice place for further reading.
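The modulo-style routing can be sketched as follows. The real formula is `shard = hash(_routing) % number_of_primary_shards` with a murmur3 hash; the toy hash below is a reproducible stand-in, and the document id is made up.

```python
def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    # Deterministic stand-in hash so the example is reproducible;
    # Elasticsearch actually uses murmur3 on the routing value.
    h = sum(ord(c) for c in routing_key)
    return h % num_primary_shards

doc_id = "user-42"
# The same key always maps to the same shard for a fixed shard count:
assert pick_shard(doc_id, 5) == pick_shard(doc_id, 5)
# ...but changing the shard count generally changes the target shard.
print(pick_shard(doc_id, 5))
print(pick_shard(doc_id, 6))
```

This is why documents can't simply be moved to a less loaded shard, and also why the number of primary shards is fixed at index creation: a get by id recomputes the same formula, so the document must still live where the formula says it does.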
I don't exactly understand what you mean by "the relevant field in the documents is such that after getting indexed most of them would end up in a single shard".
From what I understand, Elasticsearch automatically balances the shards between all the nodes in your setup to be as effective as possible.
A document is indexed on one shard together with all its fields. The same document cannot have some fields on node 1 and some other fields on node 2.