I have an Elasticsearch cluster with 2 nodes, 1 replica, and 2 shards in total. One of my indices, called 'products', is massive: it contains around 7 million records and takes up around 56 GB. We are planning to split the index into 5 primary shards, each with a replica (a total of 10 shards), to increase search speed.
I'm looking for a suggested number of shards and replicas to try out, as well as any other recommendations/tips to increase search speed on this infrastructure.
Also, there's a multi-search query that is slower than the others. I'm hoping for some good suggestions/tips for that too.
Thank you!
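For reference, here is a rough sketch of one way such a resharding could be done, since number_of_shards cannot be changed on an existing index: create a new index with 5 primaries and reindex into it. This is only a sketch, assuming Python with the requests library, an unauthenticated cluster at localhost:9200, and a placeholder target name products_v2.

import requests

ES = "http://localhost:9200"  # assumption: cluster reachable here without auth

# Create the target index with 5 primary shards and 1 replica (10 shards total).
requests.put(
    f"{ES}/products_v2",  # 'products_v2' is a hypothetical name
    json={"settings": {"number_of_shards": 5, "number_of_replicas": 1}},
).raise_for_status()

# Copy the documents across; for ~7M docs this takes a while, so ask for a
# task id instead of blocking, and poll it later via GET /_tasks/<task-id>.
resp = requests.post(
    f"{ES}/_reindex?wait_for_completion=false",
    json={"source": {"index": "products"}, "dest": {"index": "products_v2"}},
)
resp.raise_for_status()
print(resp.json())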
Generally speaking, fewer shards are better for reading (i.e. searching), while more shards are better for indexing.
A 56 GB index in Elasticsearch is not that big, but then it's all relative to your use case. I'm not sure that increasing the shard count is worth it here without understanding your use case and setup.
Your best bet would be to use Elasticsearch Rally to do some testing with your data and your queries, and then figure out which configuration works best for your use case - https://esrally.readthedocs.io/en/stable/
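As a starting point before any benchmarking, it can help to see how the existing 56 GB is currently spread across shards via the _cat/shards API. A minimal sketch, assuming Python with requests and an unauthenticated cluster at localhost:9200:

import requests

ES = "http://localhost:9200"  # assumption: cluster reachable here without auth

# Per-shard doc count, on-disk size, and hosting node for the 'products' index.
resp = requests.get(
    f"{ES}/_cat/shards/products",
    params={"v": "true", "h": "index,shard,prirep,docs,store,node"},
)
print(resp.text)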
I know that shard count and size have a significant impact on search performance (speed) and cluster recovery.
Does the total shard count impact search speed? To simplify: assume I have 5 indices with 5 primary shards each, I search in index1 only, and it returns a response in 500 ms. Will it still be 500 ms if I add 5 more indices? I know the recovery time would increase, but I'm not sure about the search performance of a specific index.
Any help would be highly appreciated.
Common sense would imply that searching over more data takes longer; however, it's impossible to answer without also knowing:
the number of nodes (more nodes can parallelize searches across several shards)
their hardware specs (RAM and CPU play a role in how many concurrent searches can happen on a single node - a quick way to check both is sketched after this answer)
whether any write operations happen at the same time (taking resources away from search threads)
etc...
The best you can do is to actually create a test case (using e.g. Rally) and test this on your own infrastructure.
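For the first two factors, the _cat/nodes API gives a quick overview of how many nodes there are and how loaded they are. A minimal sketch, assuming Python with requests and an unauthenticated cluster at localhost:9200:

import requests

ES = "http://localhost:9200"  # assumption: cluster reachable here without auth

# One row per node: roles, heap and RAM usage, and current CPU load.
resp = requests.get(
    f"{ES}/_cat/nodes",
    params={"v": "true", "h": "name,node.role,heap.percent,ram.percent,cpu"},
)
print(resp.text)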
I am using Elasticsearch's Multi Search API to make multiple search requests at once for one of my endpoints. My understanding is that these requests are run in parallel, but my endpoint's latency increases with the number of search requests I make through the API (<50). I have two questions:
Why is this latency increase happening, and how does multi-search work behind the scenes? I am new to Elasticsearch; apologies for my lack of knowledge here.
What are some ways I can improve latency while keeping multi-search?
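For context, this is roughly what such a Multi Search call looks like against the _msearch endpoint (a sketch assuming Python with requests, an unauthenticated cluster at localhost:9200, and placeholder index names and queries):

import json
import requests

ES = "http://localhost:9200"  # assumption: cluster reachable here without auth

# NDJSON body: a header line (target index) followed by a query line, repeated
# once per sub-search. 'products' and 'orders' are placeholder index names.
lines = [
    {"index": "products"},
    {"query": {"match": {"name": "laptop"}}, "size": 10},
    {"index": "orders"},
    {"query": {"term": {"status": "shipped"}}, "size": 10},
]
body = "\n".join(json.dumps(l) for l in lines) + "\n"

resp = requests.post(
    f"{ES}/_msearch",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
)
# One response object per sub-search, in the order they were sent.
for item in resp.json()["responses"]:
    print(item.get("hits", {}).get("total"))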
To provide a more comprehensive answer, it would be good to know your cluster setup.
These requests are indeed done in parallel, but your cluster still has its limits.
What I believe might be happening is that you don't have enough search threads to process that many searches in parallel, so your search thread pool starts queueing.
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
So, for instance, if you issue a multi-search request of, say, 10 search queries where each query hits 15 shards, the whole request needs 150 search threads in total. And if other searches are running and the cluster has no search threads available, the requests start queueing and may eventually be rejected if the queue grows too big.
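One way to confirm that theory is to watch the search thread pool while the endpoint is under load; non-zero queue or rejected counts are the symptom described above. A minimal sketch, assuming Python with requests and an unauthenticated cluster at localhost:9200:

import requests

ES = "http://localhost:9200"  # assumption: cluster reachable here without auth

# Active, queued, and rejected tasks in the 'search' thread pool, per node.
resp = requests.get(
    f"{ES}/_cat/thread_pool/search",
    params={"v": "true", "h": "node_name,name,active,queue,rejected"},
)
print(resp.text)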
What can you do about it?
Carefully review your index setup: the number_of_shards setting and the index sizes. Reducing number_of_shards will require fewer search threads. Find a balance between number_of_shards, index size, and document count. If there are fewer than 5M documents, keep everything in a single shard; otherwise, aim for shards of 3M-5M documents, e.g. an index with 23M documents could use 5 or 6 shards.
Scale your cluster horizontally by adding new nodes; this will add new search threads.
Tweak the default thread pool settings (this should usually be the last thing you do).
Setup
Following is my ES setup.
Using Elastic Cloud
Have 3 shards with 3 replicas
Size is 5 GB (3.2 million documents)
Problem Statement
While performing a wildcard search, I get a different result each time. I believe that the search is going to different shards and returning the fastest result first (the scores are the same).
If I make my index a single shard instead of 3 shards for 3.2 million records (5 GB), will it impact performance?
or
What is the best way to query multiple shards and get the same result every time, ideally with a faster response time (though that is not the priority)?
PS
I've gone through the articles below and didn't get a clear idea:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-preference.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html
Thanks in advance.
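As a sketch of the preference approach from the first linked article: sending a stable custom preference string routes repeated searches to the same shard copies, which keeps the result order consistent. This assumes Python with requests, an unauthenticated cluster at localhost:9200, and placeholder index/field names:

import requests

ES = "http://localhost:9200"  # assumption: cluster reachable here without auth

# Pin 'preference' to a stable string (e.g. per user or session) so the same
# shard copies answer every repeat of this query.
resp = requests.post(
    f"{ES}/my-index/_search",           # 'my-index' is a placeholder
    params={"preference": "user-123"},  # any stable custom string
    json={"query": {"wildcard": {"title": {"value": "elast*"}}}},
)
print(resp.json()["hits"]["total"])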
I am designing a search system based on Elasticsearch. After reading a lot, I have seen that some systems, such as log systems, use a policy of multiple indices to store the same kind of content, e.g. mylogs-12-02-2020, creating one index per day. To search, they then query all the indices that match the mylogs-* pattern, and each of those indices has its own primary shards and replicas.
My question is about search performance: which is more performant, searching one index of 5 million documents with n shards, or searching 50 indices of 100,000 documents each? Does anyone have experience with the best practice to follow?
I am assuming that my system will grow by approximately 200,000 documents per day.
What is the best practice: separate into multiple indices, or have a single index with several primary shards on different nodes (so that they do not compete for the same resources when searching/indexing)?
When searching on mylogs-*, does Elasticsearch parallelize across the indices, and within each index across its shards?
The Elasticsearch default configuration given by #Umar is outdated: starting with 7.0, the latest major version of ES, the default number of primary shards was reduced to 1. You can check this in the official ES breaking changes announcement.
Nobody can design the perfect ES index with the optimal number of shards and replicas up front; it requires continuous fine-tuning over time. Some factors that affect the design consideration:
Read-heavy or write-heavy system.
Time-based indices (like your log searches), where searches normally hit the more recent logs, versus an e-commerce product catalog or website search, where you can't divide the indices into time-based data.
ES cluster topology (multi-tenant vs. dedicated to a single index).
These are just a few samples; I could give 100s of other factors to consider while designing your ES index configuration. The idea is to start with the more crucial parameters first (for example, changing primary shards requires re-indexing), also consider near-future growth, and fine-tune later based on current system performance.
I would strongly suggest you go through my detailed blog, which answers your question (searching in one index with more docs versus searching in more indices/shards with fewer docs) in detail through a real-world case study.
That blog also explains the ES decision to change the long-time default of 5 primary shards to 1.
Answer to your question below:
Question: When searching on mylogs-*, does Elasticsearch parallelize across the indices, and within each index across its shards?
Answer: Yes. ES has a distributed architecture, and an ES index is made up of Lucene shards, each of which is a full-blown search engine in itself. Every ES query that needs to hit multiple shards (whether of the same index or of multiple indices) is executed by multiple threads in parallel, as long as threads are free; otherwise, as each thread finishes, it is reused to query another shard. This is why ES, like other distributed systems, is so much faster.
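A sketch of what that looks like from the client side: a single request against the pattern, which Elasticsearch expands to all matching indices and fans out to their shards in parallel (assuming Python with requests, an unauthenticated cluster at localhost:9200, and a @timestamp field):

import requests

ES = "http://localhost:9200"  # assumption: cluster reachable here without auth

# One request against the pattern; ES expands it to every matching daily index,
# queries their shards in parallel, and merges the results.
resp = requests.post(
    f"{ES}/mylogs-*/_search",
    json={
        "size": 20,
        "query": {"range": {"@timestamp": {"gte": "now-1d/d"}}},  # field name assumed
    },
)
print(resp.json()["hits"]["total"])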
By default, an Elasticsearch index has 5 primary shards and 1 replica for each (prior to version 7.0; since then the default is 1 primary shard). But the problem is that the default configuration is not suitable for every use case.
Shard size is quite critical for search queries. If too many shards are assigned to an index, the Lucene segments will be small, which increases overhead. Lots of small shards also reduce query throughput when multiple queries are made simultaneously. On the other hand, shards that are too large cause decreased search performance and a longer recovery time after failure. Therefore, Elasticsearch suggests that one shard's size should be around 20 to 40 GB.
Keep in mind that it is the shard that acts as a separate search engine in itself, not the index. Indices are a data organization mechanism, allowing the user to partition data a certain way. That is all!
For further details read this article.
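To pin the shard count instead of relying on the default, an index template can set it for every new daily index. A sketch using the composable index template API (Elasticsearch 7.8+), assuming Python with requests and an unauthenticated cluster at localhost:9200; 1 primary and 1 replica are illustrative values to be tuned against the 20-40 GB guidance above:

import requests

ES = "http://localhost:9200"  # assumption: cluster reachable here without auth

# Every index whose name matches mylogs-* is created with these settings.
# The shard/replica counts are illustrative, not a recommendation.
resp = requests.put(
    f"{ES}/_index_template/mylogs",
    json={
        "index_patterns": ["mylogs-*"],
        "template": {
            "settings": {"number_of_shards": 1, "number_of_replicas": 1}
        },
    },
)
resp.raise_for_status()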
Generally speaking, what are the trade-offs (in terms of performance and memory usage) between large and small indices in Elasticsearch?
Elaborating a little:
Consider a cluster with 8 nodes, each node holding 1 shard and with 30 GB allocated to the JVM.
Consider also a scenario with 50 million documents per day (all with the same structure and using doc values), retained for 90 days. Each day of documents takes about 35 GB on disk.
I want to run some queries on this cluster covering a total of 12 hours of data.
These queries are composed of some nested aggregations: a date histogram, followed by a cardinality and a percentile aggregation.
Considering the amount of data, which is better: daily indices or a single index?
PS: I know this is a "vague" question. My question is more theoretical.
I want to better understand what happens during an aggregation and how this relates to the number of indices.
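For concreteness, this is roughly the shape of the query described above: a 12-hour filter, a date_histogram, and a cardinality plus a percentiles aggregation nested inside each bucket. A sketch assuming Python with requests, an unauthenticated cluster at localhost:9200, and placeholder index and field names:

import requests

ES = "http://localhost:9200"  # assumption: cluster reachable here without auth

# size: 0 skips the hits and returns only the aggregations.
body = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-12h"}}},
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
            "aggs": {
                "unique_users": {"cardinality": {"field": "user_id"}},
                "latency_pcts": {
                    "percentiles": {"field": "latency_ms", "percents": [50, 95, 99]}
                },
            },
        }
    },
}

resp = requests.post(f"{ES}/events-*/_search", json=body)  # index pattern is a placeholder
print(resp.json()["aggregations"]["per_hour"]["buckets"][:2])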