Elasticsearch results are inconsistent - query on multiple shards - elasticsearch

Setup
Following is my ES setup:
Using Elastic Cloud
Have 3 shards with 3 replicas
Size is 5 GB (3.2 million documents)
Problem Statement
While performing a wildcard search, it gives a different result each time. I believe that the search goes to a different shard each time and returns the fastest result first (the score is the same).
If I make my index with a single shard instead of 3 shards for 3.2 million records (5 GB), will it impact performance?
or
What is the best way to query multiple shards and get the same result every time, with a fast response time (though speed is not the priority)?
PS
I've gone through the articles below and didn't get a clear idea:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-preference.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html
Thanks in advance.
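For reference, the first linked page covers the preference parameter, which routes repeated searches to the same shard copies so that ties are broken in the same order every time. A minimal sketch, with an illustrative index name, field, and preference string:

    GET my-index/_search?preference=my-session-id
    {
      "query": {
        "wildcard": {
          "title": { "value": "elast*" }
        }
      }
    }

Any stable custom string (for example a user or session ID) works as the preference value; all searches that share it hit the same shard copies.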

Related

Any suggestions/recommendations to improve Elasticsearch search speed?

I have an Elasticsearch cluster with 1 replica, 2 nodes, and 2 shards in total. One of my indices, called 'products', is massive: it contains around 7 million records and takes around 56GB. So we are planning to split the index into 5 shards per replica (a total of 10 shards) to increase search speed.
I'm looking for a suggested number of shards and replicas to try out, and any other recommendations/tips to increase search speed on this infrastructure.
Also, there's a multi-search query that is slower than others. Hoping for some good suggestions/tips for that too.
Thank you!
Generally speaking, fewer shards are better for reading (i.e. searching); more are better for indexing.
A 56 GB index in Elasticsearch is not that big, but then it's all relative to your use case. I am not sure that increasing the shard count is worth it here without understanding your use case and setup.
Your best bet would be to use Elasticsearch Rally to run some tests with your own data and queries and then figure out which configuration works best for your use case: https://esrally.readthedocs.io/en/stable/

Elasticsearch Latency

I am using Elasticsearch's MultiSearch API to make multiple search requests at once for one of my endpoints. My understanding is that these requests are done in parallel, but my endpoint's latency increases with the number of search requests I make through the API (<50). I have two questions:
Why is this latency increase happening, and how does multisearch work behind the scenes? I am new to Elasticsearch; apologies for my lack of knowledge here.
What are some ways I can improve latency while keeping multisearch?
To provide a more comprehensive answer, it would be good to know your cluster setup.
These requests are indeed done in parallel, but your cluster still has its limits.
What I believe might be happening is that you don't have enough search threads to process that many searches in parallel, so your search thread pool starts queueing.
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
So for instance, if you issue a MultiSearch request of, say, 10 search queries where each query hits 15 shards, the whole request needs 150 search threads in total. And if other searches are running and the cluster doesn't have enough available search threads, the queries start queueing and may eventually be rejected if the queue grows too big.
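To make that concrete, here is a minimal _msearch sketch (the index and queries are illustrative); each body line fans out to every shard of its target index:

    GET products/_msearch
    {}
    { "query": { "match": { "name": "laptop" } } }
    {}
    { "query": { "match": { "name": "phone" } } }

With 2 queries against a hypothetical 15-shard index, this single request can already occupy up to 30 search threads.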
What can you do about it?
Carefully review your indices' setup: their number_of_shards and their sizes. Reducing number_of_shards means fewer search threads are required. Find a balance between number_of_shards, index size, and doc count: if there are fewer than 5M documents, keep everything in a single shard; otherwise, aim for shards of 3M-5M documents, e.g. an index with 23M documents could use 5 or 6 shards.
Scale your cluster horizontally by adding new nodes; this will add new search threads.
Tweak the default thread pool settings (this is mostly the last thing you'd do). A quick way to check whether the search thread pool is actually queueing or rejecting requests is shown below.
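A minimal sketch for watching the search thread pool per node, using the standard _cat API (the column selection is optional):

    GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected

A growing queue column means searches are waiting for threads; a non-zero rejected column means the queue has already overflowed.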

Elasticsearch index policy creation best practice/performance

I am designing a search system based on Elasticsearch. After reading a lot, I have seen that some systems, such as log pipelines, use a policy of multiple indices to store the same kind of content, e.g. mylogs-12-02-2020, creating one index per day. To search, they then query all the indices matching the mylogs-* pattern; each of those indices has its own primary shards and replicas.
My question is about search performance: which would be more performant, searching one index of 5 million documents with n shards, or searching 50 indices of 100,000 documents each? Does anyone have experience with the best practice to follow?
I am assuming that my system will grow by approximately 200,000 documents per day.
What is the best practice: separate into multiple indices, or have a single index with several primary shards on different nodes (so that they do not compete for the same resources when searching/indexing)?
When doing a search on mylogs-*, does Elasticsearch parallelize across the indices, and within each index across its shards?
The Elasticsearch default configuration mentioned by @Umar is outdated: starting with 7.0, the latest ES major version, the default number of primary shards was reduced to 1; you can check this in the official ES breaking-changes announcement.
Nobody can design the perfect ES index with the optimal number of shards and replicas up front; it requires continuous fine-tuning over time. Some factors that affect the design:
Read or Write-heavy system.
Time-based indices (like your log searches), where searches normally target the most recent logs, versus an e-commerce product catalog or website search, where you can't divide indices into time-based data.
ES cluster (multi-tenant vs. dedicated to a single index).
The above are just a few samples, and I could give 100s of other factors you might consider while designing your ES index configuration. But the idea is to start with the more crucial params first (for example, changing primary shards requires re-indexing), consider near-future growth, and fine-tune later on based on the current system's performance.
I would strongly suggest you go through my detailed blog, which answers your question (searching one index with more docs vs. searching more indices/shards with fewer docs) in detail through a real-world case study.
The above blog also explains ES's decision to change the long-standing default of 5 primary shards to 1.
Answer to your question below:
Question: When doing a search on mylogs-*, does Elasticsearch parallelize across the indices, and within each index across its shards?
Answer: Yes. ES has a distributed architecture, and since an ES index is made of Lucene shards, each of which is a full-blown search engine, every ES query is executed by multiple threads in parallel when it needs to hit multiple shards (whether of the same index or of multiple indices), provided threads are free; otherwise, once a thread finishes, it is reused to query another shard. This is why ES, like other distributed systems, is so fast.
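For illustration, a single pattern search like the sketch below fans out, in parallel, to every shard of every index matching the pattern (the query itself is illustrative):

    GET mylogs-*/_search
    {
      "query": { "match": { "message": "error" } }
    }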
By default (prior to 7.0), an Elasticsearch index has 5 primary shards and 1 replica for each. But the problem is that default configurations are not suitable for every use case.
Shard size is quite critical for search queries. If too many shards are assigned to an index, the Lucene segments will be small, which increases overhead. Lots of small shards also reduce query throughput when multiple queries are made simultaneously. On the other hand, overly large shards cause a decrease in search performance and a longer recovery time after failure. Therefore, Elasticsearch suggests that one shard's size should be around 20 to 40 GB.
Keep in mind it is the shard that acts as a separate search engine in itself, not the index. Indices are a data organization mechanism, allowing the user to partition data a certain way. That is all!
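As a sketch, the primary shard count is fixed at index creation time, so a daily log index would set it explicitly (the index name and values here are illustrative):

    PUT mylogs-2020-02-12
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }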
For further details read this article.

How to get better relevance without compromising on performance and scalability, and avoid the sharding effect of Elasticsearch

Let's suppose I have a big index consisting of 500 million docs. By default, ES creates 5 primary shards for the reasons below, and I go with the same setting.
Performance: there is less time to search in a shard with fewer documents (100 million in my use case) than in just 1 shard with a huge number of documents (500 million). It also allows distributing and parallelizing operations across shards.
Horizontal scalability (HS): horizontally split/scale your content volume.
But when we search, by default it just goes to 1 shard and gives the result. In this case, relevance isn't accurate (as IDF is majorly impacted), and it might not even return any result if my matching document is on another shard. This is called the sharding effect.
The above issue is explained in detail here, and there are the below 2 options to avoid it, but I think both solutions have some cons:
1. Document routing: in this case all the documents will be on the same shard, which defeats the whole purpose of sharding.
2. dfs_query_then_fetch search type: there is a performance cost associated with it.
I am interested to know the following:
What does ES do by default? Or is there any config by which this can be controlled?
Is there another out-of-the-box solution which ES provides to avoid the sharding effect?
First of all, this part of your question is not accurate:
"But when we search, by default it just goes to 1 shard and gives the result. In this case, relevance isn't accurate (as IDF is majorly impacted), and it might not even return any result if my matching document is on another shard. This is called the sharding effect."
The claim that the search just goes to 1 shard is false. The search request is sent to all shards (of course, or no one would use Elasticsearch!), but the score is computed on a per-shard basis. So yes, you can have an accuracy problem with multiple shards, but only if you have very few documents. With 500 million documents, accuracy will not be a problem (unless you make bad use of document routing; see here for more information).
So when you search for the top 10 results for a query, each shard returns its 10 best matches, and then the results from the shards are aggregated by the coordinating node to give the best 10 results for the whole index.
You can use 5 shards without fearing any relevancy problem. But don't try to avoid the sharding effect! It is what makes Elasticsearch so cool :D
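That said, if shard-local IDF ever does skew scores on a small index, the dfs_query_then_fetch search type from the question collects global term statistics before scoring, at the cost of an extra round-trip per query (the index and query here are illustrative):

    GET my-index/_search?search_type=dfs_query_then_fetch
    {
      "query": { "match": { "title": "elasticsearch" } }
    }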

Elasticsearch indexing with hundreds of indices

I have the following scenario:
More than 100 million items and counting (10 million added each month).
8 Elastic servers
12 Shards for our one index
Until now, all of those items were indexed in the same index (under different types). In order to improve the environment, we decided to partition items by geohash code; our mantra was: no more than 30GB per shard.
The current status is that we have more than 1500 indices, 12 shards per index, and every item is inserted into one of those indices. The number of shards has surpassed 20,000, as you can understand...
Our indices are in the format <Base_Index_Name>_<geohash>
My question is raised due to performance problems which made me question our method. A simple count query in the format GET */_count takes seconds!
If my intention is to query many indices, is this implementation bad? How many indices should a cluster with 8 virtual servers have? How many shards? We have a lot of data and it's growing fast.
Actually, it depends on your usage. A query across all of the indices takes a long time because the query has to go to all of the shards and the results have to be merged afterwards. 20K shards is not an easy task to query.
If your data is time-based, I would advise adding month or date information to the index name and changing your query to GET indexname201602/_search or GET *201602/_search.
That way you can drastically reduce the number of shards your query executes against, and it will take much less time, as the sketch below shows.
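The same narrowing applies to the count query from the question; a pattern scoped to one month (the names are illustrative) touches only that month's shards instead of all 20K:

    GET *201602/_count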
