How to get better relevance without compromising performance and scalability, and avoid the sharding effect of Elasticsearch

Suppose I have a big index consisting of 500 million docs. By default, ES creates 5 primary shards for the reasons below, and I go with the same setting.
Performance: Searching a shard with fewer documents (100 million in my use case) takes less time than searching a single shard with a huge number of documents (500 million). Sharding also allows operations to be distributed and parallelized across shards.
Horizontal scalability (HS): You can horizontally split/scale your content volume.
But when we search, by default it just goes to 1 shard and gives the result. In this case, relevance isn't accurate (as IDF would be majorly impacted), and it might not even give any result if my matching document is on another shard. This is called the sharding effect.
The above issue is explained in detail here, and there are 2 options to avoid it, but I think both solutions have some cons:
1. Document routing: In this case all the documents will be on the same shard, which defeats the whole purpose of sharding.
2. dfs_query_then_fetch search type: There is a performance cost associated with it.
I am interested to know the following:
What does ES do by default? Is there any config by which it can be controlled?
Is there any other out-of-the-box solution that ES provides to avoid the sharding effect?

First of all, this part of your question is not accurate:
But when we search by default it just goes to 1 shard and gives the
result. in this case, relevance isn't accurate(as idf be majorly
impacted) and also it might even not give any result if my matched
document is on another shard. and its called as The Sharding Effect.
The claim that a search goes to just 1 shard by default is false. The search request is sent to all shards (of course, or no one would use Elasticsearch!), but the score is computed on a per-shard basis. So yes, you can have an accuracy problem with multiple shards, but only if you have very few documents. With 500 million, accuracy will not be a problem (unless you make bad use of document routing; see here for more information).
So when you search for 10 results for a query, each shard returns its 10 best matches, and the coordinating node then merges the results from all shards to give the best 10 results for the whole index.
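The merge step described above can be sketched in a few lines. This is only an illustration of the idea, not Elasticsearch's implementation: the function names `shard_top_k` and `coordinate`, the doc ids, and the scores are all made up for the example.

```python
import heapq

def shard_top_k(shard_hits, k):
    """Return the k highest-scoring (score, doc_id) pairs of one shard."""
    return heapq.nlargest(k, shard_hits, key=lambda hit: hit[0])

def coordinate(shards, k):
    """Collect each shard's local top-k, then keep the global top-k."""
    candidates = []
    for shard_hits in shards:
        candidates.extend(shard_top_k(shard_hits, k))
    return heapq.nlargest(k, candidates, key=lambda hit: hit[0])

# Hypothetical per-shard hits as (score, doc_id) pairs.
shards = [
    [(0.9, "a1"), (0.2, "a2"), (0.7, "a3")],
    [(0.8, "b1"), (0.1, "b2")],
    [(0.95, "c1"), (0.85, "c2"), (0.3, "c3")],
]

top = coordinate(shards, k=3)
```

Because every shard hands over its own best k hits, the merged list is the same as if all hits had been sorted in one place, as long as the scores themselves are comparable across shards.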
You can use 5 shards without fearing any relevancy problem. But don't try to avoid the sharding effect! It is what makes Elasticsearch so cool :D

Related

Elastic search results are inconsistent - query on multiple shards

Setup
Following is my ES setup.
Using Elastic Cloud
3 shards with 3 replicas
Size is 5 GB (3.2 million documents)
Problem Statement
While performing a wildcard search, it gives a different result each time. I believe the search is going to different shards and returning the fastest result first (the score is the same).
If I build my index with a single shard instead of 3 shards for 3.2 million records (5 GB), will it impact performance?
or
What is the best other way to query multiple shards and get the same result every time? A faster response time would be nice but is not the priority.
PS
I've gone through the articles below, but I didn't get a clear idea.
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-preference.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html
Thanks in advance.

Elasticsearch index policy creation best practice/performance

I am designing a search system based on Elasticsearch. After reading a lot, I have seen that some systems, such as log systems, use a policy of multiple indices to store the same kind of content: they create one index per day, named like mylogs-12-02-2020, and then search across all indices that match the mylogs-* pattern. Each of those indices has its own primary shards and replicas.
My question is about search performance: which would be more performant, searching one index of 5 million documents with n shards, or searching 50 indices of 100,000 documents each? Does anyone have experience with the best practice to follow?
I am assuming that my system will have an approximate growth of 200,000 documents per day.
What is the best practice, separate in multiple indexes or have a single index with several primary shards in different nodes (so that they do not compete for the same resources when searching / indexing)?
When searching on mylogs-*, does Elasticsearch run the search in parallel across the indices and, within each index, across its shards?
The Elasticsearch default configuration given by @Umar is outdated: starting with the 7.0 major version, the default number of primary shards was reduced to 1. You can check this in the official ES breaking changes announcement.
Nobody can design the perfect ES index with the optimal number of shards and replicas up front; it requires continuous fine-tuning over time. Some factors that affect the design:
Read or Write-heavy system.
Time-based indices (like your log searches), where searches normally hit the more recent logs, versus an e-commerce product catalog or website search, where you can't divide the indices into time-based data.
ES cluster (multi-tenant vs. dedicated to a single index).
The above are just a few samples; there are hundreds of other factors you could consider while designing your ES index configuration. The idea is to start with the more crucial params first (for example, changing the number of primary shards requires re-indexing), account for near-future growth, and fine-tune later based on current system performance.
I would strongly suggest you go through my detailed blog, which answers your question (searching in one index with more docs vs. searching in more indices/shards with fewer docs) in detail through a real-world case study.
The above blog also explains the ES decision to change the longtime default primary shards from 5 to 1.
Answer to your below question:
Question: When searching on mylogs-*, does Elasticsearch run the search in parallel across the indices and, within each index, across its shards?
Answer: Yes. ES has a distributed architecture, and an ES index is made of Lucene shards, each of which is a full-blown search engine. Every ES query that needs to hit multiple shards (whether of the same index or of multiple indices) is executed by multiple threads in parallel, provided threads are free; otherwise, once a thread finishes, it is reused to query another shard. This is why ES, like other distributed systems, is fast.
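As a rough mental model of the answer above (not Elasticsearch's actual code), the per-shard fan-out can be sketched with a small thread pool. The shard names, the documents, the pool size of 2, and the function `search_shard` are assumptions made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard):
    """Stand-in for a per-shard search: count docs containing the term."""
    name, docs = shard
    return name, sum(1 for doc in docs if "error" in doc)

# Hypothetical shards belonging to two daily indices.
shards = [
    ("mylogs-2020-02-11[0]", ["error in db", "ok"]),
    ("mylogs-2020-02-11[1]", ["ok", "ok"]),
    ("mylogs-2020-02-12[0]", ["error", "error again", "ok"]),
]

# Only 2 worker threads for 3 shards: the third shard is searched
# as soon as one of the threads becomes free.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_shard = list(pool.map(search_shard, shards))

total_hits = sum(count for _, count in per_shard)
```

The point of the sketch is that the unit of work is the shard, regardless of which index it belongs to.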
By default, an Elasticsearch index has 5 primary shards and 1 replica for each. But the problem is default configurations are not suitable for every use case.
Shard size is quite critical for search queries. If too many shards are assigned to an index, the Lucene segments will be small, which increases overhead. Lots of small shards also reduce query throughput when multiple queries are made simultaneously. On the other hand, shards that are too large decrease search performance and lengthen recovery time after a failure. Therefore, Elasticsearch suggests that a shard's size should be around 20 to 40 GB.
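To turn that size guideline into a starting number, a back-of-the-envelope helper like the one below can be used. It is only a sketch: the function name `suggested_primary_shards` and the 30 GB default target (the middle of the 20 to 40 GB range) are assumptions, and real sizing also depends on query load and growth.

```python
import math

def suggested_primary_shards(index_size_gb, target_shard_gb=30.0):
    """Estimate a primary shard count so each shard is near the target size."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    # Never go below one shard, even for tiny indices.
    return max(1, math.ceil(index_size_gb / target_shard_gb))

small_index = suggested_primary_shards(5)           # a 5 GB index
large_index = suggested_primary_shards(300)         # a 300 GB index
large_shards = suggested_primary_shards(100, 40.0)  # 100 GB with 40 GB shards
```

By this estimate a 5 GB index fits comfortably in a single primary shard.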
Keep in mind that it is the shard that acts as a separate search engine in itself, not the index. Indices are a data organization mechanism that lets the user partition data a certain way. That is all!
For further details read this article.

How to make the Elasticsearch coordinating node skip the merge and re-sort

For example, the ES cluster has 3 shards, and a query wants to get 300 docs.
Normally, the coordinating node will get 300 docs from each shard, that's 3*300=900 docs in total; then the coordinating node sorts these 900 docs and returns the top 300.
How can I set up the query so that the coordinating node gets 100 docs from each shard and returns 3*100=300 docs?
I'm curious why you would like every single shard to return only an equally sized share/slice of the resulting hits, as it is very unlikely that the 300 most relevant/important hits are evenly distributed across all shards.
The coordinating node's task is not just to return 300 hits, but the 300 most relevant/important hits. By default, hits are sorted by descending score (unless you specify a different sort criterion). Statically taking 100 hits from every single shard could result in a totally meaningless result list.
An example: for simplicity, assume that your index is made up of only 2 (primary) shards and contains documents about mobile phone news from early 2007. It's very likely that you have many documents in your index about Windows, Nokia and Blackberry phones. Then, all of a sudden, the iPhone gets announced and articles start popping up. Let's further assume that shortly after the presentation of the phone, 100 very relevant articles about iPhones have been published and indexed in your Elasticsearch index, and now you are querying for the best 100 hits about iPhones. With coordinating nodes "optimized" the way you are asking for, the first fifty documents would get retrieved from each shard. As a consequence, it is very likely that you would end up with only something like 60-70 of the relevant articles in your result set, while the other 30-40 very relevant hits are missing (and even worse, 30-40 rather irrelevant articles made it in just because they mentioned the term iPhone once).
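The iPhone example can be simulated with made-up numbers. In this sketch the split of 80 relevant articles on one shard and 20 on the other, the scores, and all function names are assumptions chosen for illustration; only the comparison between an equal per-shard slice and a global merge matters.

```python
import heapq

def make_shard(name, relevant, filler):
    """Build hits as (score, doc_id, is_relevant); relevant docs score high."""
    hits = [(10.0 + i * 0.01, f"{name}-rel-{i}", True) for i in range(relevant)]
    hits += [(0.5 + i * 0.001, f"{name}-fill-{i}", False) for i in range(filler)]
    return hits

def top(hits, k):
    return heapq.nlargest(k, hits, key=lambda hit: hit[0])

def equal_slice(shards, k):
    """The requested behaviour: take k / number_of_shards hits per shard."""
    per_shard = k // len(shards)
    return [hit for shard in shards for hit in top(shard, per_shard)]

def global_merge(shards, k):
    """The default behaviour: top-k per shard, then top-k overall."""
    candidates = [hit for shard in shards for hit in top(shard, k)]
    return top(candidates, k)

def relevant_count(hits):
    return sum(1 for _, _, is_relevant in hits if is_relevant)

# The 100 relevant articles are unevenly spread: 80 on one shard, 20 on the other.
shards = [make_shard("s0", 80, 200), make_shard("s1", 20, 200)]

sliced = relevant_count(equal_slice(shards, 100))
merged = relevant_count(global_merge(shards, 100))
```

With this particular split, the equal slice returns 70 relevant articles plus 30 filler documents, while the global merge returns all 100 relevant ones.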
Actually, the coordinating nodes are also "smart" and under certain conditions can skip shards when it's guaranteed that they don't contain any matching document.
Furthermore, if you don't deal with big data and all your documents easily fit into a single shard, configure your index to be made up of 1 shard, and the coordinating node does not need to do any merging.
If your use case does not rely on relevancy at all, you could think of organizing your data in different indices (rather than multiple shards within an index). Then you can query every single index independently for the first n hits and merge the results on the application side. But as this involves more network roundtrips, it might end up being even slower.

Advice on efficient ElasticSearch document design

I'm working on a project that deals with listings (think: Craigslist, eBay, Trulia, etc.).
The basic unit of information is a "Listing", something like this:
{
    "id": 1,
    "title": "Awesome apartment!",
    "price": 1000000,
    // other stuff
}
Some fields can be searched upon (e.g. price, location, etc.), others are just for display purposes in the application (e.g. title, or description, which contains lots of HTML).
My question is: should I store all the data in one document, or split it into two (one for searching, e.g. 'ListingSearchIndex', and one for display, e.g. 'ListingIndex')?
I also have to do some pretty hefty aggregations across the documents too.
I guess the question is, would searching across smaller documents and then doing another call to fetch the results by id be faster than just searching across the full document?
The main factor is obviously speed, but if I split the documents then maintenance would be a factor too.
Any suggestions on best practices?
Thanks :)
In my experience with Elasticsearch, shard configuration has had a significant effect on cluster performance/speed when querying, aggregating, etc. Since every shard by itself consumes cluster resources (memory/CPU) and adds to cluster overhead, it is important to get the shard count right so the cluster is not overloaded. Our cluster was over-sharded, and that slowed down loading search results, visualizations, heavy aggregations, etc. Once we fixed our shard count it worked flawlessly!
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
Aim to keep the average shard size between a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
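The rule of thumb in the quote is simple arithmetic, sketched below. The function names `shard_budget` and `is_over_sharded` are hypothetical; the 20 and 25 shards per GB of heap come straight from the quoted text and are guidance, not limits enforced by Elasticsearch.

```python
def shard_budget(heap_gb, low_per_gb=20, high_per_gb=25):
    """Return the (lower, upper) rule-of-thumb shard ceiling for one node."""
    if heap_gb <= 0:
        raise ValueError("heap_gb must be positive")
    return int(heap_gb * low_per_gb), int(heap_gb * high_per_gb)

def is_over_sharded(shard_count, heap_gb):
    """True when a node holds more shards than the upper rule-of-thumb bound."""
    _, upper = shard_budget(heap_gb)
    return shard_count > upper

# The quoted example: a node with a 30 GB heap.
budget = shard_budget(30)
```

For the 30 GB heap of the quoted example this gives the same 600 to 750 range, and a check like `is_over_sharded(1000, 30)` flags a node that is well past it.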
Besides performance, I think there are other aspects to consider here.
ElasticSearch offers weaker guarantees in terms of correctness and robustness than other databases (on this topic see their blog post ElasticSearch as a NoSQL database). Its focus is on search, and search performance.
For those reasons, as they mention in the blog post above:
Elasticsearch is commonly used in addition to another database
One way to go about following that pattern:
Store your data in a primary database (e.g. a relational DB)
Index only what you need for your search and aggregations, and to link search results back to items in your primary DB
Get what you need from the primary DB before displaying - i.e. the data for display should mostly come from the primary DB.
The gist of this approach is to not treat ElasticSearch as a source of truth; and instead have another source of truth that you index data from.
Another advantage of doing things that way is that you can easily reindex from your primary DB when you change your index mapping for a new search use case (or on changing index-time processing like analyzers etc...).
I think you can't answer this question without knowing all your queries in advance. For example, suppose you split the data into two documents and later decide that you need to filter on a field stored in one index and sort by a field stored in the other. This will be a big problem!
So my advice: if you are not sure where you are heading, just put everything in one index. You can reindex and remodel later.

Primary/Replica Inconsistent Scoring

We have a cluster with 3 primary shards and 2 replicas per primary. The total doc count is the same for the primary/replica shards; however, we're getting 3 distinct scores for the same query/document. When we add preference = primary as a query parameter, we get consistent scores each time.
The only explanation we can think of is different DF counts between the primary/replicas. Where is the inconsistency between the primary/replica shards, and how does one go about fixing this? We're using 1.4.2.
EDIT:
We just reindexed the doctype we were querying, but there's still inconsistent scoring.
Primary and replica shards follow a different "path" when it comes to segment merging, meaning the number and size of the segments can differ between them. Each shard takes care of its own segments independently of other shards.
This matters for score calculation because merging is the moment when deleted documents are actually removed. Until then, deleted documents are only marked as deleted (and are taken out of the query results after the query has already run). So they can influence the algorithm by which the score is calculated.
To be more specific: the total number of docs in a shard (numDocs) is used for the [IDF calculation](http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#idf(long, long)), together with the document frequency (docFreq):
return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
And this number of docs includes the deleted (marked as deleted, to be more precise) documents. Also take a look at this github issue and Simon's comments on the same subject.
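To see how much not-yet-merged deletions can move the score, here is a Python translation of the Java formula quoted above, applied to two copies of the same shard. The document counts are made-up numbers, and the sketch assumes the deleted documents do not contain the query term.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Python port of the quoted formula: log(numDocs / (docFreq + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0

# Hypothetical primary copy: segments already merged, only live docs counted.
primary_idf = classic_idf(doc_freq=50, num_docs=1000)

# Hypothetical replica copy: same 1000 live docs, but 200 docs that are
# only marked as deleted still count towards numDocs.
replica_idf = classic_idf(doc_freq=50, num_docs=1200)

idf_gap = replica_idf - primary_idf
```

The same term in the same document gets a higher IDF on the replica than on the primary, which is enough to produce different scores depending on which copy serves the request.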
