ElasticSearch handling for max shard size - elasticsearch

I learnt that an ES shard is nothing but a Lucene index, and that the maximum number of documents in a Lucene index is Integer.MAX_VALUE - 128 (approximately 2 billion). However, I could not find anywhere in the ES reference how this scenario is handled. Does ES fail, or does it assign another shard to documents with the same route?
Or is it something that we need to plan for in advance, while designing the indexing strategy?

Related

Lucene and Elasticsearch going past the document limit

What happens when we try to ingest more documents into a Lucene instance past its maximum limit of 2,147,483,519?
I read that as we approach 2 billion documents we start seeing performance degradation.
But does Lucene just stop accepting new documents past its maximum limit?
Also, how does Elasticsearch handle the same scenario for one of its shards when its document limit is reached?
Every Elasticsearch shard is a Lucene index under the hood, so this limit applies to an Elasticsearch shard as well, and based on this Lucene issue it looks like it stops indexing further docs.
Performance degradation depends on several factors, such as the size of the docs, the JVM heap allocated to the Elasticsearch process (~32 GB is the practical maximum), the file system cache available to Lucene, the number of CPUs, network bandwidth, etc.
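To make the limit concrete, here is a minimal sketch of the arithmetic. The constant follows the figure quoted above (Integer.MAX_VALUE - 128); the helper name `days_until_limit` and the ingest numbers are made up for illustration, and the estimate ignores details such as nested documents, each of which counts as a separate Lucene document.

```python
# Lucene's per-index document limit, as quoted above: Integer.MAX_VALUE - 128.
JAVA_INT_MAX = 2**31 - 1
LUCENE_MAX_DOCS = JAVA_INT_MAX - 128


def days_until_limit(current_docs: int, docs_per_day: int) -> int:
    """Whole days of ingest left before ONE shard reaches LUCENE_MAX_DOCS.

    Hypothetical helper: assumes a constant ingest rate into a single shard.
    """
    if docs_per_day <= 0:
        raise ValueError("docs_per_day must be positive")
    remaining = max(LUCENE_MAX_DOCS - current_docs, 0)
    return remaining // docs_per_day


print(LUCENE_MAX_DOCS)
# Illustrative numbers only: a shard holding 1.9 billion docs, growing by 5 million a day.
print(days_until_limit(current_docs=1_900_000_000, docs_per_day=5_000_000))
```

A check like this is only a planning aid: since the shard stops accepting documents at the limit, the headroom has to be created beforehand, by adding primary shards (which requires re-indexing) or by rolling over to a new index.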

Elasticsearch index policy creation best practice/performance

I am designing a search system based on Elasticsearch. After reading a lot, I have seen that some systems, such as logging systems, use a policy of multiple indices for the same kind of content: they create one index per day, named like mylogs-12-02-2020, and then search across all the indices that match the mylogs-* pattern. Each of those indices has its own primary shards and replicas.
My question is about search performance: which would be more performant, searching one index of 5 million documents with n shards, or searching 50 indices of 100,000 documents each? Does anyone have experience with the best practice to follow?
I am assuming that my system will have an approximate growth of 200,000 documents per day.
What is the best practice: separate the data into multiple indices, or have a single index with several primary shards on different nodes (so that they do not compete for the same resources when searching/indexing)?
When doing a search on mylogs-*, does Elasticsearch run it in parallel across the indices and, within each index, across its shards?
The default Elasticsearch configuration given by @Umar is outdated: starting with ES 7.0, the default number of primary shards was reduced to 1. You can check this in the official ES breaking changes announcement.
Nobody can design the perfect ES index with the optimal number of shards and replicas up front; it requires continuous fine-tuning over time. Some factors that affect the design:
Read-heavy or write-heavy system.
Time-based indices (like your log searches), where searches normally hit the more recent logs, versus an e-commerce product catalog or website search, where you can't divide the indices by time.
ES cluster (multi-tenant vs. dedicated to a single index).
The above are just a few examples; there are hundreds of other factors you could consider while designing your ES index configuration. The idea is to start with the more crucial parameters first (for example, changing the number of primary shards requires re-indexing), account for near-future growth, and fine-tune later based on the current system performance.
I would strongly suggest you go through my detailed blog, which answers your question (searching one index with more docs versus searching more indices/shards with fewer docs) in detail through a real-world case study.
The above blog also explains the ES decision to change the long-time default number of primary shards from 5 to 1.
Answer to your question below:
Question: When doing a search on mylogs-*, does Elasticsearch run it in parallel across the indices and, within each index, across its shards?
Answer: Yes. ES has a distributed architecture, and an ES index is made of shards, each of which is a Lucene index and a full-blown search engine. Every ES query is executed by multiple threads in parallel if it needs to hit multiple shards (whether of the same index or of multiple indices), provided threads are free; otherwise, once a thread finishes, it is then used to query another shard. This is why ES, like other distributed systems, is so fast.
By default, an Elasticsearch index has 5 primary shards and 1 replica for each. But the problem is that the default configuration is not suitable for every use case.
Shard size is quite critical for search queries. If too many shards are assigned to an index, the Lucene segments will be small, which increases overhead. Lots of small shards also reduce query throughput when multiple queries are made simultaneously. On the other hand, shards that are too large decrease search performance and lengthen recovery time after a failure. Therefore, Elasticsearch suggests that a shard's size should be around 20 to 40 GB.
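As a rough illustration of the sizing guideline, a starting shard count can be derived from the expected data volume. This is only a sketch: the function name `estimate_primary_shards` and the 30 GB target (the midpoint of the 20 to 40 GB range above) are assumptions, and real sizing also depends on query load, hardware, and growth.

```python
import math


def estimate_primary_shards(total_data_gb: float, target_shard_gb: float = 30.0) -> int:
    """Rough starting point for the number of primary shards.

    Hypothetical helper: target_shard_gb defaults to 30, an assumed midpoint
    of the 20-40 GB guideline; it is not an Elasticsearch setting.
    """
    if total_data_gb < 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be non-negative and the target must be positive")
    # An index always has at least one primary shard.
    return max(1, math.ceil(total_data_gb / target_shard_gb))


# Illustrative data volumes in GB.
for total_gb in (5, 120, 1000):
    print(total_gb, "GB ->", estimate_primary_shards(total_gb), "primary shard(s)")
```

Treat the result as a first guess to refine with measurements, since changing the number of primary shards later requires re-indexing.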
Keep in mind that it is the shard that acts as a separate search engine in itself, not the index. Indices are a data organization mechanism that allows the user to partition data a certain way. That is all!
For further details read this article.

How to get better relevance without compromising on performance, scalability and avoid the sharding effect of Elasticsearch

Let's suppose I have a big index consisting of 500 million docs. By default, ES creates 5 primary shards for the reasons below, and I go with the same setting.
Performance: It takes less time to search a shard with fewer documents (100 million in my use case) than a single shard with a huge number of documents (500 million). It also allows operations to be distributed and parallelized across shards.
Horizontal scalability (HS): horizontally split/scale your content volume.
But when we search, by default it just goes to 1 shard and gives the result. In this case, relevance isn't accurate (as IDF is majorly impacted), and it might not even give any result if my matching document is on another shard. This is called the sharding effect.
The above issue is explained in detail here, and there are the 2 options below to avoid it, but I think both solutions have some cons:
1. Document routing: In this case all the documents will be on the same shard, which defeats the whole purpose of sharding.
2. dfs_query_then_fetch search type: there is a performance cost associated with it.
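To see what routing does to document placement, here is a simplified sketch of the idea "shard = hash(routing value) % number of primary shards". It is not Elasticsearch's actual implementation: `zlib.crc32` is only a stand-in for the hash ES uses internally, and the name `pick_shard` and the sample IDs are made up for illustration.

```python
import zlib
from collections import Counter


def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number.

    Assumption: crc32 stands in for Elasticsearch's own hash function;
    only the "hash modulo shard count" idea is illustrated here.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


NUM_SHARDS = 5

# Default behaviour: the routing value is the document ID, so docs spread out.
by_id = Counter(pick_shard(f"doc-{i}", NUM_SHARDS) for i in range(10_000))

# Custom routing: every doc shares one routing value, so all land on one shard.
by_custom = Counter(pick_shard("customer-42", NUM_SHARDS) for _ in range(10_000))

print(sorted(by_id.items()))
print(sorted(by_custom.items()))
```

The sketch also shows why a full shard cannot simply spill over: the same routing value always maps to the same shard number, and changing the number of primary shards changes the mapping, which is why that change requires re-indexing.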
I am interested to know the following:
What does ES do by default? Or is there any config by which it can be controlled?
Is there any other out-of-the-box solution that ES provides to avoid the sharding effect?
First of all, this part of your question is not accurate:
But when we search by default it just goes to 1 shard and gives the
result. in this case, relevance isn't accurate(as idf be majorly
impacted) and also it might even not give any result if my matched
document is on another shard. and its called as The Sharding Effect.
The bold part is false. The search request is sent to all shards (of course, or no one would use Elasticsearch!), but the score is computed on a per-shard basis. So yes, you can have an accuracy problem with multiple shards, but only if you have very few documents. With 500 million, accuracy will not be a problem (unless you make bad use of document routing; see here for more information).
So when you search for 10 results for a query, each shard returns its 10 best matches for the query, and then the results from the shards are aggregated by the coordinating node to give the best 10 results for the whole index.
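The scatter-gather step described above can be sketched in a few lines. This is a toy model, not Elasticsearch code: the hits are hard-coded (score, doc_id) pairs, and the names `shard_top_k` and `coordinate` are made up for illustration.

```python
import heapq
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (score, doc_id)


def shard_top_k(hits: Iterable[Hit], k: int) -> List[Hit]:
    """What one shard returns: its k best-scoring hits."""
    return heapq.nlargest(k, hits)


def coordinate(shards: List[List[Hit]], k: int) -> List[Hit]:
    """What the coordinating node does: merge per-shard top-k lists into a global top-k."""
    per_shard = [shard_top_k(hits, k) for hits in shards]
    return heapq.nlargest(k, (hit for top in per_shard for hit in top))


# Made-up hits on three shards.
shards = [
    [(0.9, "a1"), (0.2, "a2"), (0.5, "a3")],
    [(0.8, "b1"), (0.7, "b2")],
    [(0.1, "c1")],
]
print(coordinate(shards, k=3))
```

Because every shard contributes its own best k, the global top k can never miss a document just because it lives on a different shard.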
You can use 5 shards without fearing any relevancy problem. But don't try to avoid the sharding effect! It is what makes Elasticsearch so cool :D
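The claim that per-shard scoring only hurts with very few documents can be checked with a small calculation. The sketch below assumes a BM25-style IDF formula, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of docs on the shard and n the number containing the term; the document counts are invented purely to contrast a tiny shard with a large one.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """BM25-style inverse document frequency, computed from shard-local statistics."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def relative_gap(a: float, b: float) -> float:
    """Relative difference between two IDF values."""
    return abs(a - b) / max(a, b)


# Tiny shards: 4 docs each, the term appears in 3 docs on one shard and 1 on the other.
small = relative_gap(bm25_idf(4, 3), bm25_idf(4, 1))

# Large shards: 100 million docs each, term frequencies differ only slightly
# (made-up counts, as one would expect when docs are spread evenly by hash).
large = relative_gap(
    bm25_idf(100_000_000, 1_000_300),
    bm25_idf(100_000_000, 999_700),
)

print(f"tiny shards:  relative IDF gap = {small:.3f}")
print(f"large shards: relative IDF gap = {large:.6f}")
```

With only a handful of documents the same term gets very different IDF values on different shards, whereas with many evenly distributed documents the shard-local statistics are close enough that the ranking is effectively the same.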

Elasticsearch Document Count Doesn't Reflect Indexing Rate

I'm indexing data from Spark into Elasticsearch, and according to Kibana, I'm indexing at a rate of 6k/s for the primary shards. However, if you look at the Document Count graph in the lower right, you'll see that it doesn't increase proportionately. How can this index have only 1.3k documents when it's indexing at 5 times that per second?

Multi-index search vs single index search in elastic search

I have a large number of entities of the same type, each having a large number of attributes, and I only have these two choices for storing them:
Storing each item in its own index and performing a multi-index search
Storing all of the entities in a single index and searching only 1 index.
Generally, I want a comparison of the time complexity of searching "n" entities with "m" features in each of the above cases!
The answer lies within the Elasticsearch documentation:
Searching 1 index of 50 shards is exactly equivalent to searching 50
indices with 1 shard each: both search requests hit 50 shards.
If you wish to learn about how shards are allocated on your nodes and how they interact with your index setup, I would suggest this Stack Overflow question as well as the Elasticsearch documentation on scaling.
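The quoted equivalence is easy to check by counting the shards a search would touch. The sketch below is a toy model: the index names, the shard counts, and the helper `shards_hit` are made up, and the wildcard matching uses Python's `fnmatchcase` as a stand-in for index-pattern resolution.

```python
from fnmatch import fnmatchcase
from typing import Dict


def shards_hit(indices: Dict[str, int], pattern: str) -> int:
    """Total primary shards searched for an index name or wildcard pattern.

    `indices` maps index name -> number of primary shards (illustrative data).
    """
    return sum(n for name, n in indices.items() if fnmatchcase(name, pattern))


# Option 1: one index with 50 shards.
one_big = {"entities": 50}

# Option 2: 50 indices with 1 shard each.
many_small = {f"entities-{i:02d}": 1 for i in range(50)}

print(shards_hit(one_big, "entities"))
print(shards_hit(many_small, "entities-*"))
```

Both layouts send the request to the same number of shards, so the choice between them is more about data organization (for example, dropping old indices) than about raw search cost.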
