Generally speaking, what are the tradeoffs (in terms of performance and memory usage) between large and small indexes in Elasticsearch?
Elaborating a little:
Consider a cluster with 8 nodes, each node with 1 shard and 30 GB allocated to the JVM.
Consider also a scenario with 50 million documents per day (all with the same structure and using doc values), retained for 90 days. Each day of documents takes about 35 GB on disk.
I want to run some queries in this cluster, covering a total of 12 hours of data.
These queries are composed of some nested aggregations: a date histogram, followed by a cardinality and a percentile aggregation.
Considering the amount of data, which is better: daily indexes or a single index?
PS: I know this is a "vague" question. My question is more theoretical.
I want to better understand what occurs during an aggregation and how this relates to the number of indexes.
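To make the two layouts in this question concrete, here is a rough back-of-the-envelope sketch in Python. The daily volume and retention come from the question. The shard count per index (8 primaries, one per node, replicas ignored) is an assumption, as is the even spread of data across shards.

```python
# Figures taken from the question.
DAILY_GB = 35          # about 35 GB on disk per day
RETENTION_DAYS = 90    # documents are kept for 90 days

# Assumption: every index gets 8 primary shards (one per node), no replicas counted.
PRIMARY_SHARDS = 8


def shard_size_gb(index_gb: float, primary_shards: int) -> float:
    """Average primary shard size, assuming data is spread evenly."""
    return index_gb / primary_shards


# Layout 1: one index holding all 90 days.
single_index_gb = DAILY_GB * RETENTION_DAYS
single_shard_gb = shard_size_gb(single_index_gb, PRIMARY_SHARDS)

# Layout 2: one index per day.
daily_shard_gb = shard_size_gb(DAILY_GB, PRIMARY_SHARDS)
daily_total_shards = RETENTION_DAYS * PRIMARY_SHARDS

print(f"single index: {single_index_gb} GB, {single_shard_gb} GB per shard")
print(f"daily indexes: {daily_shard_gb} GB per shard, {daily_total_shards} shards in total")
```

Under these assumptions a single index gives very large shards, while daily indexes give many small ones. A 12-hour query touches every shard of the single index, but only the shards of one or two daily indexes.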
Related
I have an Elasticsearch cluster with 1 replica, 2 nodes, and 2 shards in total. One of my indexes, called 'products', is massive: it contains around 7 million records and takes up around 56 GB. So we are planning to split the index into 5 shards for each replica (a total of 10 shards) to increase search speed.
I'm looking for a suggested number of shards to try out, and any other recommendations or tips to increase search speed on this infrastructure.
Also, there's a multi-search query that is slower than others. Hoping for some good suggestions/tips for that too.
Thank you!
Generally speaking, fewer shards are better for reading (i.e. searching), and more are better for indexing.
A 56 GB index in Elasticsearch is not that big, but then it's all relative to your use case. I am not sure that increasing the shard count is worth it here without understanding your use case and setup.
Your best bet would be to use Elasticsearch Rally to do some testing with your data and your queries, and then figure out which configuration works best for your use case: https://esrally.readthedocs.io/en/stable/
I am designing a search system based on Elasticsearch. After reading a lot, I have seen that some systems, such as logging systems, spread the same kind of content over multiple indexes: they create one index per day, named like mylogs-12-02-2020, and then search across all the indices that match the mylogs-* pattern. Each of those indices has its own primary shards and replicas.
My question is about search performance: which would be more performant, searching one index of 5 million documents with n shards, or searching 50 indexes of 100,000 documents each? Does anyone have experience with the best practice to follow?
I am assuming that my system will have an approximate growth of 200,000 documents per day.
What is the best practice: separating the data into multiple indexes, or having a single index with several primary shards on different nodes (so that they do not compete for the same resources when searching / indexing)?
When doing a search on mylogs-*, does Elasticsearch run it in parallel across the indexes, and within each index across its shards?
The Elasticsearch default configuration given by @Umar is outdated: starting with 7.0, the latest major version of ES, the default number of primary shards was reduced to 1. You can check this in the official ES breaking changes announcement.
Nobody can design the perfect ES index with the optimal number of shards and replicas up front; it requires continuous fine-tuning over time. Some factors that affect the design:
Read or Write-heavy system.
Time-based indices (like your log searches), where searches normally hit the more recent logs, versus an e-commerce product catalog or website search, where you can't divide the data into time-based indices.
ES cluster (multi-tenant vs. dedicated to a single index).
The above are just a few samples, and I could give hundreds of other factors to consider while designing your ES index configuration. But the idea is to start with the most crucial parameters first (for example, changing the number of primary shards requires re-indexing), take near-future growth into account, and fine-tune later on based on current system performance.
I would strongly suggest you go through my detailed blog, which answers your question (searching one index with more docs vs. searching more indices/shards with fewer docs) in detail through a real-world case study.
The above blog also explains the ES decision to change the long-time default number of primary shards from 5 to 1.
Answer to your question below:
Question: When doing a search on mylogs-*, does Elasticsearch run it in parallel across the indexes, and within each index across its shards?
Answer: Yes. ES has a distributed architecture, and an ES index is made of Lucene shards, each of which is a full-blown search engine. Every ES query that needs to hit multiple shards (whether of the same index or of multiple indices) is executed by multiple threads in parallel, provided threads are free; otherwise, once a thread finishes, it is reused to query another shard. This is why ES, like other distributed systems, is fast.
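The behaviour described in this answer (query each shard on a free thread, reuse threads as they finish, then merge) can be imitated with a small scatter-gather sketch. This is only an illustration in plain Python, not Elasticsearch code: the shards are in-memory lists, and `search_shard`, `scatter_gather` and `max_threads` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard_docs, term):
    """Stand-in for a per-shard search: count the documents containing the term."""
    return sum(1 for doc in shard_docs if term in doc)


def scatter_gather(shards, term, max_threads):
    """Query every shard with a bounded thread pool, then merge the partial counts.

    When there are more shards than threads, a thread that finishes one shard
    picks up the next pending one.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        partial_counts = list(pool.map(search_shard, shards, [term] * len(shards)))
    return sum(partial_counts)


# Four toy "shards"; they could belong to one index or to several.
shards = [
    ["error disk full", "ok"],
    ["ok", "ok"],
    ["error timeout"],
    ["error disk full", "error timeout", "ok"],
]

total = scatter_gather(shards, "error", max_threads=2)
print(total)
```

The merged result does not depend on how many threads are available; only the elapsed time does, which is why the number of shards relative to free search threads matters.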
By default (before version 7.0), an Elasticsearch index has 5 primary shards and 1 replica for each. But the problem is that the default configuration is not suitable for every use case.
Shard size is quite critical for search queries. If too many shards are assigned to an index, the Lucene segments will be small, which increases overhead. Lots of small shards also reduce query throughput when multiple queries are made simultaneously. On the other hand, shards that are too large decrease search performance and lengthen recovery time after a failure. Therefore, Elasticsearch suggests that one shard’s size should be around 20 to 40 GB.
Keep in mind that it is the shard that acts as a separate search engine in itself, not the index. Indices are a data organization mechanism that lets the user partition data in a certain way. That is all!
For further details read this article.
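As a rough illustration of the 20 to 40 GB guideline above, the sketch below derives a primary shard count from an index size. The helper name `primary_shards_for` and the 30 GB default target are assumptions chosen for the example; real sizing should be validated with your own data and queries.

```python
import math


def primary_shards_for(index_gb: float, target_shard_gb: float = 30) -> int:
    """Smallest number of primary shards that keeps each shard at or below the target size."""
    return max(1, math.ceil(index_gb / target_shard_gb))


# The 56 GB 'products' index mentioned earlier, checked against both ends of the guideline.
for target in (20, 30, 40):
    print(target, "GB target ->", primary_shards_for(56, target), "primary shards")
```

For an index of that size, the guideline points to 2 or 3 primary shards rather than 5.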
I am currently playing around with Elasticsearch (ES). We are ingesting sensor data and for 3 years we have approximately 1,000,000,000 documents in one index, making the index about 50GB in size. Indexing performance is not that important as new data only arrives every 15 minutes per sensor on average, therefore I want to focus on searching and aggregating performance. We are running a front-end showing basically a dashboard about average values from last week compared to one year before etc.
I am using ES on AWS and after performance on one machine was quite slow, I spun up a cluster with 3 data nodes (each 2 cores, 8 GB mem), and gave the index 3 primary shards and one replica. Throwing computing power at the data certainly improved the situation and more power would help more, but my question is:
Would splitting the index, for example by month, increase the performance? Or, to be more specific: is querying (especially by date) a smaller index faster if I adjust the queries accordingly, or does ES already 'know' where to find specific dates in a shard?
(I know about other benefits of having smaller indices, like being able to roll over and keep only a specific time interval, etc.)
1/ Elasticsearch only knows where to find a specific date in an index if your index is sorted by your date field. You can check the documentation here.
In your use case, it can drastically improve search performance. And since all the data will be added at the "end of the index" (because it is sorted by date), you should not see much indexing overhead.
2/ Without index sorting, smaller time-bounded indices will work better (even if you target all your indices), since this will often allow your range query to be rewritten to an internal match_all / match_none query.
For more information about this behavior, you should read this blog post:
Instant Aggregations: Rewriting Queries for Fun and Profit
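The rewrite mentioned in point 2 can be sketched as follows. This is a simplified model in Python, not Elasticsearch's actual implementation: `IndexBounds` and `rewrite_range` are hypothetical names, the index names are made up, and the sketch assumes every document in an index has the date field.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class IndexBounds:
    """Smallest and largest value of the date field stored in one index (hypothetical type)."""
    name: str
    min_date: date
    max_date: date


def rewrite_range(bounds: IndexBounds, gte: date, lte: date) -> str:
    """Return the cheapest query equivalent to the range [gte, lte] on this index."""
    if bounds.max_date < gte or bounds.min_date > lte:
        return "match_none"  # no document can match, so the index is skipped cheaply
    if gte <= bounds.min_date and bounds.max_date <= lte:
        return "match_all"   # every document matches, no per-document range check needed
    return "range"           # partial overlap, the range really has to be evaluated


# Three monthly indices and a query from 1 February to 10 March 2019.
indices = [
    IndexBounds("sensors-2019-01", date(2019, 1, 1), date(2019, 1, 31)),
    IndexBounds("sensors-2019-02", date(2019, 2, 1), date(2019, 2, 28)),
    IndexBounds("sensors-2019-03", date(2019, 3, 1), date(2019, 3, 31)),
]
plan = {i.name: rewrite_range(i, date(2019, 2, 1), date(2019, 3, 10)) for i in indices}
print(plan)
```

With one large index, the same query overlaps the index only partially, so the range has to be evaluated against the whole index.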
I have more than 4000 different fields in one of my indexes, and that number can grow over time.
Elasticsearch sets a default limit of 1000 fields per index, and there must be some reason for that.
So I am thinking that I should not raise the limit set by Elasticsearch.
Instead, I should break my single large index into multiple small indexes.
Before moving to multiple indexes, I have a few questions:
The number of small indexes can grow up to 50. Would searching all 50 indexes at once be slower than searching the single large index?
Is there really a need to break my single large index into multiple indexes because of the large number of fields?
If I use multiple small indexes, the total number of shards will increase drastically (more than 250 shards). Each index would have 5 shards (the default number, which I don't want to change), so a search on these indexes would hit all 250 shards at once. Will this affect my search performance? Note: the number of shards might also grow over time.
If I use a single large index with only 5 shards and a large number of documents, won't this overload these 5 shards?
It strongly depends on your infrastructure. If you run a single node with 50 Shards a query will run longer than it would with only 1 Shard. If you have 50 Nodes holding one shard each, it will most likely run faster than one node with 1 Shard (if you have a big dataset). In the end, you have to test with real data to be sure.
When there is a massive amount of fields, ES gets a performance problem and errors are more likely. The main problem is that every field has to be stored in the cluster state, which takes a toll on your master node(s). Also, in a lot of cases you have to work with lots of sparse data (90% of fields empty).
As a rule of thumb, one shard should contain between 30 GB and 50 GB of data. I would not worry too much about overloading shards in your use-case. The opposite is true.
I suggest testing your use case with fewer shards; go down to 1 shard and 1 replica for your index. The overhead of searching multiple shards (5 primaries, multiplied by the replicas) and then combining the results again is massive in comparison to your small dataset.
Keep in mind that document_type behaviour changed and will change further. Since 6.X you can only have one document_type per Index, starting in 7.X document_type is removed entirely. As the API listens at _doc, _doc is the suggested document_type to use in 6.X. Either move to one Index per _type or introduce a new field that stores your type if you need the data in one index.
I have the following scenario:
More than 100 million items and counting (10 million added each month).
8 Elastic servers
12 Shards for our one index
Until now, all of those items were indexed in the same index (under different types). To improve the environment, we decided to index items by geohash code, with the mantra of no more than 30 GB per shard.
The current status is that we have more than 1500 indices with 12 shards per index, and every item is inserted into one of those indices. As you can imagine, the number of shards has surpassed 20,000.
Our indices are in the format <Base_Index_Name>_<geohash>
My question arises from performance problems that made me question our method. A simple count query in the format GET */_count takes seconds!
If my intention is to query many indices, is this implementation bad? How many indices should a cluster with 8 virtual servers have? How many shards? We have a lot of data and it is growing fast.
Actually, it depends on your usage. A query against all of the indices takes a long time because it has to go to every shard and the results have to be merged afterwards. Querying 20K shards is not an easy task.
If your data is time-based, I would advise adding month or date information to the index name and changing your query to GET indexname201602/_search or GET *201602/_search.
That way you can drastically reduce the number of shards your query executes on, and it will take much less time.
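A minimal sketch of that idea: compute the monthly index names a date range touches, and query only those. The helper `monthly_indices` and the `indexname` prefix are assumptions for illustration, following the indexname201602 naming used in this answer.

```python
from datetime import date


def monthly_indices(prefix: str, start: date, end: date) -> list[str]:
    """List the monthly index names (prefix + YYYYMM) that cover start..end inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}{month:02d}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return names


names = monthly_indices("indexname", date(2015, 12, 15), date(2016, 2, 3))
# Elasticsearch accepts a comma-separated list of index names in the request path.
search_path = "/" + ",".join(names) + "/_search"
print(search_path)
```

A query built this way touches only the shards of three monthly indices instead of every shard in the cluster.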