How to reduce Elasticsearch Lucene segments without force merge

We have a cluster which stores 1.5m records, with a total size of 3.5GB. Every 30 minutes around 2-5k records get updated or created. Up until now, after mass indexing the pre-existing data, we were force merging to bring the number of segments down from 30-35 to 1, which greatly improves search performance. After a few days the number of segments normally rises and levels out at about 7 or 8, and performance is still OK.
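For reference, the force merge step described above might look roughly like this; a minimal sketch using Python's requests against the REST API, with the host and index name as placeholders:

```python
import requests

ES = "http://localhost:9200"   # placeholder host
INDEX = "my-index"             # placeholder index name

# Force merge down to a single segment after a mass index.
# This is the call the question refers to; it can be expensive on large indices.
resp = requests.post(
    f"{ES}/{INDEX}/_forcemerge",
    params={"max_num_segments": 1},
)
resp.raise_for_status()
print(resp.json())
```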
The issue with this is that we plan to scale our data to around 80GB. If we do so, my concern is that by force merging after the initial mass index, the resulting segment will be larger than 5GB, at which point it will no longer be considered for automatic merging by Elasticsearch, and performance will decrease. Without force merge, though, I believe the number of segments will be too high.
Is there a way to force Elasticsearch to merge more aggressively, without calling the force merge API? We have no users in the evenings or on weekends, so ideally we could mass index and then give it all weekend to merge the segments down to a lower number, with no concern for search performance during that time.
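One direction worth exploring (not from the original post) is loosening the background tiered merge policy instead of force merging. A hedged sketch, with placeholder host/index and illustrative values rather than recommendations:

```python
import requests

ES = "http://localhost:9200"   # placeholder host
INDEX = "my-index"             # placeholder index name

# index.merge.policy.segments_per_tier (default 10) controls how many segments
# the background merger tolerates per tier; lowering it merges more aggressively.
# index.merge.policy.max_merged_segment (default 5gb) caps the size of segments
# produced by automatic merges. Exact behaviour varies by Elasticsearch version.
settings = {
    "index": {
        "merge": {
            "policy": {
                "segments_per_tier": 5,
                "max_merged_segment": "5gb",
            }
        }
    }
}
resp = requests.put(f"{ES}/{INDEX}/_settings", json=settings)
resp.raise_for_status()
```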

Related

Elasticsearch force merging and merging (to do or not to do force merge)

I have a scenario in which we have indices for each month, and each document has a field, say expiry date; according to that expiry date, documents are deleted when they expire. A one-month-old index is moved from a hot node to a warm node (all my questions below pertain to the indexes on warm nodes).
Now, I understand Elasticsearch will merge the segments as needed.
Here is my first question: how does Elasticsearch determine that segments need to be merged?
I have come across the property index.merge.policy.expunge_deletes_allowed, which has a default value of 10%. Does this property dictate when merging happens? It says 10% deleted documents; what does that mean exactly? Suppose a segment has 100 documents and I delete 11 of them (and they happen to be in the same segment); does that mean the default limit of 10% has been met?
Coming back to the scenario: as my documents get deleted, at some point all the documents in an index will have been deleted. What will the segments of that index look like then? Will it have 0 segments, or just 1 to hold the index metadata?
Another question regarding force merge: if I choose to force merge to get rid of all the deleted documents on disk, and the force merge results in a segment larger than 5GB, then, as written here: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html#forcemerge-api-desc
Snippet:
Force merge should only be called against an index after you have finished writing to it. Force merge can cause very large (>5GB) segments to be produced, and if you continue to write to such an index then the automatic merge policy will never consider these segments for future merges until they mostly consist of deleted documents. This can cause very large segments to remain in the index which can result in increased disk usage and worse search performance.
When will my segment (greater than 5GB) get merged automatically, if at all? The docs say it will when it consists mostly of deleted documents, but that's vague: what does "mostly" mean here? What's the threshold?
Another question: it is suggested that force merge should only be run on read-only indexes. Why is that? How does it degrade performance? Coming back to my scenario, I will still have some updates and new documents arriving in my indexes on warm nodes even after I force merge them, but at a very low rate (less than 5% of the documents will be updated, and maybe a couple of hundred new documents added to those indexes).
Also, what if I force merge four 450GB indexes (each with 16 shards) in parallel; how will that affect my search speed? I read somewhere that by default each force merge request is executed in a single thread, and that it is throttled if need be. Does that mean that if search requests increase, the merging will be paused?
Thank you for your patience and time.
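Not part of the original question, but one way to observe the single-threaded force merge behaviour asked about above is to watch the dedicated force_merge thread pool while merges run; a minimal sketch with Python's requests (the host is a placeholder):

```python
import requests

ES = "http://localhost:9200"   # placeholder host

# Each node has a force_merge thread pool of size 1 by default, which is why
# force merges on the same node effectively run one at a time.
resp = requests.get(
    f"{ES}/_cat/thread_pool/force_merge",
    params={"v": "true", "h": "node_name,name,size,active,queue"},
)
print(resp.text)
```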

Does a huge number of deleted docs affect ES query performance?

I have a few read-heavy indices in my ES cluster (where I've started seeing performance issues) with ~50 million docs, and I noticed that most of them have around 25% of their total documents marked as deleted. I know that the deleted-document count should decrease over time as background merge operations happen, but in my case the count stays at around ~25% of total documents, and I have the following questions/concerns:
Does this huge number of deleted documents affect search performance? They are still part of the immutable Lucene segments, a search hits all segments, and the latest version of each document has to be resolved, so the segments stay large because they contain so many deleted docs.
Will the periodic merge operation take a lot of time and become inefficient when there is a huge number of deleted documents?
Is there any way to delete this huge number of deleted docs in one shot, since it looks like the background merge operation is not able to keep up?
Thanks
Your deleted documents are still part of the index, so they impact search performance (but I can't tell you if it's a huge impact).
For the periodic merge, Lucene is "reluctant" to merge heavy segments as it requires some disk space and generates a lot of IO.
You can get some precious insight into your segments thanks to the Index Segments API.
If you have segments close to the 5GB limit, it is probable that they won't be merged automatically until they mostly consist of deleted docs.
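For example, a quick way to see per-segment size and deleted-doc counts is the cat segments API; a sketch with Python's requests, where the host and index name are placeholders:

```python
import requests

ES = "http://localhost:9200"    # placeholder host
INDEX = "my-index"              # placeholder index name

# List each segment with its size and how many of its docs are marked deleted,
# which helps spot large segments dominated by deletions.
resp = requests.get(
    f"{ES}/_cat/segments/{INDEX}",
    params={"v": "true", "h": "shard,segment,docs.count,docs.deleted,size"},
)
print(resp.text)
```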
You can force a merge on your index with the force merge API
Remember that a force merge can put some stress on the cluster for huge indices. An option exists to only expunge deleted documents, which should reduce the burden:
only_expunge_deletes (Optional, Boolean) If true, only expunge segments containing document deletions. Defaults to false.
In Lucene, a document is not deleted from a segment; just marked as deleted. During a merge, a new segment is created that does not contain those document deletions.
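Put together, such a call might look like this; a sketch with placeholder host and index name:

```python
import requests

ES = "http://localhost:9200"   # placeholder host
INDEX = "my-index"             # placeholder index name

# Rewrite only the segments that contain deleted documents, instead of
# merging everything down to a target segment count.
resp = requests.post(
    f"{ES}/{INDEX}/_forcemerge",
    params={"only_expunge_deletes": "true"},
)
resp.raise_for_status()
print(resp.json())
```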
Regards

Elasticsearch index by date search performance - to split or not to split

I am currently playing around with Elasticsearch (ES). We are ingesting sensor data and for 3 years we have approximately 1,000,000,000 documents in one index, making the index about 50GB in size. Indexing performance is not that important as new data only arrives every 15 minutes per sensor on average, therefore I want to focus on searching and aggregating performance. We are running a front-end showing basically a dashboard about average values from last week compared to one year before etc.
I am using ES on AWS and after performance on one machine was quite slow, I spun up a cluster with 3 data nodes (each 2 cores, 8 GB mem), and gave the index 3 primary shards and one replica. Throwing computing power at the data certainly improved the situation and more power would help more, but my question is:
Would splitting the index, for example by month, increase performance? Or, more specifically: is querying (especially by date) a smaller index faster if I adjust the queries adequately, or does ES already 'know' where to find specific dates within a shard?
(I know about other benefits of having smaller indices, like being able to roll over and keep only a specific time interval, etc.)
1/ Elasticsearch only knows where to find a specific date in an index if your index is sorted by your date field. You can check the documentation here.
In your use case, it can drastically improve search performance. And since all the data will be added at the "end of the index" (because it is sorted by date), you should not see much indexing overhead. A sketch of a date-sorted index is shown after this answer.
2/ Without an index sort, smaller time-bounded indices will work better (even if you target all of your indices), since this often allows your range query to be rewritten internally to a match_all / match_none query.
For more information about this behavior, you should read this blog post:
Instant Aggregations: Rewriting Queries for Fun and Profit
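As an illustration of the index sorting mentioned in 1/, a date-sorted index could be created like this; a sketch where the host, index name, and timestamp field are placeholders, and noting that sort settings can only be set at index creation time:

```python
import requests

ES = "http://localhost:9200"   # placeholder host

# Index sorting must be defined when the index is created; it cannot be
# added to an existing index.
body = {
    "settings": {
        "index": {
            "sort": {"field": "timestamp", "order": "desc"}  # placeholder date field
        }
    },
    "mappings": {
        "properties": {
            "timestamp": {"type": "date"},
        }
    },
}
resp = requests.put(f"{ES}/sensor-data-sorted", json=body)  # placeholder index name
resp.raise_for_status()
```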

How to add another shard to production for tarantool Database, without downtime?

We use the Tarantool database (sharded using vshard) in production. We started directly with 4 shards. Now we want to increase it to 6 without downtime, but after adding two more shards, the rebalancer kicks in and it doesn't allow reads/writes to happen. Is there any way rebalancing can happen while still supporting all kinds of operations? We can afford to increase the operation time, but it should succeed. What is the best practice for adding a shard to Tarantool with the minimum inconvenience on the product front?
Currently, the only solution we can think of is to go into maintenance mode and let the rebalancing finish in the minimum time possible.
You cannot write to a bucket that is being transferred right now, but you can write to other buckets (so it's not like the whole shard is locked up).
Moreover, you can mitigate the effect by
- making buckets smaller (increase bucket_count)
- making rebalancing slower so that fewer buckets are transferred simultaneously (rebalancer config).
Suppose you have 16384 buckets and your dataset is 75GB. That means the average bucket size is around 5 MB. If you decrease the rebalancer_max_receiving parameter to 10, you'll have only 10 buckets (about 50 MB) being transferred simultaneously (and only those are locked for writes).
This way, rebalancing will be pretty slow, but given that your clients can perform retries and your network between shards is fast enough, the 'write-lock' effect should go unnoticed altogether.
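A back-of-the-envelope check of those numbers, using the figures from the answer above (dataset size and settings are taken from that example, not measured):

```python
# Rough estimate of how much data is write-locked at any moment during
# rebalancing, using the figures quoted in the answer above.
bucket_count = 16384               # total vshard buckets
dataset_gb = 75                    # total dataset size in GB (example figure)
rebalancer_max_receiving = 10      # buckets received simultaneously

avg_bucket_mb = dataset_gb * 1024 / bucket_count
in_flight_mb = avg_bucket_mb * rebalancer_max_receiving

print(f"average bucket size: {avg_bucket_mb:.1f} MB")            # ~4.7 MB
print(f"data in flight (write-locked): {in_flight_mb:.1f} MB")   # ~47 MB
```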

What's the downside to using daily indexes in Elasticsearch?

I have 10 indexes that I rotate on a weekly basis, each of which can reach up to 100GB with 10-20 million documents depending on the index. After I rotate I typically optimize, but this can take quite a while, and I even bumped into an OOM issue with a particularly heavy index.
I thought about moving to daily indexes instead. This would speed up optimization and would allow me to archive/close indexes on a more granular level.
Is there any downside to using a daily rather than a weekly rotation scheme? I know there are a lot of variables that might influence this, so if there isn't a straight answer, what are the best practices with regard to index rotation?
Thanks!
You'll use more RAM with daily indexes if you keep the shard/replica count the same for a daily index as you currently have for a weekly one. The more segments/shards/indexes, the more RAM your nodes will use.
Your optimize will likely be faster, true, and you can close on a daily level like you said.
Your queries should also be faster.
I'm in the other boat: I found this while researching what's been done on re-indexing into weekly/monthly indexes after a set number of days. I was keeping 45+ daily indexes open, each 300-700GB, and the JVM heap was running at ~80%. I hope to take the oldest days, 7 at a time, convert them to weekly indexes, and lower RAM usage while still keeping the indexes open (then maybe convert to monthly, etc.).
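One way to watch the heap pressure described here (an illustrative sketch, not from the original answer; the host is a placeholder and the column names assume the _cat/nodes API of a reasonably recent Elasticsearch):

```python
import requests

ES = "http://localhost:9200"   # placeholder host

# Per-node heap usage alongside segment count and segment memory, to see how
# much memory the open indexes' segments are holding onto on each node.
resp = requests.get(
    f"{ES}/_cat/nodes",
    params={"v": "true", "h": "name,heap.percent,segments.count,segments.memory"},
)
print(resp.text)
```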
