Why did the size of the stored index change dramatically? - Elasticsearch

There is a company index. Today I noticed a strange phenomenon: the doc count does not change, but the size of the stored index changed dramatically from 22.03GB to 27.14GB and then almost instantly dropped to 20.44GB. Why is that? What is Elasticsearch doing internally?
More metrics
It seems the segment count and size both decreased. Could these metrics explain this phenomenon?
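
For reference, here is a minimal sketch (the host and the index name are placeholders, not taken from the question) that reads the store size and segment count from the index stats API; a background merge shows up as the segment count dropping while the store size briefly grows and then shrinks, which matches the numbers described above.

    import requests

    ES = "http://localhost:9200"   # assumed local cluster; adjust as needed
    INDEX = "company"              # placeholder index name

    resp = requests.get(f"{ES}/{INDEX}/_stats/store,segments")
    resp.raise_for_status()
    totals = resp.json()["_all"]["total"]

    print("store size (bytes):", totals["store"]["size_in_bytes"])
    print("segment count:     ", totals["segments"]["count"])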

Related

Elasticsearch force merging and merging (to do or not to do force merge)

I have a scenario in which we have an index for each month, and each document has a field, say expiry date; according to that expiry date, documents are deleted when they reach their expiry. A one-month-old index is moved from a hot node to a warm node (all my questions below pertain to the indices that are on warm nodes).
Now, I understand that Elasticsearch will merge the segments as needed.
Here is my first question: how does Elasticsearch determine that segments now need to be merged?
I have come across a property, index.merge.policy.expunge_deletes_allowed, which has a default value of 10%. Does this property dictate when the merging will happen? And it says 10% deleted documents; what does that mean exactly? Suppose a segment has 100 documents and I delete 11 of them (and they happen to be in the same segment): does that mean the default limit of 10% has been met?
Coming back to the scenario: as my documents get deleted, at some point all the documents in an index will have been deleted. What will the segments of that index look like then? Will it have 0 segments, or just 1 to hold index metadata?
Another question regarding force merge: suppose I choose to force merge in order to get rid of all the deleted documents on disk, and the force merge results in a segment larger than 5 GB, as described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html#forcemerge-api-desc
Snippet:
Force merge should only be called against an index after you have finished writing to it. Force merge can cause very large (>5GB) segments to be produced, and if you continue to write to such an index then the automatic merge policy will never consider these segments for future merges until they mostly consist of deleted documents. This can cause very large segments to remain in the index which can result in increased disk usage and worse search performance.
When will my segment (greater than 5GB) get merged automatically, if at all? The docs say it will when it consists mostly of deleted documents. That's vague; what does "mostly" mean here? What's the threshold? (A sketch of inspecting per-segment deleted-doc counts follows this question.)
Another question: it is suggested that force merge should only be done on indices that are read-only. Why is that? How does it degrade performance? Coming back to the scenario, I will have some updates and new documents arriving on my warm-node indices even after I force merge them, but the frequency of those updates and new documents will be very low (say, less than 5% of the documents will be updated, and maybe a couple of hundred new documents could be added to those indices).
Also, what if I am force-merging four 450GB indices (each with 16 shards) in parallel; how will that affect my search speed? I read somewhere that by default each force merge request is executed in a single thread, and that it is throttled if need be. Does that mean that if search requests increase, the merging will be paused?
Thank you for your patience and time.
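
As a side note on inspecting this yourself, here is a minimal sketch (host and index name are placeholders) that lists each segment's size and deleted-doc count via the cat segments API, which is one way to reason about the 5GB ceiling and the per-segment ratio of deleted documents:

    import requests

    ES = "http://localhost:9200"    # assumed local cluster
    INDEX = "my-monthly-index"      # placeholder index name

    resp = requests.get(
        f"{ES}/_cat/segments/{INDEX}",
        params={"format": "json", "h": "segment,size,docs.count,docs.deleted"},
    )
    resp.raise_for_status()

    # One row per segment: name, on-disk size, live docs, deleted docs.
    for seg in resp.json():
        print(seg["segment"], seg["size"], seg["docs.count"], seg["docs.deleted"])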

Does a huge number of deleted docs affect ES query performance?

I have a few read-heavy indices (I've started seeing performance issues on them) in my ES cluster, which has ~50 million docs, and I noticed that most of them have around 25% of their total documents marked as deleted. I know that this deleted-document count decreases over time as background merge operations happen, but in my case the count is always around ~25% of total documents, and I have the questions/concerns below:
Will this huge number of deleted documents affect search performance? They are still part of Lucene's immutable segments, and a search goes to all the segments before the latest version of a document is returned, so the immutable segments would be large because they contain a huge number of deleted docs, plus there is another operation to figure out the latest version of each doc.
Will the periodic merge operation take a lot of time and be inefficient if a huge number of deleted documents are present?
Is there any way to remove this huge number of deleted docs in one shot, since it looks like the background merge operation is not able to keep up?
Thanks
Your deleted documents are still part of the index, so they impact search performance (but I can't tell you whether it's a huge impact).
For the periodic merge, Lucene is "reluctant" to merge heavy segments, as that requires extra disk space and generates a lot of IO.
You can get some valuable insight into your segments thanks to the Index Segments API.
If you have segments close to the 5GB limit, they probably won't be merged automatically until they mostly consist of deleted docs.
You can force a merge on your index with the force merge API.
Remember that a force merge can put some stress on the cluster for huge indices. An option exists to only expunge deleted documents, which should reduce the burden:
only_expunge_deletes (Optional, boolean) If true, only expunge segments containing document deletions. Defaults to false.
In Lucene, a document is not deleted from a segment; just marked as deleted. During a merge, a new segment is created that does not contain those document deletions.
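
For illustration, a minimal sketch (host and index name are placeholders) of issuing that expunge-only force merge with the Python requests library:

    import requests

    ES = "http://localhost:9200"     # assumed local cluster
    INDEX = "read-heavy-index"       # placeholder index name

    # Rewrite only the segments that contain deleted documents.
    resp = requests.post(
        f"{ES}/{INDEX}/_forcemerge",
        params={"only_expunge_deletes": "true"},
    )
    resp.raise_for_status()
    print(resp.json())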
Regards

ElasticSearch - Does setting an index to read-only improve performance?

I have this scenario where a system generates an index every day inside a node. Each index has 1 primary shard.
So all the documents indexed in a day go to a certain index, and after the day has passed, a new index is created.
I keep the indices of the last 60 days (which means I always have 60 shards in the node). I can't close the old indices because I want them to remain searchable. After the 60th day has passed, I delete them.
As I was reading the following article, I noticed this about the indexing buffer:
It defaults to 10%, meaning that 10% of the total memory allocated to a node will be used as the indexing buffer size. This amount is then divided between all the different shards
This means that for the current day's index I get 10% / 60 of the indexing buffer memory, so I am not really using the full 10%.
The question is: what happens if I set the older indices to read-only?
index.blocks.read_only
Set to true to make the index and index metadata read only, false to allow writes and metadata changes.
Will I see a benefit from doing this? For example, having the entire 10% of the indexing buffer for my only writable index? Or an improvement in searches on the other indices, since they could be merged into a single segment as they no longer receive writes?
Thanks!
Pablo
PS: I'm using ElasticSearch 1.7.3, but I'm looking to migrate to 2.2 in the future. I would also like to know, if possible, whether I would see a benefit if this were applied to 2.2.
No, you won't get a benefit from adding the read_only block. The only way you would free up the buffer is to close the index entirely.
Also, you may be interested to know about https://github.com/elastic/elasticsearch/pull/14121 which gives "actively indexing" shards a bigger share of the buffer. Coming in Elasticsearch 5.0.
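
For completeness, a minimal sketch (host and index name are placeholders) of how the read-only block discussed above is applied and removed through the index settings API:

    import requests

    ES = "http://localhost:9200"     # assumed local cluster
    INDEX = "logs-2016.03.01"        # placeholder daily index

    # Block writes and metadata changes on an old index...
    requests.put(f"{ES}/{INDEX}/_settings",
                 json={"index.blocks.read_only": True}).raise_for_status()

    # ...and lift the block again if the index ever needs writes.
    requests.put(f"{ES}/{INDEX}/_settings",
                 json={"index.blocks.read_only": False}).raise_for_status()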

Same index data but different store size in ElasticSearch?

I am evaluating the storage size required by ElasticSearch. However, I find that the store size varies every time I index the same set of data.
For example, the size of the data I used is 35MB. I ran the indexing several times, and the resulting store sizes were between 76MB and 85MB, not a fixed number (not repeatable?).
Can someone explain this? Thanks in advance:)
After you've inserted all of your data, have you tried doing an optimize (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-optimize.html) to bring the number of segments down to 1?
Basically, the timing of the Lucene segment merges causes the size differences you are seeing. They are not deterministic because, once a merge kicks off, the amount of data you insert before the merge completes affects the size of the remaining segments. You can read a little more about segment merges here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-merge.html and here: Understanding Segments in Elasticsearch
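
For illustration, a minimal sketch (host and index name are placeholders) of merging down to a single segment before comparing sizes; note that on the old versions linked above the endpoint was _optimize, while current Elasticsearch exposes it as _forcemerge:

    import requests

    ES = "http://localhost:9200"   # assumed local cluster
    INDEX = "eval-index"           # placeholder index name

    # Merge the index down to a single segment so repeated loads of the
    # same data settle at a comparable store size.
    requests.post(f"{ES}/{INDEX}/_forcemerge",
                  params={"max_num_segments": 1}).raise_for_status()

    # Compare the resulting store sizes across runs.
    print(requests.get(f"{ES}/_cat/indices/{INDEX}", params={"v": "true"}).text)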

Index linear growth - Performance degradation

We have 4 shards, with a 14GB index on each of them.
Each shard has a master and 3 slaves (each of them with 32GB RAM).
We expect the index size to double or triple in the near future.
So we thought of merging our indexes into 28GB indexes, so that each shard has a 28GB index, and we also increased the RAM on each slave to 48GB.
We made these changes locally and tested by sending the same 10K realistic queries to each server with the 14GB and 28GB indexes, and we found that:
For server with 14GB index (48GB RAM): search time was 480ms, number of index hits: 3.8G
For server with 28GB index (48GB RAM): search time was 900ms, number of index hits: 7.2G
So we saw that having the whole index in RAM doesn't help sustain search-time performance: search time roughly doubled when the index size doubled.
We were thinking of keeping the 4-shard configuration, but it looks like we now have to add another shard, or another slave to each shard.
Is there any other way we can configure our servers so that performance isn't affected even when the index size doubles or triples?
I'd hate to say it depends, but it... depends.
The total size of your index on each shard is 14GB, which basically doesn't mean much of anything to Solr. To get a real feel for performance: how unique are the terms indexed? An index of 14GB worth of data consisting of the single word "cat" over and over again will be really quick.
Also, have you confirmed that you need the following features? Disabling them can boost performance by large amounts:
Schema
Stored Fields
Do you need stored fields? Removing them can greatly increase performance (you can safely have an entire index without any stored fields and rely completely on facets, pivots, and other features in Solr to drive a UX).
omitNorms
You can, in some instances, set this flag to true so that norms are not stored, which reduces memory use in general and increases performance.
omitTermFreqAndPositions
Term frequencies and positions can be omitted (set this flag to true), which reduces memory in general and increases performance.
System
Optimize Core/Index (Segment Count)
Index optimization is important when dealing with larger index sizes. Ensure each core is optimized and that, when you look at the core, the segment count is 1. I found that this plays a more important role as the index size increases (it ties into OS-level file caching and the fact that it's easier to read one large file rather than many small files). A minimal optimize call is sketched at the end of this answer.
Term Index Interval/Frequency
Configuring the term index interval may be required (default 256) if you have one or more fields that contain highly unique values (for example GUIDs/UUIDs, or unique IDs in general). Typically, the lower the TIF the more memory you need; the higher the TIF, the less memory you need but the more disk seeks you may incur.
Allocating too much RAM
Solr works best with a good split between the OS-level disk cache and the RAM used for faceting; you'd be surprised that you can actually get better performance by tweaking other parameters that lower the required RAM usage and free up resources for the OS disk cache.
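
As referenced in the optimize section above, a minimal sketch (host and core name are placeholders) of triggering a Solr optimize through the update handler, which by default merges the core down to a single segment:

    import requests

    SOLR = "http://localhost:8983/solr"   # assumed local Solr instance
    CORE = "products"                     # placeholder core name

    # Ask the update handler to optimize the core.
    resp = requests.get(f"{SOLR}/{CORE}/update", params={"optimize": "true"})
    resp.raise_for_status()
    print(resp.text)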
