Same index data but different store size in Elasticsearch?

I am evaluating the storage size Elasticsearch requires, but I find that the store size varies each time I index the same set of data.
For example, the raw data I used is 35 MB. I ran the indexing several times, and the resulting store sizes ranged from 76 MB to 85 MB rather than being a fixed number (not repeatable?).
Can someone explain this? Thanks in advance :)

After you've inserted all of your data, have you tried running an optimize (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-optimize.html) to bring the number of segments down to 1?
Basically, the timing of the Lucene segment merges causes the size differences you are seeing. They are not deterministic: once a merge kicks off, the amount of data you insert before the merge completes affects the size of the remaining segments. You can read a little more about segment merges here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-merge.html and here: Understanding Segments in Elasticsearch
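To make repeated measurements comparable, here is a minimal sketch using the official Python client (the cluster URL and the index name "my_index" are placeholders; recent versions expose optimize as the force merge API):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Merge down to a single segment so repeated runs yield comparable sizes.
    # Newer releases call this _forcemerge; 1.x called it _optimize.
    es.indices.forcemerge(index="my_index", max_num_segments=1)

    # Then read the store size from the stats API.
    stats = es.indices.stats(index="my_index")
    print(stats["indices"]["my_index"]["total"]["store"]["size_in_bytes"])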

Related

Elasticsearch force merging and merging (to force merge or not to force merge)

I have a scenario in which we have an index for each month, and each document has a field, say expiry date; according to that expiry date, documents are deleted when they expire. A one-month-old index is moved from a hot node to a warm node (all my questions below pertain to the indices on warm nodes).
Now, I understand Elasticsearch will merge segments as needed.
Here is my first question: how does Elasticsearch determine that segments now need to be merged?
I have come across the property index.merge.policy.expunge_deletes_allowed, which has a default value of 10%. Does this property dictate when merging happens? And since it says 10% deleted documents, what does that mean exactly? Suppose a segment has 100 documents and I delete 11 of them (all of which happen to be in that segment); does that mean the default limit of 10% has been met? (A sketch for inspecting these settings follows at the end of this question.)
Coming back to the scenario: as my documents get deleted, at some point all the documents in an index will have been deleted. What will the segments of that index look like then? Will it have 0 segments, or just 1 to hold the index metadata?
Another question regards force merge: if I choose to force merge to get rid of all the deleted documents on disk, force merging may result in a segment larger than 5 GB, as described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html#forcemerge-api-desc
Snippet:
Force merge should only be called against an index after you have finished writing to it. Force merge can cause very large (>5GB) segments to be produced, and if you continue to write to such an index then the automatic merge policy will never consider these segments for future merges until they mostly consist of deleted documents. This can cause very large segments to remain in the index which can result in increased disk usage and worse search performance.
When will my segment (greater than 5 GB) get merged automatically, if at all? The docs say it will when it consists mostly of deleted documents, but that's vague. What does "mostly" mean here? What's the threshold?
Another question: it is suggested that force merge should only be done on read-only indices. Why is that? How does it degrade performance? Coming back to the scenario, I will have some updates and new documents arriving on my warm-node indices even after I force merge them, but their frequency will be very low (say, less than 5% of the documents will be updated, and maybe a couple of hundred new documents added to those indices).
Also, what if I force-merge four 450 GB indices (each with 16 shards) in parallel; how will that affect my search speed? I read somewhere that by default each force merge request is executed in a single thread, and that it is throttled if need be. Does that mean merging is paused if search requests increase?
Thank you for your patience and time.
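As a hedged illustration of the settings question above, one way to inspect the effective merge-policy values with the Python client (the index name "my_index" is a placeholder, and the exact response layout can vary by version):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # include_defaults exposes settings you have not overridden, including
    # index.merge.policy.expunge_deletes_allowed.
    settings = es.indices.get_settings(index="my_index", include_defaults=True)
    print(settings["my_index"]["defaults"]["index"]["merge"]["policy"])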

Does a huge number of deleted docs affect ES query performance?

I have a few read-heavy indices in my ES cluster (I've started seeing performance issues on them) with ~50 million docs, and I noticed that most of them have around 25% of their total documents marked as deleted. I know that this deleted-document count decreases over time as background merge operations happen, but in my case the count stays at around ~25% of total documents, and I have the questions/concerns below.
Will this huge number of deleted documents affect search performance? They are still part of Lucene's immutable segments, search runs against all segments and the latest version of each document is returned, so the immutable segments are larger because they contain a huge number of deleted docs, and there is extra work to figure out the latest version of each doc.
Will the periodic merge operation take a lot of time and be inefficient when there is a huge number of deleted documents?
Is there any way to remove this huge number of deleted docs in one shot, since it looks like the background merge operation cannot keep up?
Thanks
Your deleted documents are still part of the index, so they impact search performance (though I can't tell you whether the impact is large).
As for the periodic merge: Lucene is "reluctant" to merge heavy segments, as doing so requires disk space and generates a lot of I/O.
You can get some precious insight into your segments thanks to the Index Segments API.
If you have segments close to the 5 GB limit, it is likely they won't be merged automatically until they consist mostly of deleted docs.
You can force a merge on your index with the force merge API.
Remember that a force merge can put significant stress on the cluster for huge indices. An option exists to only expunge deleted documents, which should reduce the burden:
only_expunge_deletes (Optional, boolean) If true, only expunge
segments containing document deletions. Defaults to false.
In Lucene, a document is not deleted from a segment; just marked as
deleted. During a merge, a new segment is created that does not
contain those document deletions.
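A minimal sketch of that option with the Python client (the index name "my_index" is a placeholder):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Rewrite only the segments that contain deletions, rather than
    # merging the whole index down to one segment.
    es.indices.forcemerge(index="my_index", only_expunge_deletes=True)

    # Compare live vs. deleted doc counts before and after.
    docs = es.indices.stats(index="my_index", metric="docs")
    print(docs["indices"]["my_index"]["primaries"]["docs"])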
Regards

Elasticsearch index size change

I inserted some data into Elasticsearch, and here is the result screen from es-head.
I am doing nothing, but the size keeps getting smaller. Why? And can it get bigger somehow?
Elasticsearch (or actually the underlying Apache Lucene) writes data immutably. The only way to decrease the file size is through a merge, which will happen automatically in the background.
Quoting from the documentation:
Smaller segments are periodically merged into larger segments to keep the index size at bay and to expunge deletes.
I would only expect bigger changes if you have lots of deleted documents. Note, however, that every update is treated as a (soft) delete plus an insert, due to the immutable nature of Lucene segments.
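To watch this happen, a sketch (the index name "my_index" is a placeholder) that lists the segments, so you can see small ones disappear into larger ones as background merges run:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # One row per segment: size, live docs, and deleted docs. Re-run after a
    # while and the segment count (and total size) should drop as merges land.
    print(es.cat.segments(index="my_index", v=True))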

Efficient way to search and sort data with Elasticsearch as a datastore

We are using Elasticsearch as a primary data store, and our indexing strategy is time-based (for example, we create an index every 6 hours; this is configurable). The search-and-sort queries that come to our application contain a time range, and based on the input time range we calculate which indices to search.
Now, if the input time range is large, say 6 months, and we delegate the search-and-sort query to Elasticsearch, then Elasticsearch will load all of the matching documents into memory, which could drastically increase the heap usage (we have a limit on the heap size).
One way to deal with this problem is to fetch the data index by index and sort it in our application; indices are opened/closed accordingly; for example, only the latest 4 indices are open at all times and the remaining indices are opened/closed as needed. I'm wondering if there is a better way to handle this problem.
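For illustration only, a sketch of the index-selection step described above, assuming six-hour indices named like events-2015-01-01-06 (the naming pattern is an assumption, not from the question):

    from datetime import datetime, timedelta

    def indices_for_range(start, end, hours=6, pattern="events-%Y-%m-%d-%H"):
        # Yield the name of every six-hour index overlapping [start, end).
        aligned = start.replace(minute=0, second=0, microsecond=0)
        aligned -= timedelta(hours=aligned.hour % hours)
        while aligned < end:
            yield aligned.strftime(pattern)
            aligned += timedelta(hours=hours)

    # Covers indices for 00:00, 06:00, and 12:00 on 2015-01-01.
    print(list(indices_for_range(datetime(2015, 1, 1, 5), datetime(2015, 1, 1, 18))))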
UPDATE
Option 1
Instead of opening and closing indices, you could experiment with limiting the field data cache size.
You could limit the field data cache to a percentage of the JVM heap size or to a specific size, for example 10 GB. Once field data is loaded into the cache it is not removed unless you specifically limit the cache size; putting a limit on it will evict the oldest data in the cache and so avoid an OutOfMemoryError.
You might not get great performance, but it might not be worse than opening and closing indices, and it would remove a lot of complexity.
Take into account that Elasticsearch loads all of the documents in the index when it performs a sort, so whatever limit you set must be big enough to load that index's field data into memory.
See limiting field data cache size
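The limit itself goes in elasticsearch.yml (the indices.fielddata.cache.size setting, e.g. "10%" or "10gb"). A sketch for checking current fielddata usage per node with the Python client (the cluster URL is a placeholder):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Per-field fielddata memory usage on each node.
    print(es.cat.fielddata(v=True))

    # Total fielddata memory per node, from the nodes stats API.
    stats = es.nodes.stats(metric="indices")
    for node in stats["nodes"].values():
        print(node["name"], node["indices"]["fielddata"]["memory_size_in_bytes"])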
Option 2
Doc Values
This means writing the necessary metadata to disk at index time, so the "fielddata" required for sorting lives on disk and not in memory. It is not hugely slower than in-memory fielddata, and it can in fact alleviate garbage-collection problems, since less data is loaded into memory. There are some limitations, such as string fields needing to be not_analyzed.
You could use a mixed approach: enable doc values on your older indices and use the faster, more flexible fielddata on current indices (if you can classify your indices that way). That way you don't penalize the queries on "active" data.
See Doc Values documentation
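A sketch of such a mapping for the Elasticsearch 1.x era this answer describes (index, type, and field names are placeholders; on modern versions doc values are enabled by default and string/not_analyzed no longer exist):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # A not_analyzed string field whose sort/aggregation data lives on disk.
    es.indices.create(
        index="old_index",
        body={
            "mappings": {
                "doc": {
                    "properties": {
                        "status": {
                            "type": "string",
                            "index": "not_analyzed",
                            "doc_values": True,
                        }
                    }
                }
            }
        },
    )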

Aggregations causing out of memory in Elasticsearch

I have an index with some 10 million records.
When I try to find the distinct values in one field (around 2 million of them), my Java process runs out of memory.
Can I implement scan-and-scroll on this aggregation to retrieve the same data in smaller parts?
Thanks
Check how much RAM you have allocated to Elasticsearch; since it is optimized to be super fast, it likes to consume lots of memory: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
I'm not sure if this applies to cardinality aggregations (or are you using a terms aggregation?), but I had some success with the "doc_values" fielddata format (see http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/); it takes more disk space but keeps less in RAM. How many distinct values do you have? Returning a JSON response for a terms aggregation with a million distinct values is going to be fairly big, whereas a cardinality aggregation just counts the number of distinct values without returning them individually.
You could also try re-indexing your data with a larger number of shards; shards that are too big don't perform as well as several smaller ones.
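If a distinct count is all that's needed, a sketch of the cardinality aggregation mentioned above (index and field names are placeholders; the count is approximate by design):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    resp = es.search(
        index="my_index",
        body={
            "size": 0,  # skip the hits; we only want the aggregation
            "aggs": {"distinct_values": {"cardinality": {"field": "my_field"}}},
        },
    )
    print(resp["aggregations"]["distinct_values"]["value"])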
