ElasticSearch: Explaining the discrepancy between sum of all document "_size" and "store.size_in_bytes" API endpoint? - elasticsearch

I'm noticing if I sum up the _size property of all my ElasticSearch documents in an index, I get a value of about 180 GB, but if I go to the _stats API endpoint for the same index I get a size_in_bytes value for all primaries to be 100 GB.
From my understanding, the _size property should be the size of the _source field, and the index currently stores the _source field, so should the store size not be at least as large as the sum of _size?

The _size field stores the actual, uncompressed size of the source document. When Elasticsearch stores the source in stored fields, it compresses it (LZ4 by default, if I remember correctly), so I would expect it to take less space on disk than the raw size. And if the source doesn't contain any binary data, the compression ratio is going to be significantly higher too.
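A quick way to see why compressed stored fields come out smaller than the raw _source: textual JSON compresses very well. The sketch below uses Python's zlib rather than Elasticsearch's actual LZ4 codec, purely to illustrate the effect; the document content is made up.

```python
import json
import zlib

# A made-up, repetitive JSON document, similar in character to typical _source data.
doc = json.dumps({
    "user": "alice", "action": "login", "status": "success",
    "tags": ["auth", "session", "auth", "session"],
}).encode("utf-8") * 50  # repeat so the compressor has redundancy to exploit

compressed = zlib.compress(doc)
ratio = len(compressed) / len(doc)
print(f"raw: {len(doc)} bytes, compressed: {len(compressed)} bytes, ratio: {ratio:.2f}")
```

Binary or already-compressed payloads would not shrink nearly as much, which matches the point above about sources without binary data compressing better.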

Related

Why did the size of the stored index change dramatically?

There is a company index, and today I noticed a strange phenomenon: the doc count does not change, but the size of the stored index changed dramatically from 22.03 GB to 27.14 GB and then instantly became 20.44 GB. Why is that? What is happening inside Elasticsearch?
More metrics
It seems the segment count and size both decreased. Could these metrics explain this phenomenon?

Efficient way to search and sort data with elasticsearch as a datastore

We are using Elasticsearch as a primary data store, and our indexing strategy is time based (for example, we create an index every 6 hours; this is configurable). The search-sort queries that come to our application contain a time range, and based on the input time range we calculate the indices that need to be used for searching.
Now, if the input time range is large - let's say 6 months - and we delegate the search-sort query to Elasticsearch, then Elasticsearch will load all of the documents into memory, which could drastically increase the heap size (we have a limitation on the heap size).
One way to deal with this problem is to fetch the data index by index and sort it in our application, opening and closing indices accordingly; for example, only the latest 4 indices are kept open at all times and the remaining indices are opened/closed as needed. I'm wondering if there is a better way to handle this problem.
UPDATE
Option 1
Instead of opening and closing indexes you could experiment with limiting the field data cache size.
You could limit the field data cache to a percentage of the JVM heap size or to a specific size, for example 10 GB. Once field data is loaded into the cache it is not evicted unless you explicitly limit the cache size; setting a limit will evict the oldest data in the cache and so avoid an OutOfMemoryError.
You might not get great performance but then it might not be worse than opening and closing indexes and would remove a lot of complexity.
Take into account that Elasticsearch loads all of the documents in the index when it performs a sort, so whatever limit you set should be big enough to load that index into memory.
See limiting field data cache size
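For reference, the cache limit from Option 1 is a one-line static node setting (`indices.fielddata.cache.size` is the actual setting name; the 10gb value is just an example):

```yaml
# elasticsearch.yml - cap the field data cache (node-level, static setting)
indices.fielddata.cache.size: 10gb
# or as a percentage of the JVM heap:
# indices.fielddata.cache.size: 40%
```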
Option 2
Doc Values
This means writing the necessary metadata to disk at index time, so the "fielddata" required for sorting lives on disk instead of in memory. It is not hugely slower than in-memory fielddata, and it can in fact alleviate garbage-collection problems, since less data is loaded into memory. There are some limitations, such as string fields needing to be not_analyzed.
You could use a mixed approach and enable doc values on your older indexes and use faster and more flexible fielddata on current indexes (if you could classify your indexes in that way). That way you don't penalize the queries on "active" data.
See Doc Values documentation
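For illustration, a mapping along the lines of Option 2 might look like this in the ES 1.x era this answer describes (the index, type, and field names here are hypothetical):

```json
PUT /logs-2015-01
{
  "mappings": {
    "event": {
      "properties": {
        "timestamp": { "type": "date",   "doc_values": true },
        "status":    { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}
```

Note that in modern Elasticsearch versions doc values are enabled by default, and not_analyzed strings became the keyword type.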

What is the maximum Elasticsearch document size?

I read notes about Lucene being limited to 2Gb documents. Are there any additional limitations on the size of documents that can be indexed in Elasticsearch?
Lucene internally uses a byte buffer addressed with 32-bit integers, which by definition limits the size of a document. So 2 GB is the theoretical maximum.
In ElasticSearch:
There is a maximum HTTP request size in the ES GitHub code, and it is set to Integer.MAX_VALUE, i.e. 2^31-1. So, basically, 2 GB is the maximum document size for bulk indexing over HTTP. Note also that ES does not start processing an HTTP request until it has been received in full.
Good Practices:
Do not use a very large java heap if you can help it: set it only as large as is necessary (ideally no more than half of the machine’s RAM) to hold the overall maximum working set size for your usage of Elasticsearch. This leaves the remaining (hopefully sizable) RAM for the OS to manage for IO caching.
On the client side, always use the bulk API, which indexes multiple documents in one request, and experiment with the right number of documents to send with each bulk request. The optimal size depends on many factors, but err in the direction of too few rather than too many documents. Use concurrent bulk requests with client-side threads or separate asynchronous requests.
For further study refer to these links:
Performance considerations for elasticsearch indexing
Document maximum size for bulk indexing over HTTP
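The bulk-sizing advice above can be sketched as a simple client-side batching helper (the batch size of 1000 here is just a starting point to experiment from, not a recommendation from the ES docs):

```python
from typing import Iterable, Iterator, List

def chunk_documents(docs: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield successive batches of at most batch_size documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield batch

# Each batch would then be sent as one _bulk request; start small and measure
# before increasing the batch size.
batches = list(chunk_documents(({"id": i} for i in range(2500)), 1000))
print([len(b) for b in batches])  # → [1000, 1000, 500]
```

Keeping batches well under the request-size limit also avoids tripping http.max_content_length on the server.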
Things seem to have changed slightly over the years with Elasticsearch. The 7.x documentation referenced here - General Recommendations - states:
Given that the default http.max_content_length is set to 100MB, Elasticsearch will refuse to index any document that is larger than that. You might decide to increase that particular setting, but Lucene still has a limit of about 2GB.
So it would seem that ES defaults to a limit of ~100 MB, while Lucene's is 2 GB, as the other answer stated.

How to determine amount of memory used by different solr caches

According to Solr wiki https://wiki.apache.org/solr/SolrCaching
filterCache stores unordered sets of document IDs that match the key
queryResultCache stores ordered sets of document IDs
What is the document id being referred to here? What is its size? Is it a boolean bit vector with 1/0 for all documents present in the collection, such that its size is equivalent to total docs * 1 bit?
Also is there any way to get the exact size of each cache in bytes?
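If, as the question speculates, a filterCache entry is a bit vector with one bit per document in the collection, its size is easy to estimate: roughly numDocs / 8 bytes per cached filter. A back-of-the-envelope sketch (this models only the bit-vector case; the cache may use sparser representations for small result sets):

```python
def bitset_bytes(num_docs: int) -> int:
    """Approximate size of a one-bit-per-document set, in bytes."""
    return (num_docs + 7) // 8  # round up to a whole byte

# For a 10-million-document collection, each cached filter would be roughly:
print(bitset_bytes(10_000_000))  # → 1250000 bytes, about 1.2 MB per entry
```

Multiplying that per-entry figure by the configured cache size gives a rough upper bound on the filterCache's memory footprint.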

Mongodb collection _id

By default _id field is generated as new ObjectId(), which has 96 bits (12 bytes).
Does _id size affect collection performance? What if I'll use 128 bits (160 bits or 256 bits) strings instead of native ObjectId?
For query performance, it is unlikely to matter. The index on _id is a sorted index implemented as a B-tree, so the actual length of the values doesn't matter much.
Using a longer string as _id will of course make your documents larger. Larger documents mean that fewer documents fit in RAM, which results in worse performance for larger databases. But when that string is part of the document anyway, using it as _id saves space, because you no longer need an additional _id field.
By default the _id field is indexed (it is the primary key), and if you use a custom value for it (say a String), it will in effect just consume more space. It will not have any significant impact on your query performance; index size hardly contributes to query performance. You can verify this with sample code.
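To put rough numbers on the space argument above: an ObjectId is a fixed 12 bytes, so swapping in a longer string _id adds a predictable per-document overhead. This sketch ignores BSON field overhead and index internals, so treat it as an order-of-magnitude estimate:

```python
OBJECT_ID_BYTES = 12  # a BSON ObjectId is always 12 bytes

def extra_id_bytes(custom_id_len: int, num_docs: int) -> int:
    """Rough extra storage from using a custom _id of custom_id_len bytes."""
    return (custom_id_len - OBJECT_ID_BYTES) * num_docs

# A 128-bit value stored as a 32-character hex string, over 10 million docs:
print(extra_id_bytes(32, 10_000_000))  # → 200000000, i.e. ~200 MB extra
```

Since the _id index stores the key values as well, the real overhead is roughly double this figure, yet still small relative to typical document payloads.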
