How to determine the amount of memory used by different Solr caches - caching

According to the Solr wiki (https://wiki.apache.org/solr/SolrCaching):
filterCache stores unordered sets of document IDs that match the key
queryResultCache stores ordered sets of document IDs
What is the document ID being referred to here? What is its size? Is it a boolean bit vector with a 1/0 for every document in the collection, such that its size is equivalent to total docs * 1 bit?
Also is there any way to get the exact size of each cache in bytes?
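For the second question, one place to look is Solr's mbeans admin handler, which exposes per-cache statistics. Below is a minimal sketch, assuming a core named mycore on the default port; the exact stat key names differ between Solr versions, and only newer releases report a byte count (ramBytesUsed) rather than just the entry count.

```python
import json
from urllib.request import urlopen

# Hypothetical host and core name; adjust to your setup.
SOLR = "http://localhost:8983/solr/mycore"

# The mbeans admin handler with cat=CACHE returns per-cache statistics.
# Newer Solr releases report ramBytesUsed per cache; older ones only
# expose the entry count ("size"), not the size in bytes.
url = SOLR + "/admin/mbeans?stats=true&cat=CACHE&wt=json"
data = json.load(urlopen(url))

# "solr-mbeans" is a flat list alternating category name and payload,
# e.g. ["CACHE", {"filterCache": {...}, "queryResultCache": {...}, ...}]
caches = data["solr-mbeans"][1]
for name, info in caches.items():
    stats = info.get("stats", {})
    # Stat key names vary between Solr versions, so match loosely.
    sizes = {k: v for k, v in stats.items()
             if "size" in k.lower() or "rambytesused" in k.lower()}
    print(name, sizes)
```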

Related

ElasticSearch: Explaining the discrepancy between sum of all document "_size" and "store.size_in_bytes" API endpoint?

I'm noticing that if I sum up the _size property of all my ElasticSearch documents in an index, I get a value of about 180 GB, but if I go to the _stats API endpoint for the same index I get a size_in_bytes value of about 100 GB for all primaries.
From my understanding the _size property should be the size of the _source field and the index currently stores the _source field, so should it not be at least as large as the sum of the _size?
The _size field seems to store the actual size of the source document. When actually storing the source in stored_fields, Elasticsearch compresses it (LZ4 by default, if I remember correctly), so I would expect it to take less space on disk than the raw source. And if the source doesn't contain any binary data, the compression ratio is going to be significantly higher too.
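For reference, a rough sketch of how the two numbers could be compared, assuming a hypothetical index name and default host, and assuming the _size field is enabled in the mapping so it can be summed with an aggregation:

```python
import json
from urllib.request import urlopen, Request

ES = "http://localhost:9200"   # hypothetical host
INDEX = "myindex"              # hypothetical index name

# Size of the index on disk (compressed stored fields, postings, etc.).
stats = json.load(urlopen("%s/%s/_stats/store" % (ES, INDEX)))
on_disk = stats["indices"][INDEX]["primaries"]["store"]["size_in_bytes"]

# Sum of the _size field -- assumes _size is enabled in the mapping, so
# each document records the byte length of its raw _source.
body = json.dumps({"size": 0,
                   "aggs": {"total": {"sum": {"field": "_size"}}}}).encode()
req = Request("%s/%s/_search" % (ES, INDEX), data=body,
              headers={"Content-Type": "application/json"})
agg = json.load(urlopen(req))
raw = agg["aggregations"]["total"]["value"]

print("raw _source bytes: %d, on disk: %d" % (raw, on_disk))
```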

Efficient way to search and sort data with elasticsearch as a datastore

We are using Elasticsearch as a primary data store to save data, and our indexing strategy is time-based (for example, we create an index every 6 hours; this is configurable). The search-sort queries that come to our application contain a time range, and based on the input time range we calculate which indices need to be used for searching the data.
Now, if the input time range is large - let's say 6 months - and we delegate the search-sort query to Elasticsearch, then Elasticsearch will load all the documents into memory, which could drastically increase the heap usage (we have a limitation on the heap size).
One way to deal with the above problem is to get the data index by index and sort it in our application; indices are opened/closed accordingly - for example, only the latest 4 indices are open all the time and the remaining indices are opened/closed based on need. I'm wondering if there is any better way to handle the problem at hand.
UPDATE
Option 1
Instead of opening and closing indexes you could experiment with limiting the field data cache size.
You could limit the field data cache to a percentage of the JVM heap size or to a specific size, for example 10 GB. Once field data is loaded into the cache it is not removed unless you specifically limit the cache size; setting a limit will evict the oldest data in the cache and so avoid an OutOfMemoryError.
You might not get great performance but then it might not be worse than opening and closing indexes and would remove a lot of complexity.
Take into account that Elasticsearch loads the field data for all of the documents in the index when it performs a sort, which means that whatever limit you set should be big enough to hold that field data in memory.
See limiting field data cache size
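As a sketch of what Option 1 could look like in practice: the limit itself is a node-level setting (shown below as a comment), and the nodes stats API lets you check how much fielddata is actually resident and whether evictions are happening. Host and values are assumptions, adjust to your cluster.

```python
import json
from urllib.request import urlopen

# The limit itself goes into elasticsearch.yml on each node, e.g.:
#   indices.fielddata.cache.size: 30%    # or an absolute value such as 10gb
# With a limit in place, the oldest fielddata entries are evicted instead
# of the cache growing until the heap blows up.

ES = "http://localhost:9200"  # hypothetical host

# Check how much fielddata is resident per node, and whether evictions
# are happening (a sign the limit is being hit).
stats = json.load(urlopen(ES + "/_nodes/stats/indices/fielddata"))
for node_id, node in stats["nodes"].items():
    fd = node["indices"]["fielddata"]
    print(node.get("name", node_id),
          "fielddata bytes:", fd["memory_size_in_bytes"],
          "evictions:", fd["evictions"])
```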
Option 2
Doc Values
This means writing the necessary metadata to disk at index time, so the "fielddata" required for sorting lives on disk and not in memory. It is not a huge amount slower than using in-memory fielddata and in fact can alleviate problems with garbage collection, since less data is loaded into memory. There are some limitations, such as string fields needing to be not_analyzed.
You could use a mixed approach and enable doc values on your older indexes and use faster and more flexible fielddata on current indexes (if you could classify your indexes in that way). That way you don't penalize the queries on "active" data.
See Doc Values documentation
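A minimal sketch of Option 2, assuming the pre-2.x era mapping syntax the answer describes (hypothetical index and field names; in this version doc values have to be enabled explicitly per field and string fields must be not_analyzed):

```python
import json
from urllib.request import urlopen, Request

ES = "http://localhost:9200"   # hypothetical host

# Example mapping for a new time-based index: the sort fields are stored
# as doc values on disk instead of in-memory fielddata.
mapping = {
    "mappings": {
        "event": {                       # hypothetical type name
            "properties": {
                "timestamp": {"type": "date", "doc_values": True},
                "user": {"type": "string", "index": "not_analyzed",
                         "doc_values": True}
            }
        }
    }
}

req = Request(ES + "/events-2015-01-01",
              data=json.dumps(mapping).encode(),
              headers={"Content-Type": "application/json"},
              method="PUT")
print(json.load(urlopen(req)))
```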

Mongodb collection _id

By default _id field is generated as new ObjectId(), which has 96 bits (12 bytes).
Does the _id size affect collection performance? What if I use 128-bit (160-bit or 256-bit) strings instead of the native ObjectId?
For query performance, it is unlikely to matter. The index on _id is a sorted index implemented as a B-tree, so the actual length of the values doesn't matter much.
Using a longer string as _id will of course make your documents larger. Larger documents mean that fewer documents will be cached in RAM, which will result in worse performance for larger databases. But when that string is a part of the document anyway, using it as _id would save space because you won't need an additional _id field anymore.
By default the _id field is indexed (it is the primary key), and if you use a custom value for it (say, a String), it will effectively just consume more space. It will not have any significant impact on your query performance; index size hardly contributes to query performance. You can verify this with sample code.
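To make the document-size point concrete, here is a small sketch using pymongo's bson module (the field values and the 256-bit hex key are made up for illustration):

```python
from bson import BSON, ObjectId

# Compare the BSON size of the same document with a default ObjectId _id
# versus a 64-character hex string (256 bits) as _id.
doc = {"user": "alice", "score": 42}

with_objectid = dict(doc, _id=ObjectId())
with_string = dict(doc, _id="f" * 64)      # hypothetical 256-bit hex key

print(len(BSON.encode(with_objectid)))     # 12-byte ObjectId
print(len(BSON.encode(with_string)))       # 64-byte string plus overhead
```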

aggregations causing out of memory elasticsearch

I have an index with some 10m records.
When I try to find the distinct values in one field (around 2m of them), my Java process runs out of memory.
Can I implement a scan and scroll on this aggregation to retrieve the same data in smaller parts?
Thanks
Check how much RAM you have allocated for Elasticsearch; since it is optimized to be super fast, it likes to consume lots of memory. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
I'm not sure if this applies to cardinality aggregations (or are you using a terms aggregation?), but I had some success with the "doc_values" fielddata format (see http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/); it takes more disk space but keeps less in RAM. How many distinct values do you have? Returning a JSON response for a terms aggregation with a million distinct values is going to be fairly big. A cardinality aggregation just counts the number of distinct values without returning the individual values.
You could also try re-indexing your data with a larger number of shards; shards that are too big don't perform as well as several smaller ones.
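If only the count of distinct values is needed, a cardinality aggregation avoids materialising the ~2m terms in the response. A minimal sketch, with hypothetical index and field names:

```python
import json
from urllib.request import urlopen, Request

ES = "http://localhost:9200"   # hypothetical host
INDEX = "myindex"              # hypothetical index name

# Cardinality aggregation: approximate count of distinct values, without
# returning the individual values themselves.
body = json.dumps({
    "size": 0,
    "aggs": {"distinct_values": {"cardinality": {"field": "myfield"}}}
}).encode()

req = Request("%s/%s/_search" % (ES, INDEX), data=body,
              headers={"Content-Type": "application/json"})
resp = json.load(urlopen(req))
print(resp["aggregations"]["distinct_values"]["value"])
```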

Same index data but different store size in ElasticSearch?

I am evaluating the storage size required by Elasticsearch. However, I find that the store size varies every time I index the same set of data.
For example, the size of the data I used is 35 MB. I ran the indexing several times, and the resulting store sizes were between 76 MB and 85 MB - not a fixed number (not repeatable?).
Can someone explain this? Thanks in advance:)
After you've inserted all of your data, have you tried running an optimize (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-optimize.html) to bring the number of segments down to 1?
Basically the time at which it does the Lucene segment merges causes the differences in sizes you are seeing. They are not deterministic because once the merge kicks off, the amount of data you insert before the merge completes affects the size of the remaining segments. You can read a little more about the segment merges here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-merge.html and here: Understanding Segments in Elasticsearch
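For completeness, a sketch of the suggestion above, assuming the pre-2.x _optimize endpoint the answer links to (later versions renamed it to _forcemerge) and a hypothetical index name:

```python
import json
from urllib.request import urlopen, Request

ES = "http://localhost:9200"   # hypothetical host
INDEX = "myindex"              # hypothetical index name

# Force-merge the index down to a single Lucene segment.
urlopen(Request("%s/%s/_optimize?max_num_segments=1" % (ES, INDEX), data=b""))

# After the merge, the reported store size should be much closer between
# repeated indexing runs of the same data.
stats = json.load(urlopen("%s/%s/_stats/store" % (ES, INDEX)))
print(stats["indices"][INDEX]["primaries"]["store"]["size_in_bytes"])
```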
