Mongodb collection _id - performance

By default the _id field is generated as a new ObjectId(), which is 96 bits (12 bytes).
Does the _id size affect collection performance? What if I use 128-bit (or 160-bit or 256-bit) strings instead of the native ObjectId?

For query performance, it is unlikely to matter. The index on _id is a sorted index implemented as a B-tree, so the actual length of the values doesn't matter much.
Using a longer string as _id will of course make your documents larger. Larger documents mean that fewer documents can be cached in RAM, which will result in worse performance for larger databases. But if that string is part of the document anyway, using it as _id would save space, because you won't need an additional _id field anymore.

By default the _id field is indexed (it is the primary key), and if you use a custom value for it (say, a String), it will effectively just consume more space. It will not have any significant impact on your query performance; index entry size hardly contributes to query performance. You can verify this with sample code.
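To get a feel for the space difference, you can compute the per-document cost of each _id choice directly from the BSON element layout (type byte, field name as a NUL-terminated cstring, then the value), without needing a running MongoDB instance. This is a back-of-the-envelope sketch based on the BSON specification:

```python
# Per-document cost of the _id element under the BSON wire format.
# Element layout: type byte + field name (cstring) + value.

def objectid_element_size(field_name: str = "_id") -> int:
    # type byte (0x07) + name + NUL terminator + 12-byte ObjectId
    return 1 + len(field_name) + 1 + 12

def string_element_size(value_len: int, field_name: str = "_id") -> int:
    # type byte (0x02) + name + NUL + int32 length prefix + bytes + NUL
    return 1 + len(field_name) + 1 + 4 + value_len + 1

print(objectid_element_size())   # 17 bytes for the default ObjectId
print(string_element_size(32))   # 42 bytes for a 128-bit value as a hex string
print(string_element_size(64))   # 74 bytes for a 256-bit value as a hex string
```

So a 256-bit hex string costs roughly 4x the bytes of an ObjectId per document (and per index entry), which matters for working-set size but, as noted above, not meaningfully for B-tree lookup cost.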

Related

What's the difference between fielddata enabled vs eager global ordinals for optimising initial search query latency in Elasticsearch

I have an elasticsearch (7.10) cluster running that is primarily meant for powering search on text documents. The index that I'm working with does not need to be updated often, and there is no great necessity for speed during index time. Performance in this system is really needed for search time. The number of documents will likely always be in the range of 50-70 million and the store size is ~300GB once it's all built.
The mapping for the index and field I'm concerned with looks something like this:
"mappings": {
  "properties": {
    "document_text": {
      "type": "text"
    }
  }
}
The document_text is a string of text anywhere in the region of 50-500 words. The typical queries being sent to this index are match queries chained together inside a boolean should query. Usually, the number of clauses are in the range of 5-15.
The issue I've been running into is that the initial latency for search queries to the index is very high, usually in the range of 4-6s, but after the first search the document is cached, so the latency drops to under 1s. The cluster has 3 data nodes, 3 master nodes and 2 ingest/client nodes, and is backed by fast SSDs. I noticed that the heap on the data nodes is never really under much pressure, nor is the RAM; this led me to realize that the documents weren't cached in advance the way I wanted them to be.
From what I've researched, one option is enabling fielddata=true to build the field data object in memory at index time rather than constructing it at search time. I understand this will increase pressure on the JVM heap, so I may do some frequency filtering to only place certain documents in memory. The other option I've come across is setting eager_global_ordinals=true, which in some ways seems similar to enabling fielddata, as it also builds the mappings in memory at index time. I'm a bit new to ES, and the terminology between the two is somewhat confusing to me. What I'd love to know is: what is the difference between the two, and does enabling one or both of them seem reasonable to solve the latency issues I'm having, or have I completely misunderstood the docs? Thanks!
Enabling eager_global_ordinals won't have any effect on your queries. Global ordinals only help for aggregations: they would be built at index refresh time instead of being built lazily at query time.
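For reference, this is where the setting would go in a mapping: it is applied per field, and it is typically used on a keyword field that is aggregated on (the "category" field name below is just a placeholder):

```json
{
  "mappings": {
    "properties": {
      "category": {
        "type": "keyword",
        "eager_global_ordinals": true
      }
    }
  }
}
```

Note that it applies to fields with doc values or fielddata (such as keyword fields), not to a plain text field used for full-text match queries like document_text above.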
Enabling fielddata would also not have any real effect on your queries. Its primary purpose is sorting and aggregation, which you don't really want to do on a text field.
There's probably not much you can do about the first ES queries being slower. Better to focus on optimal index mappings, settings, shards, and document sizes.

ElasticSearch: Explaining the discrepancy between sum of all document "_size" and "store.size_in_bytes" API endpoint?

I'm noticing that if I sum up the _size property of all my Elasticsearch documents in an index, I get a value of about 180 GB, but if I go to the _stats API endpoint for the same index I get a size_in_bytes value for all primaries of 100 GB.
From my understanding, the _size property should be the size of the _source field, and the index currently stores the _source field, so shouldn't the on-disk size be at least as large as the sum of the _size values?
The _size property stores the actual size of the source document. When actually storing the source in stored_fields, Elasticsearch compresses it (LZ4 by default, if I remember correctly), so I would expect it to take less space on disk than the raw size. And if the source doesn't contain any binary data, the compression ratio is going to be significantly higher too.
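The effect is easy to demonstrate: JSON text, especially with repetitive field names and natural-language content, compresses well. The sketch below uses zlib only because it ships with Python; Elasticsearch uses LZ4 by default, so the exact ratio will differ, but the direction is the same:

```python
import json
import zlib

# Why summed _source sizes can exceed the on-disk store size:
# stored fields are compressed, and repetitive JSON text compresses well.
doc = {"document_text": "the quick brown fox jumps over the lazy dog " * 50}
raw = json.dumps(doc).encode("utf-8")
compressed = zlib.compress(raw)

print(len(raw), len(compressed))
print(round(len(raw) / len(compressed), 1))  # ratio well above 1 for text
```

With real documents of 50-500 words the ratio will be lower than for this artificially repetitive string, but a 180 GB / 100 GB discrepancy is entirely consistent with stored-field compression.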

Efficient way to search and sort data with elasticsearch as a datastore

We are using Elasticsearch as a primary data store, and our indexing strategy is time based (for example, we create an index every 6 hours; this is configurable). The search-sort queries that come to our application contain a time range, and based on the input time range we calculate the indices that need to be searched.
Now, if the input time range is large - let's say 6 months - and we delegate the search-sort query to Elasticsearch, then Elasticsearch will load all the matching documents into memory, which could drastically increase heap usage (we have a limitation on heap size).
One way to deal with the above problem is to fetch the data index by index and sort it in our application; indices are opened/closed accordingly; for example, only the latest 4 indices are open at all times, and the remaining indices are opened/closed on demand. I'm wondering if there is any better way to handle the problem at hand.
UPDATE
Option 1
Instead of opening and closing indexes you could experiment with limiting the field data cache size.
You could limit the field data cache to a percentage of the JVM heap size or to a specific size, for example 10GB. Once field data is loaded into the cache, it is not removed unless you specifically limit the cache size. Setting a limit will evict the oldest data in the cache and so avoid an OutOfMemoryError.
You might not get great performance but then it might not be worse than opening and closing indexes and would remove a lot of complexity.
Take into account that Elasticsearch loads the field data for all documents in the index when it performs a sort, so whatever limit you set should be big enough to hold that index's field data in memory.
See limiting field data cache size
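The limit is a node-level setting. A sketch of the relevant elasticsearch.yml entry (the values are examples, not recommendations):

```yaml
# elasticsearch.yml - cap the field data cache so old entries are evicted
# instead of growing until the heap is exhausted.
indices.fielddata.cache.size: 10gb
# ...or express it as a percentage of the JVM heap:
# indices.fielddata.cache.size: 30%
```

It is unbounded by default, which is why sorting across many indices can exhaust the heap in the first place.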
Option 2
Doc Values
This means writing the necessary metadata to disk at index time, so the "fielddata" required for sorting lives on disk rather than in memory. It is not hugely slower than in-memory fielddata, and it can in fact alleviate problems with garbage collection, since less data is loaded into memory. There are some limitations, such as string fields needing to be not_analyzed.
You could use a mixed approach and enable doc values on your older indexes and use faster and more flexible fielddata on current indexes (if you could classify your indexes in that way). That way you don't penalize the queries on "active" data.
See Doc Values documentation
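A sketch of what such a mapping looked like in the Elasticsearch 1.x era this answer refers to (field names are hypothetical; in modern Elasticsearch, doc values are enabled by default for most field types):

```json
{
  "mappings": {
    "event": {
      "properties": {
        "timestamp": {
          "type": "date",
          "doc_values": true
        },
        "status": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```

Applying this only to the older, rarely-queried indices gives the mixed approach described above: disk-backed sorting for cold data, in-memory fielddata for hot data.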

How to determine amount of memory used by different solr caches

According to Solr wiki https://wiki.apache.org/solr/SolrCaching
filterCache stores unordered sets of document IDs that match the key
queryResultCache stores ordered sets of document IDs
What is the document id being referred to here? What is its size? Is it a boolean bit vector with a 1/0 for every document in the collection, such that its size is equivalent to total docs × 1 bit?
Also is there any way to get the exact size of each cache in bytes?
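Taking the question's own "total docs × 1 bit" hypothesis at face value, the arithmetic is easy to work out (note this is an upper bound: Solr can also store sparse filter results as sorted lists of doc ids, which is smaller when few documents match):

```python
# Size of a one-bit-per-document bitset, as hypothesized in the question.
def bitset_bytes(num_docs: int) -> int:
    # one bit per document in the index, rounded up to whole bytes
    return (num_docs + 7) // 8

print(bitset_bytes(1_000_000))    # 125000 bytes, ~122 KB per cached filter
print(bitset_bytes(50_000_000))   # 6250000 bytes, ~6 MB per cached filter
```

So even on a large index, each cached filter entry is modest on its own; the total depends on how many distinct filters the cache holds.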

aggregations causing out of memory elasticsearch

I have an index with some 10m records.
When I try to find the distinct values in one field (around 2m of them), my Java process runs out of memory.
Can I implement scan and scroll on this aggregation to retrieve the same data in smaller parts?
Thanks
Check how much RAM you have allocated for Elasticsearch; since it is optimized to be super fast, it likes to consume lots of memory. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
I'm not sure if this applies to cardinality aggregations (or are you using a terms aggregation?), but I had some success with the "doc_values" fielddata format (see http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/); it takes more disk space but keeps less in RAM. How many distinct values do you have? Returning a JSON response for a terms aggregation with a million distinct values is going to be fairly big. A cardinality aggregation just counts the number of distinct values without returning the values themselves.
You could also try re-indexing your data with a larger number of shards; shards that are too big don't perform as well as several smaller ones.
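If a count is all that's needed, a cardinality aggregation keeps memory bounded regardless of how many distinct values exist, because it uses an approximate algorithm rather than materializing every value. A sketch of the request body ("my_field" is a placeholder for the actual field name):

```json
{
  "size": 0,
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "my_field",
        "precision_threshold": 1000
      }
    }
  }
}
```

The precision_threshold setting trades memory for accuracy: counts below the threshold are close to exact, while larger cardinalities are approximated.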
