Does field length affect elasticsearch performance?

Will the size of the Elasticsearch index decrease (and performance
increase due to the reduced memory footprint) if I shorten field
names?

Field names are stored in the field info files, which have the suffix .fnm; a field name is just a UTF-8 string there.
You could measure the current size of those files and estimate how much space shorter names would save, but I'm fairly sure the savings would be negligible, so there is very little sense in optimizing field names.
For example, my tiny playground index of about 500 MB with around 100-150 fields has a total field info (.fnm) size of 188 kB, which is roughly 0.04% of the total index size.
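If you want to check this on your own index, you can sum the sizes of the .fnm files on disk; here is a minimal sketch, assuming direct filesystem access to the Elasticsearch data path (the path below is hypothetical):

    # Sum the size of all field info (.fnm) files under an index data path
    # and compare it to the total index size. Adjust the path to your setup.
    from pathlib import Path

    data_path = Path("/var/lib/elasticsearch/nodes/0/indices")  # hypothetical

    fnm_bytes = sum(f.stat().st_size for f in data_path.rglob("*.fnm"))
    total_bytes = sum(f.stat().st_size for f in data_path.rglob("*") if f.is_file())

    print(f".fnm files: {fnm_bytes / 1024:.1f} kB "
          f"({100 * fnm_bytes / total_bytes:.3f}% of the index)")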

Related

Elasticsearch / Lucene: Determine total bytes used by field in index

I have a Lucene index (generated by ES 6.x) and I want to understand how much storage space (in bytes) each of my fields is consuming.
Does anyone know how I can figure out how many bytes it takes to store all of the values for a given field across all documents? I have the index opened up in Luke, but I am new to that tool and it's not clear to me how to answer this question with it. Is there a better tool for this?
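Newer Elasticsearch versions (7.15+) expose an analyze-disk-usage API that reports per-field storage directly; on a 6.x index you're limited to tools like Luke, but after an upgrade a sketch like this works (the index name is a placeholder):

    # Query the analyze-disk-usage API (Elasticsearch 7.15+) for per-field
    # storage. "my-index" is a placeholder; this is not available on 6.x.
    import requests

    resp = requests.post(
        "http://localhost:9200/my-index/_disk_usage",
        params={"run_expensive_tasks": "true"},
    )
    for field, stats in resp.json()["my-index"]["fields"].items():
        print(field, stats["total"])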

Caching in elasticsearch

In my index I set index.store.preload: ["*"]
in order to preload data into the filesystem cache. The entire index occupies 27 GB, about 45 GB is allocated to the cache on the system, and all of this memory is full; it turns out that not only the 27 GB index has been pulled into the cache, but something else as well. Is there a way to find out how much space the index will occupy in the cache in total? Also, I don't understand the difference between the filesystem cache and indices.fielddata.cache. Which one is more useful for faster search? Does it make sense to use both options?
Data on disk is markedly compressed before being synced, so there is always some noticeable inflation after it is loaded into memory. Since the exact inflation rate depends on data types, field count, and index options, it's fundamentally impossible to give an accurate estimate of the final size in memory.
Fielddata is Elasticsearch's in-memory counterpart to Lucene doc values, used for sorting and aggregations. If you just want to run plain search queries, this cache is of little help. By comparison, the filesystem cache is always used by Elasticsearch and forms its foundation, in both the query and the filter context.
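To see how much memory fielddata actually holds (as opposed to the kernel-managed filesystem cache, which Elasticsearch APIs cannot inspect), the cat API helps; a small sketch:

    # Report per-node, per-field fielddata memory. This is separate from
    # the OS filesystem cache that index.store.preload: ["*"] fills; the
    # latter is managed by the kernel, not by Elasticsearch.
    import requests

    print(requests.get("http://localhost:9200/_cat/fielddata?v&bytes=b").text)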

Efficient way to search and sort data with elasticsearch as a datastore

We are using Elasticsearch as a primary data store, and our indexing strategy is time-based (for example, we create an index every 6 hours - configurable). The search-sort queries that come to our application contain a time range, and based on that input range we calculate which indices need to be searched.
Now, if the input time range is large - say 6 months - and we delegate the search-sort query to Elasticsearch, then Elasticsearch will load all the documents into memory, which could drastically increase the heap size (we have a limitation on the heap size).
One way to deal with the above problem is to get the data index by index and sort it in our application; indices are opened/closed accordingly - for example, only the latest 4 indices are open at all times, and the remaining indices are opened/closed as needed. I'm wondering if there is any better way to handle the problem at hand.
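For illustration, the "calculate the indices from the time range" step could look like the sketch below, assuming a hypothetical events-YYYY.MM.DD-HH naming scheme with 6-hour buckets:

    # Compute the names of the time-bucketed indices covering a search
    # range, so the query only targets relevant indices. The naming scheme
    # is a made-up example, not taken from the question.
    from datetime import datetime, timedelta

    def indices_for_range(start, end, bucket_hours=6):
        # Align the start down to the nearest bucket boundary.
        t = start.replace(minute=0, second=0, microsecond=0)
        t -= timedelta(hours=t.hour % bucket_hours)
        names = []
        while t <= end:
            names.append(t.strftime("events-%Y.%m.%d-%H"))
            t += timedelta(hours=bucket_hours)
        return names

    print(indices_for_range(datetime(2016, 1, 1, 5), datetime(2016, 1, 1, 23)))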
UPDATE
Option 1
Instead of opening and closing indexes you could experiment with limiting the field data cache size.
You could limit the field data cache to a percentage of the JVM heap or to a specific size, for example 10GB. Once field data is loaded into the cache it is not removed unless you explicitly limit the cache size; setting a limit will evict the oldest data in the cache and so avoid an OutOfMemoryError.
You might not get great performance, but it probably won't be worse than opening and closing indexes, and it would remove a lot of complexity.
Take into account that Elasticsearch loads the field values of all documents in the index when it performs a sort, so whatever limit you set should be big enough to hold that data in memory.
See limiting field data cache size
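A sketch of the limit and a way to watch it work (the 10GB figure just mirrors the example above):

    # indices.fielddata.cache.size is a static node setting, so it belongs
    # in elasticsearch.yml rather than the cluster settings API, e.g.:
    #
    #     indices.fielddata.cache.size: 10gb
    #
    # Once set, you can watch usage and evictions per node:
    import requests

    stats = requests.get(
        "http://localhost:9200/_nodes/stats/indices/fielddata").json()
    for node in stats["nodes"].values():
        fd = node["indices"]["fielddata"]
        print(node["name"], fd["memory_size_in_bytes"], fd["evictions"])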
Option 2
Doc Values
Doc values mean writing the necessary metadata to disk at index time, so the "fielddata" required for sorting lives on disk and not in memory. They are not hugely slower than in-memory fielddata, and in fact they can alleviate garbage-collection problems since less data is loaded onto the heap. There are some limitations, such as string fields needing to be not_analyzed.
You could use a mixed approach and enable doc values on your older indexes and use faster and more flexible fielddata on current indexes (if you could classify your indexes in that way). That way you don't penalize the queries on "active" data.
See Doc Values documentation
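For reference, this is roughly what enabling doc values looked like in the mapping syntax of that era (ES 1.x/2.x; the index, type, and field names are made up):

    # Enable on-disk doc values for a sortable, not_analyzed string field.
    # In modern Elasticsearch doc values are on by default; this explicit
    # legacy syntax matches the versions the answer refers to.
    import requests

    mapping = {
        "properties": {
            "user_id": {
                "type": "string",
                "index": "not_analyzed",
                "doc_values": True,
            }
        }
    }
    requests.put("http://localhost:9200/old-index/_mapping/events", json=mapping)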

Mongodb collection _id

By default the _id field is generated as new ObjectId(), which is 96 bits (12 bytes).
Does the _id size affect collection performance? What if I use 128-bit (or 160-bit, or 256-bit) strings instead of the native ObjectId?
For query performance, it is unlikely to matter. The index on _id is a sorted index implemented as a B-tree, so the actual length of the values doesn't matter much.
Using a longer string as _id will of course make your documents larger. Larger documents mean that fewer documents fit in the RAM cache, which results in worse performance for larger databases. But when that string is part of the document anyway, using it as _id would save space, because you won't need an additional _id field anymore.
By default the _id field is indexed (it is the primary key), and if you use a custom value for it (say, a string), it will effectively just consume more space. It will not have any significant impact on your query performance; index size hardly contributes to query performance. You can verify this with sample code.
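A quick way to verify it yourself; a sketch using PyMongo and the collStats command (the database and collection names are made up):

    # Insert the same documents with default ObjectId _ids and with 256-bit
    # hex-string _ids, then compare total index sizes via collStats.
    from pymongo import MongoClient
    import secrets

    db = MongoClient()["idtest"]  # hypothetical database
    db.native.drop()
    db.custom.drop()

    for i in range(10_000):
        db.native.insert_one({"n": i})  # default ObjectId _id (96 bits)
        db.custom.insert_one({"_id": secrets.token_hex(32), "n": i})  # 256 bits

    for name in ("native", "custom"):
        print(name, db.command("collstats", name)["totalIndexSize"])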

Index linear growth - Performance degradation

We have 4 shards, with a 14GB index on each of them.
Each shard has a master and 3 slaves (each of them with 32GB RAM).
We're expecting the index size to double or triple in the near future.
So we thought of merging our indexes so that each shard holds a 28GB index, and we also increased the RAM on each slave to 48GB.
We made these changes locally and tested by sending the same 10K realistic queries to each server (one with the 14GB index and one with the 28GB index). We found that:
For server with 14GB index (48GB RAM): search time was 480ms, number of index hits: 3.8G
For server with 28GB index (48GB RAM): search time was 900ms, number of index hits: 7.2G
So we saw that having the whole index in RAM doesn't help sustain performance in terms of search time: search time roughly doubled when the index size doubled.
We were thinking of keeping the 4-shard configuration, but it looks like now we have to add another shard, or another slave to each shard.
Is there any other way that we can configure our servers so that the performance isn't affected even when index size doubles or triples?
I'd hate to say it depends, but it... depends.
The total size of your index on each shard is 14GB, which by itself doesn't mean much of anything to Solr. To get a real feel for performance, ask how unique the indexed terms are: an index built from 14GB of data containing the single word "cat" over and over again will be really quick.
Also, have you confirmed you need the following features? Disabling them can boost performance by large amounts:
Schema
Stored Fields
Do you need stored fields? Removing them can greatly increase performance (you can safely have an entire index without any stored fields and rely completely on facets, pivots, and other features in Solr to drive a UX).
omitNorms
You can, in some instances, set this flag to true to reduce memory use in general and increase performance.
omitTermFreqAndPositions
Term frequencies and positions can likewise be omitted, which reduces memory use in general and increases performance, though it disables phrase queries on that field; see the sketch after this list.
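As a sketch, all three of these are per-field attributes in schema.xml (the field name and type here are made up):

    <!-- schema.xml: a field that skips stored values, norms, and term
         frequencies/positions to minimize memory use and index size -->
    <field name="title" type="string" indexed="true" stored="false"
           omitNorms="true" omitTermFreqAndPositions="true"/>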
System
Optimize Core/Index (Segment Count)
Index optimization is important when dealing with larger index sizes. Ensure each core is optimized and that, when you look at the core, the segment count is 1. I've found this plays a more important role as the index size increases (it ties into OS-level file caching and the fact that it's easier to read one large file than many small files), even on cores with 171 million+ documents.
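Triggering an optimize (a force-merge down to one segment) goes through the update handler; a sketch with a placeholder core name:

    # Ask Solr to merge the core's segments down to one. This is I/O-heavy,
    # so run it off-peak; "mycore" is a placeholder.
    import requests

    requests.get("http://localhost:8983/solr/mycore/update",
                 params={"optimize": "true", "maxSegments": "1"})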
Term Index Interval/Frequency
Configuring the term index interval may be required (by default 256) if you have one or more fields containing very unique values (for example GUIDs/UUIDs, or unique IDs in general). Typically, the lower the term index interval, the more memory you need; the higher it is, the less memory you need, but the more disk seeks you may incur.
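In the Solr versions this answer targets, the interval was set in solrconfig.xml; a sketch (newer Lucene versions removed this setting):

    <!-- solrconfig.xml: raise the term index interval to trade memory
         for extra disk seeks on high-cardinality fields (e.g. UUIDs) -->
    <indexConfig>
      <termIndexInterval>512</termIndexInterval>
    </indexConfig>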
Allocating too much RAM
Solr works best with a good split between the OS-level disk cache and the RAM used for things like faceting. You'd be surprised: you can actually get better performance by tweaking other parameters that lower the required RAM usage and free up resources for the disk cache.
