aggregations causing out of memory in elasticsearch

I have an index with some 10m records.
When I try to find the distinct values in one field (around 2m of them), my Java client runs out of memory.
Can I implement scan and scroll on this aggregation to retrieve the same data in smaller parts?
Thanks

Check how much RAM you have allocated for Elasticsearch; since it is optimized to be super fast, it likes to consume lots of memory. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
I'm not sure if this applies to cardinality aggregations (or are you using a terms aggregation?), but I had some success with the "doc_values" fielddata format (see http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/); it takes more disk space but keeps less in RAM. How many distinct values do you have? Returning a JSON response for a terms aggregation with a million distinct values is going to be fairly big. A cardinality aggregation just counts the number of distinct values without returning the values themselves.
You could also try re-indexing your data with a larger number of shards; shards that are too big don't perform as well as several smaller ones.
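For reference, if the count is all you need, a cardinality aggregation keeps the response small because no values are returned (a sketch; the field name is a placeholder, and the count is approximate by design). This would be the body of a _search request against your index:
{
  "size": 0,
  "aggs": {
    "distinct_values": {
      "cardinality": { "field": "my_field" }
    }
  }
}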

Related

What's the difference between Fielddata enabled vs eager global ordinals for optimising initial search query latency in Elasticsearch

I have an elasticsearch (7.10) cluster running that is primarily meant for powering search on text documents. The index that I'm working with does not need to be updated often, and there is no great necessity for speed during index time. Performance in this system is really needed for search time. The number of documents will likely always be in the range of 50-70 million and the store size is ~300GB once it's all built.
The mapping for the index and field I'm concerned with looks something like this:
"mappings": {
"properties": {
"document_text": {
"type": "text"
}
}
}
The document_text is a string of text anywhere in the region of 50-500 words. The typical queries being sent to this index are match queries chained together inside a boolean should query. Usually, the number of clauses is in the range of 5-15.
The issue I've been running into is that the initial latency for search queries to the index is very high, usually in the range of 4-6s, but after the first search the document is cached so the latency becomes much lower, <1s. The cluster has 3 data nodes, 3 master nodes and 2 ingest/client nodes and is backed by fast SSDs. I noticed that the heap on the data nodes is never really under too much pressure, nor is the RAM, which led me to realize that the documents weren't cached in advance the way I wanted them to be.
From what I've researched, one option is enabling fielddata=true to get the field data object in memory at index time rather than constructing it at search time. I understand this will increase pressure on the JVM heap, so I may do some frequency filtering to only place certain documents in memory. The other option I've come across is setting eager_global_ordinals=true, which in some ways seems similar to enabling fielddata, as it also builds the mappings in memory at index time. I'm a bit new to ES and the terminology between the two is somewhat confusing to me. What I'd love to know is: what is the difference between the two, and does enabling one or both of them seem reasonable to solve the latency issues I'm having, or have I completely misunderstood the docs? Thanks!
Enabling eager_global_ordinals won't have any effect on your queries. Eager global ordinals only help for aggregations: the ordinals are loaded at index refresh time instead of being loaded at query time.
Enabling fielddata would also not have any real effect on your queries. Its primary purpose is sorting and aggregations, which you don't really want to do on a text field.
There's probably not much you can do about the first ES queries being slower. Better to focus on optimal index mappings, settings, shards, and document sizes.
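For completeness, this is roughly how the two settings are switched on in a mapping (a sketch; document_tag is a made-up keyword field added only to show the syntax, and as noted above neither setting is likely to help match queries on a text field):
"mappings": {
  "properties": {
    "document_text": {
      "type": "text",
      "fielddata": true
    },
    "document_tag": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
eager_global_ordinals is normally set on keyword fields (or on text fields that also have fielddata enabled), since global ordinals are built on top of the field's doc values or fielddata.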

consequences of increasing max_result_window on elastic search

We have an index whose default max_result_window was set to 10,000, but our data is growing and we expect to have more than 1 million docs in it. One of our requirements is to scroll through all the data from start to end, 1,000 docs per batch. Our documents are not very big; I'll write down one example below:
{
  "serp_query": "c=44444&ct=333333",
  "uid": "5815697",
  "notify_status": 0,
  "created_at": "2018-02-04 10:00:00"
}
I've set max_result_window to 10,000,000, but at this time we have almost 50K docs in our index. I've read some texts about the consequences of this increase:
Values higher than that can consume significant chunks of heap memory per
search and per shard executing the search. It's safest to leave this
value as it is and use the scroll api for any deep scrolling
https://www.elastic.co/guide/en/elasticsearch/reference/2.x/breaking_21_search_changes.html#_from_size_limits
But our documents are not too big and our Elastic server has 16GB of dedicated RAM, so I guess there is no problem.
I'm writing to ask two questions:
According to the sample doc (all our docs should have the same fields), how big could this get for one million docs? I mean, how much heap memory will be needed to handle this?
Is it a very bad solution that will face us with big problems in the future? Should we use scrolling instead of offset and size?
Our query is not very complicated: loop over all the data ordered by "created_at" descending and get 1,000 docs in each batch.
FYI: our elastic search engine version is 2.7
Just to share the result with others:
If the documents are not very big and your queries are not very complicated, increasing max_result_window does not have a big effect on performance.
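For comparison, deep paging with the scroll api mentioned in the quoted docs would look roughly like this on a 2.x cluster (a sketch; the index name is a placeholder):
POST /my_index/_search?scroll=1m
{
  "size": 1000,
  "sort": [ { "created_at": "desc" } ],
  "query": { "match_all": {} }
}
Each following batch is then fetched with the scroll_id returned by the previous response:
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id from the previous response>"
}
This way only 1,000 documents are materialized per request, regardless of how deep into the result set you are.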

Efficient way to search and sort data with elasticsearch as a datastore

We are using elasticsearch as a primary data store to save data, and our indexing strategy is time based (for example, we create an index every 6 hours - configurable). The search-sort queries that come to our application contain a time range, and based on the input time range we calculate which indices need to be used for searching the data.
Now, if the input time range is large - let's say 6 months - and we delegate the search-sort query to elasticsearch, then elasticsearch will load all the documents into memory, which could drastically increase the heap usage (we have a limitation on the heap size).
One way to deal with the above problem is to get the data index by index and sort it in our application; indices are opened/closed accordingly - for example, only the latest 4 indices are open all the time and the remaining indices are opened/closed based on need. I'm wondering if there is any better way to handle the problem at hand.
UPDATE
Option 1
Instead of opening and closing indexes you could experiment with limiting the field data cache size.
You could limit the field data cache to a percentage of the JVM heap size or a specific size, for example 10Gb. Once field data is loaded into the cache it is not removed unless you specifically limit the cache size. Putting a limit will evict the oldest data in the cache and so avoid an OutOfMemoryException.
You might not get great performance but then it might not be worse than opening and closing indexes and would remove a lot of complexity.
Take into account that Elasticsearch loads the field data for all of the documents in the index when it performs a sort, so whatever limit you set should be big enough to hold that field data in memory.
See limiting field data cache size
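As a concrete illustration, the limit is a node-level setting in elasticsearch.yml (the value here is just an example; it can also be an absolute size such as 10gb):
indices.fielddata.cache.size: 30%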
Option 2
Doc Values
This means writing the necessary metadata to disk at index time, so the "fielddata" required for sorting lives on disk instead of in memory. It is not hugely slower than using in-memory fielddata and in fact can alleviate problems with garbage collection, as less data is loaded into memory. There are some limitations, such as string fields needing to be not_analyzed.
You could use a mixed approach and enable doc values on your older indexes and use faster and more flexible fielddata on current indexes (if you could classify your indexes in that way). That way you don't penalize the queries on "active" data.
See Doc Values documentation
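A mapping that enables doc values for the sort fields might look something like this in the older string/not_analyzed syntax the answer refers to (a sketch; the field names are placeholders and the exact syntax depends on your Elasticsearch version):
"properties": {
  "created_at": {
    "type": "date",
    "doc_values": true
  },
  "status": {
    "type": "string",
    "index": "not_analyzed",
    "doc_values": true
  }
}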

Increase Solr performance when querying a subset of documents

The Usecase
I have an index of potentially millions of documents. I want to make around 20'000 searches on a subset of these documents (around 25'000 documents). These 25'000 documents could take up around 100 MB stored in Solr (consisting of stored and indexed text fields).
The Problem
As the number of indexed documents increases, the performance of the queries decreases a lot. For example running 20'000 searches that hit 25'000 documents on 100'000 document index takes around 4 minutes. Running the same searches on 200'000 document index takes around 20 minutes.
So is there any way to cache these 25'000 documents in RAM before hitting them with searches?
UPDATE
Some things that really helped:
reducing the returned row count (In almost all cases I had to iterate through the returned results, and in almost all cases there were no more than 100 matching results, but I had set rows to a very large value. Reducing the row count improved performance around 2x. This seemed counter-intuitive: if there are only 79 matches and I set the returned row count to 100, it performs better than when there are 79 matches and I set the row count to 1000. In the first case Solr already returns the found item count, and does it fast. Why should there be a performance difference?)
reducing multithreading (I had added multiple threads for querying because on the development box there were more resources available. On the resource constrained production box it was slowing things down. Using only one or two threads got me around 2x speed improvement.)
Some things that did not really help:
splitting up filter queries (I was already using filter queries (fq) everywhere it was possible, but I was combining them into one fq per query: fq=name:a AND type:b. Splitting them up as fq=name:a&fq=type:b caches them separately (see the Apache Solr documentation) and could improve performance, but it did not make a huge difference in this case; see the example after this list.)
changing caching settings (in this case filterCache seemed to have the most potential; however, increasing it or changing its settings did not make a huge difference)
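For illustration, the two fq styles mentioned above look like this as request parameters (the q parameter here is a placeholder; name:a and type:b are the filters from the example above):
/select?q=text:foo&fq=name:a AND type:b&rows=100
/select?q=text:foo&fq=name:a&fq=type:b&rows=100
The first form is cached as a single filterCache entry for the combined expression; the second caches name:a and type:b separately, so each can be reused by other queries.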
A few things that are recommended for performance:
Have enough spare RAM on the box so index files can be in OS cache
Try to play around with solr caching settings in SolrConfig
Play around with autowarming after commits
Try to develop your queries to limit the result set. Large result sets, specifically if using grouping and faceting, will kill performance. A 200,000 document index is really quite small, so you should not have any problems, but I thought I'd mention this for when you scale.
Try to use filter queries (fq) whenever possible. They are much faster than doing field:val in q, plus they are cached.

Elasticsearch - implications of splitting documents into separate indexes

Let's say I have 100,000 documents from different customer groups, which are formatted the same with the same type of information.
Documents from individual customer groups get refreshed at different times of the day. It's been recommended that I give each customer group its own index, so that when an individual customer's data is refreshed locally I can create a new index for that customer and delete the old one.
What are the implications for splitting the data into multiple indexes and querying using an alias? Specifically:
Will it increase my server HDD requirements?
Will it increase my server RAM requirements?
Will elasticsearch be slower when it has to search across all the indexes by querying the alias?
Thank you for any help or advice.
Every index has some overhead on all levels but it's usually small. For 100,000 documents I would question the need for splitting unless these documents are very large. In general each added index will:
Require some amount of RAM for insert buffers and other per-index related tasks
Have its own merge overhead on disk relative to a larger single index
Provide some latency increase at query time due to result merging if a query spans multiple indexes
There are a lot of factors that go into determining if any of these are significant. If you have lots of RAM and several CPUs and SSDs then you may be fine.
I would advise you to build a solution that uses as few shards as possible. That probably means one index (or at least only a few indexes).
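If you do go with per-customer indexes, the refresh/swap step is usually done atomically through the alias API, roughly like this (a sketch; the index and alias names are made up):
POST /_aliases
{
  "actions": [
    { "add": { "index": "customer_a_new", "alias": "all_customers" } },
    { "remove": { "index": "customer_a_old", "alias": "all_customers" } }
  ]
}
Both actions are applied as a single atomic operation, so queries that always go through the all_customers alias never see a gap while an individual customer's index is being rebuilt and the old one deleted.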
