Term filter causes memory to spike drastically - elasticsearch

The term filter that is used:
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search' -d '{
  "filter": {
    "term": {
      "void": false
    }
  },
  "fields": [
    "user_id1",
    "user_name",
    "date",
    "status",
    "q1",
    "q1_unique_code",
    "q2",
    "q3"
  ],
  "size": 50000,
  "sort": [
    "date_value"
  ]
}'
The void field is a boolean field.
The index store size is 504 MB.
The Elasticsearch setup consists of a single node, and the index has a single shard and 0 replicas. The Elasticsearch version is 0.90.7.
The fields listed above are only the first 8; the actual query we execute lists 350 fields.
We noticed memory spiking by about 2-3 GB even though the store size is only 504 MB.
Running the query multiple times seems to continuously increase the memory usage.
Could someone explain why this memory spike occurs?

It's quite an old version of Elasticsearch.
You're returning 50,000 records in one request.
You're sorting those 50k records.
Your documents are pretty big: 350 fields.
Could you instead return a smaller number of records and then page through them?
Scan and scroll could help you (see the sketch below).
It's not clear whether you've indexed the individual fields; that could help, as reading the _source from disk may be incurring a memory overhead.
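As a rough sketch of what scan-and-scroll might look like against the index from the question (index and type names are taken from the query above; the per-shard page size of 500 and the 5-minute scroll window are arbitrary choices, the fields list is trimmed for brevity, and the exact way the scroll id is passed differs slightly between old versions). Note that a scan-type search returns results unsorted:

# Open a scan-type scroll; size is per shard, so each round trip returns up to size * number_of_shards hits
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search?search_type=scan&scroll=5m&size=500' -d '{
  "filter": {
    "term": {
      "void": false
    }
  },
  "fields": ["user_id1", "user_name", "date", "status"]
}'

# The response contains a _scroll_id; feed it back repeatedly until no more hits are returned
curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m' -d '<scroll_id from the previous response>'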

Related

ElasticSearch - slow aggregations and impact on performance of other operations

There is an aggregation to identify duplicate records:
{
  "size": 0,
  "aggs": {
    "myfield": {
      "terms": {
        "field": "myfield.keyword",
        "size": 250,
        "min_doc_count": 2
      }
    }
  }
}
However, it misses many duplicates because of the low size. The actual cardinality is over 2 million. If size is changed to the actual cardinality, or some other much larger number, all of the duplicate documents are found, but the operation takes 5x as long to complete.
If I change the size to a larger number, should I expect slow performance or other adverse effects on other operations while this is running?
Yes, the size param is critical to Elasticsearch aggregation performance. If you set it to a very big number like 10k (the limit set by Elasticsearch, which you can raise via search.max_buckets), it will have an adverse impact not only on the aggregation you are running but on all operations running in the Elasticsearch cluster.
As you are using a terms aggregation, which is a bucket aggregation, you can read more in the bucket aggregations documentation.
Note: the reason latency increases when you increase the size is that Elasticsearch has to do significant processing to create that many buckets and to compute the entries for those buckets.
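One way to avoid a single huge size, assuming a recent enough Elasticsearch (terms partitioning requires 5.2+, and the version is not stated in the question), is to split the field's terms into partitions and walk them one request at a time. The field name below is the one from the question; num_partitions is an arbitrary choice sized against the ~2 million cardinality:

{
  "size": 0,
  "aggs": {
    "myfield": {
      "terms": {
        "field": "myfield.keyword",
        "include": {
          "partition": 0,
          "num_partitions": 250
        },
        "size": 10000,
        "min_doc_count": 2
      }
    }
  }
}

Running the same request for partition 0 through 249 means each request only builds buckets for roughly 1/250th of the terms, which keeps memory use and latency per request bounded.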

Does "from" parameter in ElasticSearch Impact the ElasticSearch Cluster?

I have a large number of documents (around 34,719,074) in one type of an index (ES 2.4.4). While searching, my ES cluster comes under high load (search latency, CPU usage, JVM memory, and load average) when the "from" parameter is high (greater than 100,000, with the "size" parameter held constant). Any specific reason for it? My query looks like:
{
  "explain": false,
  "size": 100,
  "from": <>,
  "_source": {
    "excludes": [],
    "includes": [
      <around 850 fields>
    ]
  },
  "sort": [
    <sorting on a string field>
  ]
}
This is the classic problem of deep pagination. You may read the link on pagination in Elasticsearch. Essentially, getting the next set of documents after skipping 100,000 documents is a memory-intensive task, because to produce a result set beyond the first 100,000 documents, 100,000+ documents need to be fetched from each shard and then processed (ranking, sorting, etc.). Ranking/sorting over a smaller result set takes less time than doing it over a larger one.
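If the goal is to walk through the whole result set rather than to jump to an arbitrary page, one alternative available in 2.4 is the scroll API. This is only a sketch, not part of the answer above: the index name my-index and the fields field1, field2 are placeholders standing in for the real index and the ~850 included fields.

# Open a scroll context; each shard only ever materializes "size" hits per round trip
curl -XGET 'localhost:9200/my-index/_search?scroll=1m' -d '{
  "size": 100,
  "sort": ["_doc"],
  "_source": { "includes": ["field1", "field2"] }
}'

# Keep pulling pages using the _scroll_id returned by the previous call
curl -XGET 'localhost:9200/_search/scroll' -d '{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}'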

ElasticSearch circuit_breaking_exception (Data too large) with significant_terms aggregation

The query:
{
  "aggregations": {
    "sigTerms": {
      "significant_terms": {
        "field": "translatedTitle"
      },
      "aggs": {
        "assocs": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  },
  "size": 0,
  "from": 0,
  "query": {
    "range": {
      "timestamp": {
        "lt": "now+1d/d",
        "gte": "now/d"
      }
    }
  },
  "track_scores": false
}
Error:
{
  "bytes_limit": 6844055552,
  "bytes_wanted": 6844240272,
  "reason": "[request] Data too large, data for [<reused_arrays>] would be larger than limit of [6844055552/6.3gb]",
  "type": "circuit_breaking_exception"
}
Index size is 5G. How much memory does the cluster need to execute this query?
You can try to increase the request circuit breaker limit to 41% (default is 40%) in your elasticsearch.yml config file and restart your cluster:
indices.breaker.request.limit: 41%
Or if you prefer to not restart your cluster you can change the setting dynamically using:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent" : {
    "indices.breaker.request.limit" : "41%"
  }
}'
Judging by the numbers in the error (i.e. "bytes_limit": 6844055552, "bytes_wanted": 6844240272), you're only missing ~185 KB of heap, so increasing the limit by 1% to 41% should give your request breaker roughly 170 MB of additional headroom (your total heap is ~17 GB), which should be sufficient.
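The arithmetic behind that estimate, using only the numbers from the exception:

  missing heap   = bytes_wanted - bytes_limit = 6,844,240,272 - 6,844,055,552 ≈ 185 KB
  total heap     ≈ bytes_limit / 0.40 = 6,844,055,552 / 0.40 ≈ 17.1 GB
  extra headroom ≈ 1% of total heap ≈ 171 MB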
Just make sure to not increase this value too high, as you run the risk of going OOM since the request circuit breaker also shares the heap with the fielddata circuit breaker and other components.
I am not sure what you are trying to do, but I'm curious to find out. Since you get that exception, I can assume the cardinality of that field is not small. You are basically trying to see, I guess, the relationships between all the terms in that field, based on significance.
The first significant_terms aggregation will consider all the terms from that field and establish how "significant" they are (calculating frequencies of that term in the whole index and then comparing those with the frequencies from the range query set of documents).
After it does that (for all the terms), the second significant_terms aggregation repeats that first step, but now for each term individually, running another significant_terms computation inside it. That's going to be painful. Basically, you are computing number_of_terms * number_of_terms significant_terms calculations.
The big question is what are you trying to do?
If you want to see a relationship between all the terms in that field, that's going to be expensive for the reasons explained above. My suggestion is to run a first significant_terms aggregation, take the first 10 terms or so, and then run a second query with another significant_terms aggregation, but limit the terms by wrapping it in a parent terms aggregation that includes only those 10 terms from the first query (see the sketch below).
You can also take a look at the sampler aggregation and use it as the parent of your single significant_terms aggregation.
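A sketch of that two-step approach: the values in include are hypothetical placeholders standing in for the top terms returned by the first query, topTerms is just an aggregation name chosen here, and the field and range are taken from the query above.

{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "lt": "now+1d/d",
        "gte": "now/d"
      }
    }
  },
  "aggs": {
    "topTerms": {
      "terms": {
        "field": "translatedTitle",
        "include": ["term1", "term2", "term3"]
      },
      "aggs": {
        "assocs": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  }
}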
Also, I don't think increasing the circuit breaker limit is the real solution. Those limits were chosen for a reason. You can increase it, and maybe it will work, but it should make you ask yourself whether this is the right query for your use case (and it doesn't sound like it is). The limit value shown in the exception might not be the final one either: reused_arrays refers to a resizeable array class in Elasticsearch, so if more elements are needed, the array grows and you may hit the circuit breaker again at another value.
Circuit breakers are designed to deal with situations where request processing needs more memory than is available. You can set the limit with the following request:
PUT /_cluster/settings
{
  "persistent" : {
    "indices.breaker.request.limit" : "45%"
  }
}
You can get more information at:
https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index-modules-fielddata.html

ElasticSearch + Kibana - Unique count using pre-computed hashes

I want to perform a unique count on my ElasticSearch cluster.
The cluster contains about 50 million records.
I've tried the following methods:
First method
Mentioned in this section:
Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory.
Second method
Mentioned in this section:
Unless you configure Elasticsearch to use doc_values as the field data format, the use of aggregations and facets is very demanding on heap space.
My property mapping
"my_prop": {
"index": "not_analyzed",
"fielddata": {
"format": "doc_values"
},
"doc_values": true,
"type": "string",
"fields": {
"hash": {
"type": "murmur3"
}
}
}
The problem
When I use unique count on my_prop.hash in Kibana I receive the following error:
Data too large, data for [my_prop.hash] would be larger than limit
ElasticSearch has a 2 GB heap.
The above also fails for a single index with 4 million records.
My questions
Am I missing something in my configuration?
Should I scale up my machine? That does not seem like a scalable solution.
ElasticSearch query
Was generated by Kibana:
http://pastebin.com/hf1yNLhE
ElasticSearch Stack trace
http://pastebin.com/BFTYUsVg
That error says you don't have enough memory (more specifically, memory for fielddata) to store all the values of hash, so you need to take them off the heap and put them on disk, meaning using doc_values.
Since you are already using doc_values for my_prop I suggest doing the same for my_prop.hash (and, no, the settings from the main field are not inherited by the sub-fields): "hash": { "type": "murmur3", "index" : "no", "doc_values" : true }.
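Putting that suggestion together with the mapping from the question, the property might end up looking like this (a sketch only; changing the mapping of an existing field requires reindexing):

"my_prop": {
  "type": "string",
  "index": "not_analyzed",
  "doc_values": true,
  "fielddata": {
    "format": "doc_values"
  },
  "fields": {
    "hash": {
      "type": "murmur3",
      "index": "no",
      "doc_values": true
    }
  }
}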

Does the _id of a document affect scoring?

I add two identical documents where the only difference is the documents' _id (I restart the scenario for each of them and do not add them sequentially, to be sure my test is correct).
One of them changes the order of the results of this query and one of them does not:
GET index_for_test/business/_search
{
  "query": {
    "multi_match": {
      "query": "italian",
      "type": "most_fields",
      "fields": [ "name^2", "categories" ]
    }
  }
}
my original question was:
https://github.com/elastic/elasticsearch/issues/10341
as mentioned here: https://groups.google.com/forum/?fromgroups=&hl=en-GB#!topic/elasticsearch/VWqA_P4zzH8
my answer is in this documentation:
https://www.elastic.co/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch
Documents are spread across 5 shards by default, and queries run with an algorithm that scores documents within each shard and then fetches them. With small data sets this leads to inaccurate results, so if the database is small it is better to run your queries with search_type=dfs_query_then_fetch. However, that has scalability problems and should be changed once the data grows.
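For the query from the question, that just means adding the search_type parameter (same index and query as above):

GET index_for_test/business/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "multi_match": {
      "query": "italian",
      "type": "most_fields",
      "fields": [ "name^2", "categories" ]
    }
  }
}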
