A Range Search Query is causing Garbage Collection in Elastic Search - elasticsearch

I have an Elastic Search 5.2 cluster with 16 nodes (13 data nodes/3 master/24 GB RAM/12 GB Heap). I am performance testing a query by issuing 50 searches per second against the cluster. My query looks like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "cust_id": "AC-90-464690064500"
          }
        },
        {
          "range": {
            "yy_mo_no": {
              "gt": 201701,
              "lte": 201710
            }
          }
        }
      ]
    }
  }
}
My index mapping is like the following -
cust_id Keyword
smry_amt Long
yy_mo_no Integer // doc_values enabled
mkt_id Keyword
. . .
. . .
currency_cd Keyword // Total 10 field with 8 Keyword type
The index contains 200 million records and for each cust_id, there may be 100s of records. Index has 2 Replicas. The record size is under 100 bytes.
When I run the performance test for 10 minutes, query responses are very slow. Looking more closely at the Kibana monitoring tab, it appears that there is a lot of garbage collection activity (please see the image below):
I have several questions. I did some research on range queries but did not find much on what can cause GC activity in scenarios like mine. I also read up on memory usage and GC activity, but most of the Elastic documentation says that young-generation GC is normal during indexing, while search activity mostly uses the file system cache that the OS maintains. That is why, in the chart above, the heap is not heavily used: search was using the file system cache.
So:
What might be causing the garbage collection here?
The chart shows that heap is still available to Elasticsearch, and used heap is still small compared to what is available. So what is triggering GC?
Is the query type causing some internal data structure to be created and then discarded, causing GC?
The CPU spike may be due to GC activity.
Is there a more efficient way of running the range query in Elasticsearch versions before 5.5?
Profiling the query shows that Elasticsearch is running a TermQuery and a BooleanQuery, with the latter costing the most.
Any idea what's going on here?
Thanks in Advance,
SGSI.

The correct answer depends on your index settings, but I guess you are using the integer type with doc values enabled. This data structure is meant to support aggregations and sorting, not range queries. The right data type is range.
With doc values, Elasticsearch/Lucene iterates over ALL documents (i.e. a full scan) to match the range query. This requires reading and decoding every value from the doc-values column, which is quite expensive, especially when the index cannot be cached by the operating system.
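Independently of the field type, one thing that could be tried is moving both clauses from must into filter context, where Elasticsearch skips scoring and may cache clause results. This is not part of the answer above and is not guaranteed to remove the GC pressure. The following is a minimal sketch that only builds the request body, using the field names from the question; the function name build_filtered_search is made up for illustration.

```python
import json


def build_filtered_search(cust_id, month_gt, month_lte):
    """Build a search body with both clauses in filter context (no scoring)."""
    return {
        "query": {
            "bool": {
                # "filter" clauses must match but do not contribute to the score.
                "filter": [
                    {"term": {"cust_id": cust_id}},
                    {"range": {"yy_mo_no": {"gt": month_gt, "lte": month_lte}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    body = build_filtered_search("AC-90-464690064500", 201701, 201710)
    print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the _search endpoint in place of the original body; comparing the profile output of both versions would show whether the BooleanQuery cost changes.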

Related

optimize elastic search performance

I am trying to benchmark my elastic search setup by posting documents against a large schema. The two variations of schema are:
indexing enabled for each attribute.
indexing disabled for all the attributes.
My benchmark consists only of opening ElasticHQ for the cluster and checking for CPU spikes.
However, I don't see the CPU spikes dropping with option 2.
Question: Should disabling indexing result in better performance?
Setup:
Running Elasticsearch in Docker with 1 shard and 1 replica for the index.
Schema with index enabled: https://pastebin.com/uXFkCCzY
Schema with index disabled: https://pastebin.com/FGSAFTMT
Document:
{
"status": "open",
"created_at": "2022-02-14",
"long_12": 123456789,
"division": {
"prop_1": 112211,
"prop_2": false,
"currency": "a brief text"
},
"emails":{
"email": "abc#gmail.com"
}
}
Load test scenario: 10 Java threads running on an i7 laptop, each posting 100000 documents with a small modification (to keep the documents distinct, the status field value was randomly generated).
More detail on why I am doing this:
My production Elasticsearch (ES) cluster is performing very badly, with reads taking upwards of 10 seconds. Apart from all the necessary read optimizations I can make, I am also noticing that the ES cluster is generally very busy. I also noticed that my ES index schema does not have indexing disabled for any attribute (and we have around 350 attributes).
So my expectation was that if I disabled indexing for unnecessary attributes, I would get some wins. However, that is not happening.
Can you please shed some light on:
Should setting index: false and enabled: false have improved performance?
Am I disabling the index on attributes the right way?
Is my benchmarking technique right?
NOTE: The document and schema are for reference only; the actual schema and document in PROD are quite large. The result was consistent when benchmarked using a large document.
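For reference on the second question, here is a minimal sketch of where the two settings are accepted, since the pastebin schemas are not reproduced here. It assumes a typeless mapping (Elasticsearch 7 or later) and reuses field names from the sample document; the field types and the choice of which fields to stop indexing are assumptions made for illustration. index: false goes on individual leaf fields, while enabled: false is only accepted on object fields (or the top-level mapping) and makes Elasticsearch skip parsing that object altogether.

```python
import json


def build_mapping():
    """Build an index body that disables indexing on selected fields."""
    return {
        "mappings": {
            "properties": {
                # Fields that are searched keep the default index: true.
                "status": {"type": "keyword"},
                "created_at": {"type": "date"},
                # Leaf field that is never searched: not indexed, still in _source.
                "long_12": {"type": "long", "index": False},
                "division": {
                    "properties": {
                        "prop_1": {"type": "long", "index": False},
                        "prop_2": {"type": "boolean", "index": False},
                        "currency": {"type": "keyword", "index": False},
                    }
                },
                # Whole object kept only in _source: not parsed, not indexed.
                "emails": {"type": "object", "enabled": False},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
```

Note that these settings mainly reduce indexing work and index size; they would not by themselves be expected to speed up reads that never touch those fields.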

elasticsearch query is giving results very slow on a simple query

I am using Elasticsearch to perform some aggregations. Everything used to work fine, but I now have 2 million docs in an index. I am running a very simple search query that lists all documents of a given type in a given index.
{
  "size": 100000,
  "query": {
    "match_all": {}
  }
}
This query is very slow and returns about 300k hits. What could be the possible reasons?
NOTE: I have 2 GB of RAM and 2 cores.
You are trying to get a response with 100,000 documents in it. This is just too much. Elasticsearch is designed for paging, which means fetching results in small chunks. You are trying to fetch 100,000 in one bulk. There is a reason why size defaults to 10.
I finally found out that this configuration is enough for my needs, which is searching over 2 million documents. I had a wrong configuration, and simply doing a match_all is not the right method either; even with 2 million docs, a search based on some criteria would be very fast.
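The paging advice above can be sketched as follows. This is a minimal sketch that only generates the request bodies for successive pages using from and size; the helper name page_bodies and the page size of 100 are assumptions for illustration. Keep in mind that Elasticsearch versions since 2.1 reject requests where from + size exceeds index.max_result_window (10,000 by default), so walking through hundreds of thousands of hits would need the scroll API rather than plain paging.

```python
def page_bodies(total_wanted, page_size=100):
    """Yield match_all search bodies that fetch total_wanted hits page by page."""
    for offset in range(0, total_wanted, page_size):
        yield {
            "from": offset,
            # The last page may be shorter than page_size.
            "size": min(page_size, total_wanted - offset),
            "query": {"match_all": {}},
        }


if __name__ == "__main__":
    for body in page_bodies(250):
        print(body["from"], body["size"])
```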

Elasticsearch: Search Performance of index with large documents (PDF,doc,txt) is slow

I have 65,000 documents (pdf, docx, txt, etc.) indexed in Elasticsearch using mapper-attachment. Now I want to search the content of those stored documents using the following query:
"from" : 0, "size" : 50,
"query": {
"match": {
"my_attachment.content": req.params.name
}
}
However, it takes 20-30 seconds to return results, which is very slow. What can I do to get a quicker response? Any ideas?
Here is the mapping:
"my_attachment": {
"type": "attachment",
"fields": {
"content": {
"type": "string",
"store": true,
"term_vector": "with_positions_offsets"
}
}
}
Since your machine has 4 CPUs and the index has 5 shards, I'd suggest switching to 4 primary shards, which means you need to reindex. The reason is that at any given time one execution of the query will use 4 cores, so for one of the shards the query needs to wait. To get an equal distribution of load at query time, use 4 primary shards (= number of CPU cores) so that there is not too much contention at the CPU level when you run the query.
Also, from the output of curl localhost:9200/your_documents_index/_stats that you provided, I saw that the "fetch" part (retrieving the documents from the shards) is taking 4.2 seconds per operation on average. This is likely the result of having very large documents or of retrieving a lot of documents. size: 50 is not a big number, but combined with large documents it will make the query take longer to return results.
The content field (the one with the actual document in it) has store: true and if you want this for highlighting, the documentation says
In order to perform highlighting, the actual content of the field is required. If the field in question is stored (has store set to true in the mapping) it will be used, otherwise, the actual _source will be loaded and the relevant field will be extracted from it.
So if you didn't disable _source for the index, then that will be used and storing the content is not necessary. Also, there is no magic for a faster fetch; it is strictly related to how large your documents are and how many you want to retrieve. Not using store: true might slightly improve the time.
From nodes stats (curl -XGET "http://localhost:9200/_nodes/stats") there was no indication that the node has memory or CPU problems, so everything boils down to my previous suggestions.
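The two suggestions above (4 primary shards, no store: true on the content field) can be sketched as request bodies for a new index that the data would then be reindexed into. This is only a sketch under assumptions: the helper name drop_store_flag is made up, and dropping store is only safe if _source is still enabled, as noted above.

```python
import copy


def drop_store_flag(mapping):
    """Return a copy of an attachment mapping with store removed from its sub-fields."""
    cleaned = copy.deepcopy(mapping)
    for sub_field in cleaned.get("fields", {}).values():
        sub_field.pop("store", None)
    return cleaned


# The mapping from the question.
ORIGINAL_MAPPING = {
    "type": "attachment",
    "fields": {
        "content": {
            "type": "string",
            "store": True,
            "term_vector": "with_positions_offsets",
        }
    },
}

# Index-creation settings for the suggested shard count (one primary per CPU core).
NEW_INDEX_SETTINGS = {"settings": {"index": {"number_of_shards": 4}}}

if __name__ == "__main__":
    print(drop_store_flag(ORIGINAL_MAPPING))
```

The term_vector setting is left untouched, since highlighting may depend on it.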

Elasticsearch caching a single field for quick response

I have a cluster of 10 nodes where I index about 100 million records daily, close to 6 billion records in total. I am constantly loading data. Each record has about 75 fields. 99% of my queries filter on the same field, essentially select * from table where groupid = 'value'. The majority of the queries return about a hundred records.
My queries currently take about 30 seconds to run the first 2 times and then run in milliseconds. The problem is that all the user queries are searching for a different groupID, so their queries are going to be slow for the most part until they run them a third time.
Is it possible to "cache" the groupid field so that I can get sub-second queries?
My current query looks like this (pseudo-query; I'm using a non-analyzed field, which I believe is better?):
query : {
filtered : {
filter : {
"term" : { groupID : "valuex" }
}
}
}
I've researched this and am not sure how to go about it. I've looked into doc_values = yes and possibly the field cache?
I do not care about scoring or aggregations. My only use case is to filter records, bringing back only the 100 or so out of 5 billion that have the correct groupID.
We have about 64G Memory on each server.
Just looking for help on how to achieve optimal performance/caching? or anything else that would help.
I thought about routing but this would be difficult based on our groupid values.
thanks
Starting from Elasticsearch 2.0 we made some caching changes, such as:
Keeps track of 256 most recently used queries
Only caches those that appear 5 times or more
Does not cache segments which have less than 10000 documents or 3% of the documents of the index
I wonder if you are hitting this last one.
Note that we did that because the file system cache is probably better than internal caching.
Could you try with a bool query instead of a filtered query, BTW? Filtered has been deprecated (and is removed in 5.0). See how it performs.
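The suggested switch from filtered to bool could look like the following. This is a minimal sketch that only rewrites the request body; the helper name filtered_to_bool is made up, and the input is the pseudo-query from the question written out as a proper dict.

```python
def filtered_to_bool(filtered_body):
    """Rewrite a filtered query body as the equivalent bool query body."""
    filtered = filtered_body["query"]["filtered"]
    bool_query = {"filter": filtered["filter"]}
    if "query" in filtered:
        # The scoring part of a filtered query becomes a must clause.
        bool_query["must"] = filtered["query"]
    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    old_body = {"query": {"filtered": {"filter": {"term": {"groupID": "valuex"}}}}}
    print(filtered_to_bool(old_body))
```

Because the term clause stays in filter context, no scoring is done, which matches the stated use case of pure filtering.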

Elasticsearch significant terms aggregation

I've started using the significant terms aggregation to see which keywords are important in groups of documents as compared to the entire set of documents I've indexed.
It works all great until a lot of documents are indexed. Then for the same query that used to work, elasticsearch only says:
SearchPhaseExecutionException[Failed to execute phase [query],
all shards failed; shardFailures {[OIWBSjVzT1uxfxwizhS5eg][demo_paragraphs][0]:
CircuitBreakingException[Data too large, data for field [text]
would be larger than limit of [633785548/604.4mb]];
My query looks the following:
POST /demo_paragraphs/_search
{
"query": {
"match": {
"django_target_id": 1915661
}
},
"aggregations" : {
"signKeywords" : {
"significant_terms" : {
"field" : "text"
}
}
}
}
And the document structure:
"_source": {
"django_ct": "citations.citation",
"django_target_id": 1915661,
"django_id": 3414077,
"internal_citation_id": "CR7_151",
"django_source_id": 1915654,
"text": "Mucin 1 (MUC1) is a protein heterodimer that is overexpressed in lung cancers [6]. MUC1 consists of two subunits, an N-terminal extracellular subunit (MUC1-N) and a C-terminal transmembrane subunit (MUC1-C). Overexpression of MUC1 is sufficient for the induction of anchorage independent growth and tumorigenicity [7]. Other studies have shown that the MUC1-C cytoplasmic domain is responsible for the induction of the malignant phenotype and that MUC1-N is dispensable for transformation [8]. Overexpression of",
"id": "citations.citation.3414077",
"num_distinct_citations": 0
}
The data that I index are paragraphs from scientific papers. No document is really large.
Any ideas on how to analyze or solve the problem?
If the data set is too large to compute the result on one machine, you may need more than one node.
Be thoughtful when planning shard distribution. Make sure that shards are properly distributed so each node is equally stressed when computing heavy queries. A good topology for large data sets is a Master-Data-Search configuration, where you have one node which acts as master (no data, no queries running on this node), a few nodes dedicated to holding data (shards), and some nodes dedicated to executing queries (they do not hold data; they use the data nodes for partial query execution and combine the results). For a start, Netflix uses this topology: Netflix raigad
Paweł Róg is right, you will need much more RAM. For a start, increase the Java heap size available to each node. See this site for details: ElasticSearch configuration
You have to research how much RAM is enough. Sometimes too much RAM actually slows down ES (unless it was fixed in one of the recent versions).
I think there is a simple solution.
Please give ES more RAM :D Aggregations require a lot of memory.
Note that Elasticsearch 6.0 brings the new significant_text aggregation, which doesn't require field data. See https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-significanttext-aggregation.html
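For anyone on Elasticsearch 6.0 or later, the query from the question could be rewritten with significant_text roughly as follows. This is a sketch under assumptions: the helper name and the shard_size of 100 are made up for illustration, and wrapping the aggregation in a sampler follows the usual recommendation to analyze only a sample of the top-matching documents, since significant_text re-analyzes the text on the fly.

```python
import json


def build_significant_text_search(target_id, sample_size=100):
    """Build a search body that uses significant_text inside a sampler aggregation."""
    return {
        "query": {"match": {"django_target_id": target_id}},
        "aggregations": {
            "sample": {
                # Only the top sample_size matching documents per shard are analyzed.
                "sampler": {"shard_size": sample_size},
                "aggregations": {
                    "signKeywords": {"significant_text": {"field": "text"}}
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_significant_text_search(1915661), indent=2))
```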
