Elastic gives inconsistent results under stress

Our ES cluster is fairly slow; we have not optimized it (or the query) yet. According to this link, however, a request rejection from Elastic is a form of feedback that asks the client to slow down and adapt the size of the bulk.
We built a form of back pressure in which the size of a blocking bulk (a list of individual requests sent at the same time; we do not use MSearch yet) depends on how many requests were rejected in the previous bulk. We wait for the current bulk to finish before starting a new one. All rejected requests are re-injected into the request queue (in the form of the data needed to construct the query). For example, if our Elastic can handle 500 simultaneous requests and we send 600, some of them will be rejected and the size of the next bulk will be reduced to 480 (20% off).
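The back-pressure loop described above can be sketched roughly as follows. This is an illustration in Python, not the actual elastic4s code: the names next_bulk_size, run_with_back_pressure and send_bulk are hypothetical, and the fixed 20% shrink is an assumption taken from the 600 -> 480 example, since the exact sizing rule is not given.

```python
from collections import deque


def next_bulk_size(current_size, rejected_count, shrink_percent=20, min_size=1):
    """Shrink the next bulk by shrink_percent if anything was rejected."""
    if rejected_count > 0:
        # Integer arithmetic avoids float rounding (600 -> 480 exactly).
        return max(min_size, current_size * (100 - shrink_percent) // 100)
    return current_size


def run_with_back_pressure(requests, send_bulk, initial_size):
    """Send requests in blocking bulks and re-queue the rejected ones.

    send_bulk(batch) is a hypothetical callable that blocks until the whole
    bulk has finished and returns (successful_results, rejected_requests).
    """
    queue = deque(requests)
    size = initial_size
    results = []
    while queue:
        batch = [queue.popleft() for _ in range(min(size, len(queue)))]
        ok, rejected = send_bulk(batch)
        results.extend(ok)
        queue.extend(rejected)  # rejected requests go back into the queue
        size = next_bulk_size(size, len(rejected))
    return results
```

Note that the loop only terminates if send_bulk eventually accepts at least one request per bulk; a real implementation would probably also cap the number of retries.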
What we found out was that ES returns different results for the previously rejected requests. For example it may return something like the expected result, but with an offset of 2. We also have missing results where an address should have 1 result, but has none due to this bug.
If the bulk size is less than the threshold that ES can handle, everything goes as expected and we get expected results.
It doesn't look like it's the library's (elastic4s) problem.
Elastic configuration:
2 nodes with 5 shards each
Per node:
2 CPU, 32 GB ram, 16 GB heap. Everything else is default
I couldn't find any information on the internet, did anyone have this problem? What was the solution?
What we tried so far:
Thread.sleep between bulks as the link above suggests.
Removing cache on query level as well as removing it from the index.
Trying same index on a different (slower) hardware.
Verified that it's not a race-condition (in our code) problem.
Update:
What the query looks like.
Thread pool for search:
"search" : {
  "type" : "fixed",
  "min" : 4,
  "max" : 4,
  "queue_size" : 1000
},
2nd UPDATE:
We also tried setting preference to our query (thinking that it was a problem with shards): .preference(Preference.Primary) with no positive result (they were even more random than before). Two consecutive runs with this setting give different "random" results, so this is not consistent.

The reason for the inconsistent results was that Elastic replies with Success if at least 1 shard returned a result. So if only one of our 5 shards succeeded, the request would return a successful response containing only 20% of the data.
As seen here and here, this is not a bug, this is a feature. Elastic prefers to return some (albeit incomplete) result rather than nothing at all.
The solution is either to use only one shard, or to treat any response with more than 0 failed shards as a failed request, using the following object that each ES response contains:
"_shards" : {
  "total" : 5,
  "successful" : 5,
  "failed" : 0
},
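A minimal sketch of the second option, assuming the response has already been parsed into a Python dict. The names PartialResultError and require_all_shards are hypothetical helpers, not part of any Elasticsearch client:

```python
class PartialResultError(Exception):
    """Raised when a search response is missing results from failed shards."""


def require_all_shards(response):
    """Return the response unchanged, or raise if any shard failed."""
    shards = response["_shards"]
    if shards["failed"] > 0:
        raise PartialResultError(
            "%d of %d shards failed; result is partial"
            % (shards["failed"], shards["total"])
        )
    return response
```

A caller would catch PartialResultError and put the request back into the queue, exactly as it would for an outright rejection.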

Related

ElasticSearch: Result window is too large

My friend stored 65000 documents on the Elastic Search cloud and I would like to retrieve all of them (using Python). However, when I run my current script, I get the following error:
RequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
My script
es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))
docs = es.search(body={"query": {"match_all": {}}, '_source': ["_id"], 'size': 65000})
What would be the easiest way to retrieve all of those documents without being limited to 10000 docs? Thanks.

The limit exists so that a result set does not overwhelm your nodes. Results occupy memory on the Elasticsearch node, so the bigger the result set, the bigger the memory footprint and the impact on the nodes.
Depending on what you want to do with the retrieved documents:
try the scroll API (as suggested in your error message) if it's a batch job. Be mindful of the lifetime of the scroll context in that case.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-scroll
or use Search After.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-search-after

You should use the scroll API and fetch the results over several calls. The scroll API returns at most 10000 results per call (the scroll context stays available for the amount of time you indicate in the call), and you page through the rest of the results using the scroll_id.
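A rough sketch of that paging loop. The function name scroll_all_hits and its parameters are hypothetical; the client argument is assumed to behave like the elasticsearch-py client from the question (search, scroll and clear_scroll methods, older body= calling style), so newer client versions may need different keyword arguments. Setting "_source" to False is also an assumption, on the basis that only the document ids are wanted.

```python
def scroll_all_hits(client, index, page_size=1000, keep_alive="2m"):
    """Yield every hit of a match_all query, one scroll page at a time."""
    page = client.search(
        index=index,
        scroll=keep_alive,  # how long the scroll context is kept alive
        body={"query": {"match_all": {}}, "_source": False, "size": page_size},
    )
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit
            # Ask for the next page using the scroll id of the previous one.
            page = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Free the scroll context on the server once we are done.
        if scroll_id is not None:
            client.clear_scroll(scroll_id=scroll_id)
```

With the client from the question this would be used as ids = [hit["_id"] for hit in scroll_all_hits(es, "<your-index-name>")]. The elasticsearch-py package also ships a helpers.scan function that wraps the same loop.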

The error message itself tells you how to solve the issue; look carefully at this part of it:
This limit can be set by changing the [index.max_result_window] index
level setting.
Please refer to update indices level setting for how to change that.
For your case it would look like this (note that the value equals the total number of docs in your index):
PUT /<your-index-name>/_settings
{
  "index" : {
    "max_result_window" : 65000
  }
}

Elasticsearch giving cached result even after 5-6 seconds

My system calls Elasticsearch. After updating a document I would like to fetch the same document again. When I do so, Elasticsearch sometimes returns cached results (results from before the update), even when I retry the get 5-6 seconds later.
I used refresh: 'wait_for' when updating the document. Can anyone suggest a workaround? I would like to fetch the latest revision of the updated document. My query to fetch it is:
body: {
  query: {
    terms: {
      _id: [
        idsToFetch
      ]
    }
  }
}

First, check what refresh interval is set for your index. It defaults to 1 second, in which case refresh: wait_for should return within at most 1 second. But as explained in the official ES documentation:
If the refresh interval is set to -1, disabling the automatic
refreshes, then requests with refresh=wait_for will wait indefinitely
until some action causes a refresh. Conversely, setting
index.refresh_interval to something shorter than the default like
200ms will make refresh=wait_for come back faster, but it’ll still
generate inefficient segments.
You can check which refresh_interval is set for the index using https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-settings.html. Please note that it appears in the result only if it is not set to its default value.
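As a small illustration of that last point, here is a sketch that reads the interval out of an already-parsed get-settings response and falls back to the default when the key is absent. The function name refresh_interval is hypothetical, and the response shape (index name -> "settings" -> "index") is assumed from the get-settings API:

```python
def refresh_interval(settings_response, index_name, default="1s"):
    """Return the index's refresh_interval, or the default if it is not set."""
    index_settings = (
        settings_response.get(index_name, {}).get("settings", {}).get("index", {})
    )
    # The key is only present when the interval was changed explicitly.
    return index_settings.get("refresh_interval", default)
```

A returned value of "-1" would mean automatic refreshes are disabled, which is the case where refresh=wait_for can wait indefinitely.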
Let me know if you face any issue or have more question.

Error 429 [type=reduce_search_phase_exception]

I have many languages for my docs and am following this pattern: One index per language. In that they suggest to search across all indices with the
/blogs-*/post/_count
pattern. For my case I am getting a count across the indices of how many docs I have. I am running my code concurrently so making many requests at same time. If I search
/blogs-en/post/_count
or any other language then all is fine. However if I search
/blogs-*/post/_count
I soon encounter:
"Error 429 (Too Many Requests): [reduce] [type=reduce_search_phase_exception]"
Is there a workaround for this? The same number of requests is made regardless of if I use
/blogs-en/post/_count or /blogs-*/post/_count.
I have always used the same number of workers in my code but re-arranging the indices to have one index per language suddenly broke my code.
EDIT: It is a brand new index without any documents when I start the program and when I get the error I have about 5,000 documents so not under any heavy load.
Edit: I am using the mapping found in the above-referenced link and running on a local machine with all the defaults of ES...in my case shards=5 and replicas=1. I am really just following the example from the link.
EDIT: The errors appear after as few as 13-20 requests, and I know ES can handle more than that. Searching /blogs-en/post/_count instead of /blogs-*/post/_count, etc. can easily handle thousands of requests with no errors.
Another Edit: I have removed all concurrency but can still only make 40-50 requests before I get the error.

I don't get an error for that request, and it returns the total number of documents.
Is your cluster under load?
Anyway, using a simple aggregation you can get the total document count in hits.total and the per-index document count in the count_per_index part of the result:
GET /blogs-*/post/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "count_per_index": {
      "terms": {
        "field": "_index"
      }
    }
  }
}
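Reading the two counts back out of the response could look roughly like this, assuming the response has been parsed into a Python dict. The function name counts_from_response is hypothetical; the isinstance check is there because hits.total is a plain number in older Elasticsearch versions and an object with a "value" field in newer ones.

```python
def counts_from_response(response):
    """Return (total_count, {index_name: doc_count}) from the search response."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        total = total["value"]
    buckets = response["aggregations"]["count_per_index"]["buckets"]
    per_index = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
    return total, per_index
```

Keep in mind that a terms aggregation returns only the top 10 buckets by default, so with more than 10 language indices the "size" of the aggregation would need to be raised.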

How to check current size of the Elasticsearch queues defined in threadpool.XXX.queue_size?

How can I check the current size of the Elasticsearch queues defined in threadpool.index.queue_size and threadpool.bulk.queue_size?
There is already a question on SO related to queueing problems in Elasticsearch: ElasticSearch gives error about queue size
That one is about how to set the queue sizes. But how do I read the current (real-time) size/load of those queues, in order to avoid overloading Elasticsearch at runtime and/or to see whether the queues are used optimally and their lengths are set properly?
I tried to find the answer in the CAT API, but found nothing explicitly related to those queues (perhaps I overlooked something).

I got the answer on the Elasticsearch forum: How to check current size of the Elasticsearch queues defined in threadpool.XXX.queue_size?
The REST call:
curl -XGET "https://server:port/_nodes/thread_pool?v"
returns JSON in which the paths nodes/thread_pool/index and nodes/thread_pool/bulk contain information about the queue sizes.
Something like:
"index" : {
  "type" : "fixed",
  "min" : 2,
  "max" : 2,
  "queue_size" : 1000
},
More on that topic can be found at: Nodes Stats
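Note that the output above shows the configured limits. The Nodes Stats API reports the live counters per thread pool (such as queue, active and rejected). A rough sketch of summarizing a parsed GET /_nodes/stats/thread_pool response follows; the function name thread_pool_load is hypothetical, and the pool names are taken from the question and differ between Elasticsearch versions:

```python
def thread_pool_load(stats_response, pools=("index", "bulk")):
    """Return {node_name: {pool: {"queue", "active", "rejected"}}}."""
    load = {}
    for node_id, node in stats_response.get("nodes", {}).items():
        thread_pools = node.get("thread_pool", {})
        load[node.get("name", node_id)] = {
            pool: {
                "queue": thread_pools[pool]["queue"],        # tasks waiting now
                "active": thread_pools[pool]["active"],      # threads busy now
                "rejected": thread_pools[pool]["rejected"],  # rejections so far
            }
            for pool in pools
            if pool in thread_pools
        }
    return load
```

Polling this periodically and comparing "queue" against the configured queue_size shows how close each pool is to rejecting requests.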

es_rejected_execution_exception rejected execution

I'm getting the following error when doing indexing.
es_rejected_execution_exception rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1#16248886
on EsThreadPoolExecutor[bulk, queue capacity = 50,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#739e3764[Running,
pool size = 16, active threads = 16, queued tasks = 51, completed
tasks = 407667]
My current setup:
Two nodes. One is the master (data: true, master: true) while the other one is data only (data: true, master: false). They are both EC2 I2.4XL (16 Cores, 122GB RAM, 320GB instance storage). 2 shards, 1 replication.
Those two nodes are fed by our aggregation server, which has 20 separate workers. Each worker sends bulk indexing requests to our ES cluster with 50 items per request. Each item is between 1000 and 4000 characters.
Current server setup: 4x client facing servers -> aggregation server -> ElasticSearch.
Now the issue is that this error only started occurring when we introduced the second node. Before, when we had one machine, we got a consistent indexing throughput of 20k requests per second. Now, with two machines, once it hits the 10k mark (~20% CPU usage) we start getting some of the errors outlined above.
But here is the interesting thing I have noticed. We have a mock item generator which generates random documents to be indexed. Generally these documents are of the same size, but have random parameters. We use this to run stress tests and check stability. The mock item generator sends requests to the aggregation server, which in turn passes them to Elasticsearch. With it, we are able to index around 40-45k items per second (~80% CPU usage) without getting this error. So it is puzzling why we get this error at all. Has anyone seen this error, or does anyone know what could be causing it?
