Elasticsearch timed_out is true but I still get results

I'm setting a timeout of 10ms on my search query, so I expect the Elasticsearch search query to time out within 10ms.
In the response I do get "timed_out": true, but the query doesn't seem to time out. It still runs for a few hundred milliseconds.
Sample response:
{
  "took": 460,
  "timed_out": true,
  ...
Is this the expected behavior, or am I missing something here? My goal is to terminate the query if it's taking too long, so that it doesn't put load on the cluster.

What to expect from query timeout?
An Elasticsearch query running with a timeout set may return partial or empty results once the timeout has expired. From the Elasticsearch Guide:
The timeout parameter tells shards how long they are allowed to
process data before returning a response to the coordinating node. If
there was not enough time to process all data, results for this shard
will be partial, even possibly empty.
The documentation of the Request Body Search parameters also says this:
timeout
A search timeout, bounding the search request to be executed within
the specified time value and bail with the hits accumulated up to that
point when expired. Defaults to no timeout.
For further details please consult this page in the guide.
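For illustration, a minimal sketch of a search with the timeout set (the index name myindex and the match_all query are just placeholders):
POST /myindex/_search
{
  "timeout": "10ms",
  "query": {
    "match_all": {}
  }
}
If the timeout expires before all shards have finished, the response carries "timed_out": true along with whatever hits were accumulated up to that point.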
How to terminate queries that run too long?
It looks like Elasticsearch does not have one definitive answer, rather several workarounds for particular cases. Here they are.
There isn't a way to protect the system from DoS attacks (as of 2015). Long-running queries can be limited with the timeout or terminate_after query parameters; terminate_after is like timeout, but it counts the number of documents per shard instead of elapsed time. Both of these parameters are more like recommendations to Elasticsearch, meaning that some long-running queries (a script query, for instance) can still run past the desired maximum execution time.
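As a sketch, the same idea bounded by document count instead of elapsed time (the value 1000 is an arbitrary example; when the limit is reached the response should report "terminated_early": true):
POST /myindex/_search
{
  "terminate_after": 1000,
  "query": {
    "match_all": {}
  }
}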
Since then, the Task Management API has been introduced, and monitoring and cancelling long-running tasks has become possible. This means you will have to write some additional code that checks the health of the cluster and cancels the offending tasks.
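For example, long-running search tasks can be listed and then cancelled by task id; the task id below is a placeholder for one returned by the first call:
GET /_tasks?actions=*search&detailed=true

POST /_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel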

Related

Elastic Reindex does not copy all documents

We are using the elastic reindex api at https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
However, sometimes this job abruptly and seemingly randomly gives up copying and returns with a completed: true status. There are also no running background tasks when we check via _cat/tasks?v=true&detailed=true.
We take some actions when the reindex is complete (making the new index active), but the data is not complete, which causes search issues.
Usually the expectation is that total = created + version conflicts (we use operation type create) when the completed flag is true.
Any idea why all documents are sometimes not copied, and/or why the reindexing task gives up midway with a false completed status?
Note: This does not happen every time and could be related to slightly higher load as well.
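For reference, a sketch of the kind of reindex call in question, run as a task so its status and failures can be inspected afterwards (index names are placeholders):
POST /_reindex?wait_for_completion=false
{
  "conflicts": "proceed",
  "source": { "index": "source-index" },
  "dest": { "index": "dest-index", "op_type": "create" }
}

GET /_tasks/<task id returned by the call above>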

Elasticsearch query that is guaranteed to time out

Can someone help me craft an Elasticsearch query that is likely to time out on a few thousand records/documents? I would like to see what is actually returned when an aggregation request times out. Is this documented anywhere?
My attempt so far:
POST /myindex/_search?size=0
{
  "aggs": {
    "total-cost": {
      "sum": {
        "field": "cost",
        "missing": 1
      }
    }
  }
}
The reason for this question is that sometimes in production I get a response that's missing the "total-cost" aggregation. I have a hunch it might be due to timeouts, which is why I want to see exactly what is returned when a request times out.
I've also looked at how to set the request timeout in the Kibana console, and apparently there is no way to do this.
NB. I am talking about search timeouts, not connection timeouts.
As per my understanding, the query timeout will not work as expected in Elasticsearch, for a few reasons.
Elasticsearch executes a query in two phases when you send a request to the cluster: the Query Phase and the Fetch Phase. When you specify a timeout, Elasticsearch does return a partial response after the timeout has elapsed (roughly), but it does not stop the server from finishing the query execution, and it is therefore of no use in limiting server load.
Please check the warning in the timeout documentation:
It’s important to know that the timeout is still a best-effort
operation; it’s possible for the query to surpass the allotted
timeout. There are two reasons for this behavior:
Timeout checks are performed on a per-document basis. However, some
query types have a significant amount of work that must be performed
before documents are evaluated. This "setup" phase does not consult
the timeout, and so very long setup times can cause the overall
latency to shoot past the timeout.
Because the time is checked once per document, a very long query can execute on a single document and it
won’t timeout until the next document is evaluated. This also means
poorly written scripts (e.g. ones with infinite loops) will be allowed
to execute forever.
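Building on that warning, one way to craft a search that is likely to time out is to attach heavy per-document script work. A sketch (the field name and loop count are arbitrary; note that Painless caps loops at 1,000,000 iterations by default, so stay below that):
POST /myindex/_search?timeout=10ms&size=0
{
  "query": {
    "script": {
      "script": {
        "lang": "painless",
        "source": "long s = 0; for (int i = 0; i < 100000; i++) { s += i; } return s >= 0;"
      }
    }
  },
  "aggs": {
    "total-cost": {
      "sum": { "field": "cost", "missing": 1 }
    }
  }
}
On a few thousand documents the per-document script work should comfortably exceed the 10ms budget, so the response should come back with "timed_out": true and only partially accumulated aggregation results.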
Now, you might ask whether in this scenario the cluster will go down or an OutOfMemory exception will occur. This is the scenario you can handle with circuit breaker settings.
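For instance, the parent circuit breaker limit can be tightened dynamically; the 70% value here is only an example:
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "70%"
  }
}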
Please check GitHub issue #60037.

Gremlin query via HTTP is extremely slow

So, I'm running two very simple Gremlin queries through both the Gremlin Console and via an HTTP request (issued from the same machine the Gremlin Server resides on). The queries look like this:
First query:
console: g.V(127104, 1069144, 590016, 200864).out().count()
http: curl -XPOST -Hcontent-type:application/json -d '{"gremlin":"g.V(127104, 1069144, 590016, 200864).out().count()"}' http://localhost:8182
Second query:
console: g.V(127104, 1069144, 590016, 200864).out().in().dedup().count()
http: curl -XPOST -Hcontent-type:application/json -d '{"gremlin":"g.V(127104, 1069144, 590016, 200864).out().in().dedup().count()"}' http://localhost:8182
It is by no means a huge graph: the first query returns 750 and the second returns 9154. My problem is that I see huge performance differences between the queries run via HTTP and via the console. For the first query, both the console and the HTTP request return immediately, and looking at the Gremlin Server log I'm pleased to see that the query takes only 1-2 milliseconds in both cases. All is good.
Now for the second query, the picture changes. While the console continues to provide the answer immediately, it now takes between 4 and 5 seconds (!!) for the HTTP request to return the answer! The server log reports roughly the same execution time (some 50-60 ms) for both executions of the second query, so what is going on? I'm only doing a count(), so the slow HTTP response cannot be a serialization issue; it only needs to return a number, just as in the first query.
Does anyone have any good ideas?
UPDATE:
Running profile() gives some interesting results (screenshots below). It looks like everything runs way slower when called via HTTP, which makes no sense to me...
From console: [profile() output screenshot]
Via HTTP request: [profile() output screenshot]
With the help of Stephen Mallette I managed to find the answer to this question. It turns out that the console, which runs in a session, caches answers to queries, so when I queried the same ids multiple times the console simply retrieved the answer from the cache and didn't actually query Dynamo at all. HTTP, on the other hand, runs sessionless, so each query over HTTP was hitting Dynamo. Needless to say, retrieving a result from a cache is much, much faster than having to query Dynamo.
In order to force the query to hit Dynamo in the console, I added a g.tx().rollback() after each query execution, and the query now runs in comparable time whether I use the console or query via HTTP. Unfortunately it's rather slow in my opinion, but that's probably a topic for a different question :)
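In console terms the pattern is simply (using the ids from the question):
gremlin> g.V(127104, 1069144, 590016, 200864).out().in().dedup().count()
gremlin> g.tx().rollback()
The rollback discards the open transaction, and with it the transaction-level cache, so the next execution of the query has to hit the backend again.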
UPDATE: The reason for the slow response times with Dynamo was read/write rate-limiting that had been added to keep the cost of Dynamo down. When the rate limits were increased significantly, the query ran much faster. This unfortunately gets too expensive for me going forward, so I have now switched to Cassandra as the backend instead, which also gives me excellent response times :)

ElasticSearch Timeout to Cancel Expensive Long-Running Queries

In Elasticsearch 2.x, is there a way to cancel queries after a certain timeout to reduce server load, i.e. to circuit-break based on request time?
Specifically, if you send expensive queries (e.g. large numbers of wildcards or complex aggregations), these can effectively bring down a cluster, leaving it unable to service new requests. Is there a server-level config that times out these queries after a certain amount of time?
The timeout parameter in the request body is best effort and just bounds the time until you receive an HTTP response; the query is still in flight and will consume cluster resources until it finishes executing. The circuit breaker option https://www.elastic.co/guide/en/elasticsearch/reference/2.3/circuit-breaker.html is the behavior I would like to see, but based on request time instead of the amount of heap used.

Parse.com - Performance problems with 100K users

We have a Parse application with about 100K users.
Our queries on the user table time out.
For example, I'm doing the following query:
var query = new Parse.Query(Parse.User);
query.exists("email");
query.find(...);
This query will time out. If I limit the results to a low number, e.g. 10, I can get the first 10 results, but the next pages will time out. I.e., this will time out:
query.limit(10);
query.skip(500);
query.find(...);
We are currently in a situation where we are not able to manage our users. Whenever we try to get a list of users by some attribute, or change something for a batch of users, we get a timeout.
We tried doing the queries in Cloud Code and using the JavaScript SDK. Both methods eventually fail with timeouts.
Am I doing something wrong or is it a Parse limitation?
Parse cloud functions have a timeout of 15 seconds, and before/after save triggers have a timeout of 3 seconds.
If you need more time, you should find a way to do what you need in a background job rather than a cloud function. Background jobs have a 15-minute limit, which is more than enough to do anything reasonable; for anything that requires more time, you'll have to find a way to save where you left off and have the job run multiple times until everything you wanted to do is complete.
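A minimal sketch of such a background job (the job name is made up; Query.each pages through every match by object id internally, which avoids the limit/skip pagination that was timing out above):
Parse.Cloud.job("auditUserEmails", function(request, status) {
  // Jobs typically need the master key to read the whole _User table.
  Parse.Cloud.useMasterKey();
  var query = new Parse.Query(Parse.User);
  query.exists("email");
  // each() streams all matching users one by one instead of paging with skip.
  query.each(function(user) {
    // ... per-user work goes here ...
  }).then(function() {
    status.success("Finished processing users.");
  }, function(error) {
    status.error("Stopped early: " + error.message);
  });
});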
