Elastic Reindex does not copy all documents - elasticsearch

We are using the elastic reindex api at https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
However, this job sometimes abruptly gives up copying partway through and returns with a completed: true status. There are also no running background tasks when we check via _cat/tasks?v=true&detailed=true.
We take some actions when the reindex is complete (making the new index active), but the data is incomplete, which causes search issues.
Our expectation is that total = created + version_conflicts (we use op_type: create) when the completed flag is true.
Any idea why some documents are occasionally not copied, and/or why the reindex task gives up midway with a false completed status?
Note: this does not always happen and may be related to slightly higher load.
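One way to sanity-check those counters before acting on completion is to run the reindex as a background task and inspect its status via the Tasks API. A sketch mirroring the op_type: create setup described above; the index names and task id are placeholders:
POST _reindex?wait_for_completion=false
{
  "conflicts": "proceed",
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index", "op_type": "create" }
}

GET _tasks/<task_id_from_the_response_above>
# In the task status, compare status.total against status.created + status.version_conflicts
# (+ status.noops) before switching the destination index to active.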

Related

Elasticsearch query that is guaranteed to time out

Can someone help me craft an Elasticsearch query that is likely to time out on a few thousand records/documents? I would like to see what is actually returned when an aggregation request times out. Is this documented anywhere?
My attempt so far:
POST /myindex/_search?size=0
{
  "aggs": {
    "total-cost": {
      "sum": {
        "field": "cost",
        "missing": 1
      }
    }
  }
}
The reason for this question is that in production I sometimes get a response that's missing the "total-cost" aggregation. I have a hunch it might be due to timeouts. That's why I want to see exactly what is returned when a request times out.
I've also looked at how to set the request timeout in the Kibana console, and apparently there is no way to do this.
NB. I am talking about search timeouts, not connection timeouts.
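For what it's worth, the search timeout can be set in the request body itself, so no console-level setting is needed. A minimal sketch reusing the aggregation above, where a very short timeout such as 1ms makes timed_out: true likely on a few thousand documents:
POST /myindex/_search?size=0
{
  "timeout": "1ms",
  "aggs": {
    "total-cost": {
      "sum": { "field": "cost", "missing": 1 }
    }
  }
}
As the answer below explains, though, this is only best-effort and does not hard-stop execution.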
As per my understanding, the query timeout will not work as expected in Elasticsearch, for a few reasons.
Elasticsearch executes a query in two phases when you send a request to the cluster: the Query Phase and the Fetch Phase. So while specifying a timeout does cause Elasticsearch to return a (roughly) partial response after the timeout has elapsed, it doesn't prevent the server from finishing the query execution and is therefore of no use in limiting server load.
Please check the warning in the timeout documentation:
It’s important to know that the timeout is still a best-effort
operation; it’s possible for the query to surpass the allotted
timeout. There are two reasons for this behavior:
Timeout checks are performed on a per-document basis. However, some
query types have a significant amount of work that must be performed
before documents are evaluated. This "setup" phase does not consult
the timeout, and so very long setup times can cause the overall
latency to shoot past the timeout.
Because the time is checked once per document, a very long query can execute on a single document and it
won’t timeout until the next document is evaluated. This also means
poorly written scripts (e.g. ones with infinite loops) will be allowed
to execute forever.
Now, you might wonder whether the cluster will go down or an OutOfMemory exception will occur in this scenario. That can be handled with circuit breaker settings.
Please check GitHub issue #60037.
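As a hedged illustration of those settings (the 40% value is only an example, not a recommendation), the request circuit breaker limit can be tightened dynamically:
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "40%"
  }
}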

What happens if I never clean up Elasticsearch tasks?

The update-by-query docs say that with wait_for_completion=false a task will be created to track progress, and that the Tasks API should be used to clean up the tasks afterwards.
What is the consequence of never cleaning up these old tasks, or doing so very infrequently? Is the cost only the disk space these task files take up?
Yes, it's not a big deal if you don't clean up those tasks immediately. The .tasks index usually has one primary shard, which allows you to store up to about 2B task documents (2^31, i.e. the maximum number of docs per shard) before getting into trouble.
If you use them to keep track of your tasks, it's better to clean them up once they are done, otherwise you might end up with a mess of finished task documents that are not easy to sort out.
That can also be taken care of by a simple cron job that periodically runs
POST .tasks/_delete_by_query?q=*
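A minimal crontab sketch for that job, assuming an unsecured cluster on localhost:9200 (adjust the URL and add authentication as needed):
# purge finished task documents every night at 02:00
0 2 * * * curl -s -X POST "http://localhost:9200/.tasks/_delete_by_query?q=*"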

Trains: Can I reset the status of a task? (from 'Aborted' back to 'Running')

I had to stop training in the middle, which set the Trains status to Aborted.
Later I continued it from the last checkpoint, but the status remained Aborted.
Furthermore, automatic training metrics stopped appearing in the dashboard (though custom metrics still do).
Can I reset the status back to Running and make Trains log training stats again?
Edit: When continuing training, I retrieved the task using Task.get_task() and not Task.init(). Maybe that's why training stats are not updated anymore?
Edit2: I also tried Task.init(reuse_last_task_id=original_task_id_string), but it just creates a new task, and doesn't reuse the given task ID.
Disclaimer: I'm a member of the Allegro Trains team.
When continuing training, I retrieved the task using Task.get_task() and not Task.init(). Maybe that's why training stats are not updated anymore?
Yes, that's the only way to continue the exact same Task.
You can also mark it as started with task.mark_started(). That said, the automatic logging will not kick in, since Task.get_task() is usually used for accessing previously executed tasks rather than continuing them (if you think the continue use case is important, please feel free to open a GitHub issue; I can definitely see the value there).
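A minimal sketch of that call (the task id is a placeholder):
from trains import Task

aborted_task = Task.get_task(task_id='<aborted_task_id>')
aborted_task.mark_started()  # resets the status from Aborted, but does not re-enable automatic logging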
You can also do something a bit different, and just create a new Task that continues from the last iteration where the previous run ended. Notice that if you load the weights file (PyTorch/TF/Keras/joblib), it will automatically be connected with the model that was created in the previous run (assuming the model was stored in the same location, or you have the model on HTTP(S)/S3/GS/Azure and are using trains.StorageManager.get_local_copy()).
from trains import Task
import torch

previous_run = Task.get_task(task_id='<previous_task_id>')  # the aborted run
task = Task.init('examples', 'continue training')
task.set_initial_iteration(previous_run.get_last_iteration())
torch.load('/tmp/my_previous_weights')  # loading the weights auto-connects the model from the previous run
BTW:
I also tried Task.init(reuse_last_task_id=original_task_id_string), but it just creates a new task, and doesn't reuse the given task ID.
This is a great idea for an interface to continue a previous run; feel free to add it as a GitHub issue.

Apache NiFi GetMongo Processor

I am new to NiFi. I am using GetMongo to extract documents from MongoDB, but the same result keeps coming back again and again, even though the query only matches 2 documents. The query is {"qty":{$gt:10}}.
There is a similar question regarding this. Let me quote what I had said there:
"GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is changing the Run Schedule and/or Scheduling Strategy. You can find them by right clicking on the processor and clicking Configure. By default, Run Schedule will be 0 sec which means running continuously. Changing it to, say, 60 min will make the processor run every one hour. This will still read the same documents from MongoDB again every one hour but since you have mentioned that you just want to run it only once, I'm suggesting this approach."
The question can be found here.

Elasticsearch timeout true but still get result

I'm setting a timeout of 10ms on my search query, so I'm expecting the Elasticsearch search to time out after 10ms.
In the response I do get "timed_out": true, but the query doesn't seem to time out; it still runs for a few hundred milliseconds.
Sample response:
{
"took": 460,
"timed_out": true,
....
Is this the expected behavior, or am I missing something here? My goal is to terminate the query if it's taking too long so that it doesn't put load on the cluster.
What to expect from query timeout?
An Elasticsearch query running with a timeout set may return partial or empty results once the timeout has expired. From the Elasticsearch Guide:
The timeout parameter tells shards how long they are allowed to
process data before returning a response to the coordinating node. If
there was not enough time to process all data, results for this shard
will be partial, even possibly empty.
The documentation of the Request Body Search parameters also says this:
timeout
A search timeout, bounding the search request to be executed within
the specified time value and bail with the hits accumulated up to that
point when expired. Defaults to no timeout.
For further details please consult this page in the guide.
How to terminate queries that run too long?
It looks like Elasticsearch does not have one definitive answer, but rather several workarounds for particular cases. Here they are.
There isn't a way to fully protect the system from DoS attacks (as of 2015). Long-running queries can be limited with the timeout or terminate_after query parameters. terminate_after is like timeout, but it counts the number of documents per shard rather than elapsed time. Both of these parameters are more like recommendations to Elasticsearch, meaning that some long-running queries (a script query, for instance) can still exceed the desired maximum execution time.
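A sketch of both parameters on a search request (the values are illustrative only):
GET /myindex/_search
{
  "timeout": "500ms",
  "terminate_after": 10000,
  "query": { "match_all": {} }
}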
Since then, the Task Management API has been introduced, making it possible to monitor and cancel long-running tasks. This means you will have to write some additional code that checks the cluster and cancels the offending tasks.
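A sketch of that flow with the Task Management API (the task id is a placeholder taken from the list response):
GET _tasks?actions=*search&detailed=true

POST _tasks/<task_id>/_cancel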
