we can use following endpoint to wait for some event:
GET /_cluster/health?wait_for_status=yellow&timeout=50s
Is it possible to wait for two conditions, something like
GET /_cluster/health?wait_for_status=yellow&wait_for_nodes=10&timeout=50s
?
Yes , it works on my cluster.
GET /_cluster/health?wait_for_status=yellow&wait_for_nodes=13&timeout=5s
Related
I'm using AWS Lambda to do a delete_by_query on an Elasticsearch index so I get rid of everything older than 7 days. That works, but I noticed that the count of the documents is the same before and after, so if I were to run a query in Elasticsearch I may not get correct results until the delete_by_query is completed.
I found this post (python 3.x - Right way to delete and then reindex ES documents - Stack Overflow) that states that it is "best to set wait_for_completion to False. In this case you'll get task details and will be able to track task progress." For one, I haven't found anything that states why this is the case, unless your delete takes 4 hours like that example.
I found code to determine if the delete_by_query is still running at this wonderful site here and tried:
es_client.tasks(detailed=True,actions="*/delete/byquery")
However, I'm getting the message that
'TasksClient' object is not callable.
I am not entirely sure if that is true or not , or if my syntax is incorrect and thus that is why it is not working. It doesn't make sense that I can't programmatically query Tasks with python if I can do it in the console and with curl.
If it is not good to set wait_for_completion to False, and I can't query this with Python, how am I to programmatically get any information about the task or an understanding as to whether I can go ahead with the analytical queries or whatever else I want to do that depends on this task being done?
Okay, I'm not entirely sure why you are getting that error, so I can't help with that in particular. But, I noticed that the python elasticsearch documentation on how to get the task id from the delete_by_query when wait_for_completion is set to false isn't very clear, so I'm going to provide this in case it helps.
from elasticsearch import Elasticsearch
es = Elasticsearch()
response = es.delete_by_query(index=someIndex, body=someQuery, wait_for_completion=False)
# get task id
print(response['task'])
Hope that helps!
I am fairly new to Prometheus alertmanager and had a doubt regarding firing alerts only during a particular period
I have a microservice which receives a file and does some processing on it, which is only invoked when it gets a message through a Kafka queue. The aforementioned is supposed to come every day between 5 am and 6 am(UTC time). The microservice has a metric which is incremented by 1 every time it receives a file. I want to raise an alert if it does not receive a file in the interval. I have created a query like this :
expr : sum(increase(metric_name[1m]) and on() hour(vector(time()))==5) < 1
for: 1h
My questions:-
1) Is it correct or is there a better way to do it
2) In case of no update, will it return 0 or "datapoints not found"
3) Is increase the correct function as it tends to give results in decimals due to extrapolation, but I understand if increase is 0, it will show 0
I can't really play around with scrape_intervals, which is set at 30s.
I have not run this expression but I expect it will cause an alert to fire at 06:00 only and then go off at 06:01. It is the only time the expression would hold true for one hour.
Answering your questions
It is correct if what you want is a single fire of alert (sending a mail by example) but then no longer firing. Even with that, the schedule is a bit tight and may get hurt by alertmanager delay causing the alert to be lost.
In case of no increase, you will get the expression will evaluate to 0. It will be empty when there is an update
Increase is the right function. It even takes into account reset of the counter.
Answering if there is a better way to do it.
Regarding your expression, you can have the same result, without for clause, with:
expr: increase(metric_name[1h])==0 and on() hour()==6 and on() minute()<1
It reads a : starting at 6am and for 1 minutes, if there was no increase of metric over the lasthour.
Alerting longer
If you want the alert to last longer (say for the day and you silence it when it is solved), you can use sub-queries;
expr: increase((metric and on() hour()==5)[18h:])==0 and on() hour()>5
It reads as : starting at 6am (hour()>5), compute the increase over 5-6am for the next 18 hours. If you like having a pending, you can drop the trailing on() hour()>5 and use a for: 1h clause.
If you want to alert until a file is submitted and thus detect a resolution, simply transform the expression to evaluate the increase until now:
expr: increase((metric and on() hour()>5)[18h:])==0 and on() hour()>5
Assuming I have a long running update query where I am updating ~200k to 500k, perhaps even more.Why I need to update so many documents is beyond the scope of the question.
Since the client times out (I use the official ES python client), I would like to have a way to check what the status of the bulk update request is, without having to use enormous timeout values.
For a short request, the response of the request can be used, is there a way I can get the response of the request as well or if I can specify a name or id to a request so as to reference it later.
For a request which is running : I can use the tasks API to get the information.
But for other statuses - completed / failed, how do I get it.
If I try to access a task which is already completed, I get resource not found .
P.S. I am using update_by_query for the update
With the task id you can look up the task directly:
GET /_tasks/taskId:1
The advantage of this API is that it integrates with
wait_for_completion=false to transparently return the status of
completed tasks. If the task is completed and
wait_for_completion=false was set on it them it’ll come back with a
results or an error field. The cost of this feature is the document
that wait_for_completion=false creates at .tasks/task/${taskId}. It is
up to you to delete that document.
From here https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html#docs-update-by-query-task-api
My use case went like this, I needed to do an update_by_query and I used painless as the script language. At first I did a reindex (when testing). Then I tried using the update_by_query functionality (they resemble each other a lot). I did a request to the task api (the operation hasn't finished of course) and I saw the task being executed. When it finished I did a query and the data of the fields that I was manipulating had disappeared. The script worked since I used the same script for the reindex api and everything went as it should have. I didn't investigate further because of lack of time, but... yeah, test thoroughly...
I feel GET /_tasks/taskId:1 confusing to understand. It should be
GET http://localhost:9200/_tasks/taskId
A taskId looks something like this NCvmGYS-RsW2X8JxEYumgA:1204320.
Here is my trivial explanation related to this topic.
To check a task, you need to know its taskId.
A task id is a string that consists of node_id, a colon, and a task_sequence_number. An example is taskId = NCvmGYS-RsW2X8JxEYumgA:1204320 where node_id = NCvmGYS-RsW2X8JxEYumgA and task_sequence_number = 1204320. Some people including myself thought taskId = 1204320, but that's not the way how the elasticsearch codebase developers understand it at this moment.
A taskId can be found in two ways.
wait_for_deletion = false. When sending a request to ES, with this parameter, the response will be {"task" : "NCvmGYS-RsW2X8JxEYumgA:1204320"}. Then, you can check a status of that task like this GET http://localhost:9200/_tasks/NCvmGYS-RsW2X8JxEYumgA:1204320
GET http://localhost:9200/_tasks?detailed=false&actions=*/delete/byquery. This example will return you the status of all tasks with action = delete_by_query. If you know there is only one task running on ES, you can find your taskId from the response of all running tasks.
After you know the taskId, you can get the status of a task with this.
GET /_tasks/taskId
Notice you can only check the status of a task when the task is running, or a task is generated with wait_for_deletion == false.
More trivial explanation, wait_for_deletion by default is true. Based on my understanding, tasks with wait_for_deletion = true are "in-memory" only. You can still check the status of a task while it's running. But it's completely gone after it is completed/canceled. Meaning checking the status will return you a 'resouce_not_found_exception'. Tasks with wait_for_deletion = false will be stored in an ES system index .task. You can still check it's status after it finishes. However, you might want to delete this task document from .task index after you are done with it to save some space. The deletion request looks like this
http://localhost:9200/.tasks/task/NCvmGYS-RsW2X8JxEYumgA:1204320
You will receive resouce_not_found_exception if a taskId is not present. (for example, you deleted some task twice, or you are deleting an in-memory task, whose wait_for_deletetion == true).
About this confusing taskId thing, I made a pull request https://github.com/elastic/elasticsearch/pull/31122 to help clarify the Elasticsearch document. Unfortunately, they rejected it. Ugh.
I'm having some difficulties setting up an elastalert rule. It's quite a basic one, and I've read the documentation but clearly not understood it and I'm after some help.
I have a basic test rule that i want to alert when my data input to elastic from certain devices stops for more that 5 minutes.
es_host: localhost
es_port: 9200
name: Example rule
type: flatline
index: test_mapping-*
threshold: 1
timeframe:
minutes: 5
filter:
- term:
device: "ggYthy767b"
alert:
- command
command: ["/bin/test"]
realert:
minutes: 10
This works, so when data stops i get an alert, then that alert is silenced until 10 minutes later it realerts again. The issue is that it realerts every 10 minutes and i don't know how to stop it. Is there a way to get it to realert just once and then stop? Or have i misunderstood? Also I have 10+ different devices, and i want the same alert to apply if any of them stop sending data for 5 minutes, is that possible within one rule? Thanks very much in advance.
The question you need to ask to yourself is how often do you want to get alerted. Once a lifetime, a year, a month or fortnightly or what? So "realert" is the part you want to edit. You might want to change it to something like below. So even if the alert is triggered multiple times you'll only get it once a day. It uses simple English terms so you can update it how you like it (weeks, hours etc.).
realert:
days: 1
But if you're getting alerted much more than you want, either you're system is too unstable or your alerts are too paranoid. For example for this alert every 5 minutes you're looking for one record which actually doesn't get populated. You should raise your period or add less selective filters because it's a 'flatline' alert. You can also use it with "query_key" so it will be applied on a per key basis.
I'm currently coding an app that utilizes Parse as a backend, but have run into a '124' error. I admit that I do a lot in my cloud functions, but, from what I've observed, it doesn't appear over 15 seconds. Could someone please confirm this? Below is the output.
E2015-03-06T03:49:52.644Z] v286: Ran cloud function createEvent for
user puZNjFVfSm with:
Input:
{"RSVPDate":{"__type":"Date","iso":"2015-03-06T04:49:52.000Z"},"description":"Sample event to showcase
functionality","group":{"max":5,"min":4},"max":50,"reoccur":{"day":1,"month":1,"stop":{"__type":"Date","iso":"2015-03-06T04:49:52.000Z"},"week":1},"title":"SampleFCFS"}
Failed with: Execution timed out I2015-03-06T03:49:52.716Z] begin
I2015-03-06T03:49:52.717Z] creating Event - initial checks completed
I2015-03-06T03:49:52.718Z] Finished advanced checks
I2015-03-06T03:49:52.719Z] Event creation start
I2015-03-06T03:49:52.770Z] begin event creation
I2015-03-06T03:49:52.873Z] Finding role: company_employee_z0Zx39OyuY
I2015-03-06T03:49:52.875Z] Added and secured event
I2015-03-06T03:49:52.931Z] attaching role to 425Qy9v9e4
I2015-03-06T03:49:52.934Z] Adding participant
From what I can tell, it looks like I'm only getting 300Z (is that milliseconds?) on all my runs. Shouldn't I be getting 15 seconds?
Update: I found that the issue was caused by using the addUnique function of Parse Objects with an array of pointers. By inserting ids instead of pointers, the issue was resolved.
Thank you for your help.