Firing many alerts from one rule by metric label

Is there a way to fire many alerts from the same rule using a metric label?
I have a Prometheus counter metric with a label "client".
How do I configure one rule so that it fires separately for each client that satisfies the fire condition?
My version is 8.4.2

This is exactly the way alerts work in Prometheus. It will generate one alert for each label combination that satisfies a fire condition.
For example, the following rule:
- alert: InstanceIsDown
  expr: probe_success{job="blackbox"} == 0
  annotations:
    summary: 'Instance {{ $labels.instance }} is down'
The Blackbox "probe_success" metric has an "instance" label. If the instances "xxx" and "yyy" are down, the rule will generate two alerts, one for each instance.
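Applied to the counter with a "client" label from the question, a rule along the following lines would fire one alert per client whose series meets the condition; the metric name, threshold, and duration here are placeholders, not taken from the question:
- alert: ClientRequestRateHigh
  # Hypothetical counter with a "client" label; each matching series keeps its
  # labels, so one alert is generated per client value.
  expr: rate(requests_total[5m]) > 100
  for: 10m
  annotations:
    summary: 'Client {{ $labels.client }} exceeded the request rate threshold'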

Related

CloudWatch Event Rule: Disable itself

I have a CloudWatch event rule which runs a Lambda function and is configured, for example, like cron(0/2 21-22 15 4 ? 2020).
Inside the Lambda function it checks on something. If that condition is fulfilled, I want to disable the rule. I cannot do it via
new CloudWatchEvents().disableRule({Name: 'MyRule'})
because of concurrency. Is there any other way to achieve it comfortably?

Firing alerts for an activity which is supposed to happen during a particular time interval (using Prometheus metrics and Alertmanager)

I am fairly new to Prometheus Alertmanager and have a question about firing alerts only during a particular period.
I have a microservice which receives a file and does some processing on it, and which is only invoked when it gets a message through a Kafka queue. The file is supposed to come every day between 5 am and 6 am (UTC). The microservice has a metric which is incremented by 1 every time it receives a file. I want to raise an alert if it does not receive a file in that interval. I have created a query like this:
expr: sum(increase(metric_name[1m]) and on() hour(vector(time())) == 5) < 1
for: 1h
My questions:
1) Is it correct, or is there a better way to do it?
2) In case of no update, will it return 0 or "datapoints not found"?
3) Is increase the correct function? It tends to give results in decimals due to extrapolation, but I understand that if the increase is 0, it will show 0.
I can't really play around with scrape_interval, which is set at 30s.
I have not run this expression, but I expect it would cause the alert to fire at 06:00 only and then resolve at 06:01, since that is the only time the expression would have held true for one hour.
Answering your questions
It is correct if what you want is a single firing of the alert (sending a mail, for example) that then stops. Even so, the schedule is a bit tight and may be hurt by Alertmanager delays, causing the alert to be lost.
In case of no increase, the expression will evaluate to 0; it will be empty when there is an update.
increase is the right function; it even takes counter resets into account.
Answering whether there is a better way to do it
Regarding your expression, you can get the same result, without the for clause, with:
expr: increase(metric_name[1h]) == 0 and on() hour() == 6 and on() minute() < 1
It reads as: starting at 6 am and for 1 minute, alert if there was no increase of the metric over the last hour.
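Wrapped into a complete rule, that expression might look like the sketch below; the alert name and annotation are placeholders:
- alert: NoFileReceived
  # Holds only during the first minute after 06:00 (UTC, as evaluated by
  # hour() and minute()) if the counter did not increase between 05:00 and 06:00.
  expr: increase(metric_name[1h]) == 0 and on() hour() == 6 and on() minute() < 1
  annotations:
    summary: 'No file was received between 05:00 and 06:00 UTC'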
Alerting longer
If you want the alert to last longer (say, for the day, until you silence it once it is solved), you can use subqueries:
expr: increase((metric and on() hour()==5)[18h:])==0 and on() hour()>5
It reads as: starting at 6 am (hour()>5), compute the increase over the 5-6 am window for the next 18 hours. If you like having a pending state, you can drop the trailing on() hour()>5 and use a for: 1h clause.
If you want to alert until a file is submitted and thus detect a resolution, simply transform the expression to evaluate the increase until now:
expr: increase((metric and on() hour()>5)[18h:])==0 and on() hour()>5
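As a rough sketch, that resolving variant wrapped in a full rule could look like this; the alert name and annotation are placeholders, and the expression is taken verbatim from the answer above:
- alert: NoFileReceivedToday
  # Active after 06:00 UTC while no file has arrived; once the counter
  # increases, the subquery result becomes non-zero and the alert resolves.
  expr: increase((metric and on() hour()>5)[18h:])==0 and on() hour()>5
  annotations:
    summary: 'No file has been received since 05:00 UTC today'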

Waiting for more than one condition in Elasticsearch

We can use the following endpoint to wait for some event:
GET /_cluster/health?wait_for_status=yellow&timeout=50s
Is it possible to wait for two conditions, something like
GET /_cluster/health?wait_for_status=yellow&wait_for_nodes=10&timeout=50s
?
Yes, it works on my cluster:
GET /_cluster/health?wait_for_status=yellow&wait_for_nodes=13&timeout=5s

Elastalert constant realerting

I'm having some difficulties setting up an Elastalert rule. It's quite a basic one, and I've read the documentation but clearly not understood it, so I'm after some help.
I have a basic test rule that I want to alert when data input to Elastic from certain devices stops for more than 5 minutes.
es_host: localhost
es_port: 9200
name: Example rule
type: flatline
index: test_mapping-*
threshold: 1
timeframe:
  minutes: 5
filter:
- term:
    device: "ggYthy767b"
alert:
- command
command: ["/bin/test"]
realert:
  minutes: 10
This works, so when data stops I get an alert, then that alert is silenced until 10 minutes later, when it realerts again. The issue is that it realerts every 10 minutes and I don't know how to stop it. Is there a way to get it to realert just once and then stop? Or have I misunderstood? Also, I have 10+ different devices, and I want the same alert to apply if any of them stops sending data for 5 minutes; is that possible within one rule? Thanks very much in advance.
The question you need to ask yourself is how often you want to get alerted: once a lifetime, a year, a month, fortnightly, or what? So realert is the part you want to edit. You might want to change it to something like the below, so even if the alert is triggered multiple times you'll only get it once a day. It uses simple English terms, so you can adjust it as you like (weeks, hours, etc.).
realert:
  days: 1
But if you're getting alerted much more often than you want, either your system is too unstable or your alerts are too paranoid. For example, for this alert, every 5 minutes you're looking for one record that never actually gets populated. You should raise your period or add less selective filters, because it's a flatline alert. You can also use it with query_key so it is applied on a per-key basis, as sketched below.
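For the per-device part of the question, a sketch of the relevant rule fragment could look like the following; the field name device is taken from the filter above, and the one-day realert matches the suggestion in this answer:
# Fragment of a flatline rule tracked per device value.
type: flatline
threshold: 1
timeframe:
  minutes: 5
# query_key makes the flatline check run independently for each "device"
# value, so the single-device term filter above can be broadened or removed.
query_key: device
realert:
  days: 1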

Populating a Date/Time in Prometheus Alert Manager E-mail

So I have an alert rule that fires in Prometheus when a queue length has been too long for a certain period of time.
Through the Alertmanager, I am able to create and receive e-mails.
My question now is: as part of my e-mail body, I want to include the date and time at which either the Alertmanager triggered the e-mail or the alert was fired.
I am unsure how to do this. Can I create a label in the alert and populate it somehow with the current date/time, or what? Any ideas?
- alert: Alert
  for: 5m
  expr: ...
  annotations:
    timestamp: >
      time: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }}
I still find iterating over alerts and getting the value of a time series or timestamp into the alert text difficult, so I have solved this problem in the way shown above. It works, and I am able to get the timestamp/time series of the alert in the e-mail body. Cheers!
The alerts in the Alertmanager templates have a StartsAt attribute you could use.
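As an illustration (the receiver name, address, and surrounding configuration are made up), an Alertmanager e-mail receiver could render that attribute roughly like this:
receivers:
- name: 'example-email'
  email_configs:
  - to: 'ops@example.com'
    # The html field is a Go template; .Alerts and .StartsAt come from
    # Alertmanager's template data, everything else here is a placeholder.
    html: >
      {{ range .Alerts }}
      Alert {{ .Labels.alertname }} started at {{ .StartsAt }}<br>
      {{ end }}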
