I'm having some difficulties setting up an ElastAlert rule. It's quite a basic one, and I've read the documentation but clearly haven't understood it, so I'm after some help.
I have a basic test rule that I want to alert when my data input into Elasticsearch from certain devices stops for more than 5 minutes.
es_host: localhost
es_port: 9200
name: Example rule
type: flatline
index: test_mapping-*
threshold: 1
timeframe:
  minutes: 5
filter:
- term:
    device: "ggYthy767b"
alert:
- command
command: ["/bin/test"]
realert:
  minutes: 10
This works: when data stops I get an alert, and that alert is silenced until it realerts again 10 minutes later. The issue is that it then realerts every 10 minutes and I don't know how to stop it. Is there a way to get it to realert just once and then stop, or have I misunderstood? Also, I have 10+ different devices, and I want the same alert to apply if any of them stops sending data for 5 minutes. Is that possible within one rule? Thanks very much in advance.
The question you need to ask yourself is how often you want to be alerted: once in a lifetime, once a year, once a month, fortnightly? "realert" is the setting you want to edit. You might want to change it to something like the example below, so that even if the alert is triggered multiple times you will only get it once a day. It uses plain English units, so you can adjust it however you like (weeks, hours, etc.).
realert:
  days: 1
But if you're getting alerted much more than you want, either your system is too unstable or your alerts are too paranoid. For example, this rule expects at least one matching record every 5 minutes, and that record simply isn't being written. You should widen the timeframe or use less selective filters, since it's a 'flatline' alert. You can also use "query_key" so the rule is applied on a per-key basis.
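For the multi-device part of the question, here is a minimal sketch of one rule keyed per device. The field name "device" and the first ID come from the original rule; the extra device IDs are placeholders. Note that flatline with query_key only alerts for values it has already seen at least once since ElastAlert started.
es_host: localhost
es_port: 9200
name: Example rule
type: flatline
index: test_mapping-*
threshold: 1
timeframe:
  minutes: 5
query_key: device   # track each device value separately
filter:
- terms:
    # "device2" and "device3" are placeholder IDs
    device: ["ggYthy767b", "device2", "device3"]
alert:
- command
command: ["/bin/test"]
realert:
  days: 1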
I am new to Pine Script, but I am facing a huge problem when I try to work with time.
The problem is:
I have an indicator that shows buy (or sell) on the current candlestick, but sometimes it disappears. I wanted to add a check so that if the buy stays true for more than 15 seconds (the average time after which the signal no longer disappears), the alert sends me a notification.
I tried so many things, like putting 'timenow' in a variable and comparing, as follows:
BuyTime := timenow
if timenow - BuyTime >= 15000
    RealBuy = true
And many more things that I don't remember exactly.
I don't know why, but I wasn't able to get it to work. Any suggestions, please?
Thanks in advance
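In case it helps, here is a minimal Pine Script v5 sketch of the persistence check described above. The buy condition is a placeholder (a simple EMA crossover), and the key point is that the timestamp is recorded only the first time the signal appears, rather than being overwritten on every update:
//@version=5
indicator("Persistent buy sketch", overlay=true)

// Placeholder buy condition -- replace with the indicator's real signal.
buySignal = ta.crossover(ta.ema(close, 9), ta.ema(close, 21))

// Record the time (in ms) when the signal first appeared; clear it if the signal goes away.
var int buyTime = na
if buySignal and na(buyTime)
    buyTime := timenow
if not buySignal
    buyTime := na

// Consider the buy confirmed once the signal has persisted for 15 seconds (15000 ms).
elapsedMs = na(buyTime) ? 0 : timenow - buyTime
realBuy = elapsedMs >= 15000

plotshape(realBuy, style=shape.triangleup, location=location.belowbar, color=color.green)
alertcondition(realBuy, title="Confirmed buy", message="Buy signal has held for 15 seconds")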
My Oracle DBA has set up a task with the following repeat_interval:
Start Date :"30/JAN/20 08:00AM"
Repeat_interval: "FREQ=DAILY; INTERVAL=0; BYMINUTE=15"
Can I ask what "INTERVAL=0" means?
Does it mean this task will run daily from 8 AM and repeat every 15 minutes until it succeeds?
I tried to find the answer on Google, but all I can find is what INTERVAL=1 means, nothing for 0.
It would be great if anyone could shed some light on this.
Thanks in advance!
INTERVAL is the number of increments of the FREQ value between executions. I believe in this case that a value of 0 or 1 would be the same. The schedule as shown would execute once per day (FREQ=DAILY), at approximately 15 minutes past a random hour (BYMINUTE=15, but BYHOUR and BYSECOND are not set).
The schedule has nothing to do with whether or not the previous execution succeeded. The start date is only the date at which the job was enabled, not when it actually starts processing.
If you want it to run every 15 minutes from the moment you enable it, you should set as follows:
FREQ=MINUTELY; INTERVAL=15
If you want it to run exactly on the quarter hour, then this:
FREQ=MINUTELY; BYMINUTE=0,15,30,45; BYSECOND=0
If you want it to run every day at 8am, then this:
FREQ=DAILY; BYHOUR=8; BYMINUTE=0; BYSECOND=0
I am fairly new to Prometheus Alertmanager and have a question about firing alerts only during a particular period.
I have a microservice which receives a file and does some processing on it; it is only invoked when it gets a message through a Kafka queue. The file is supposed to arrive every day between 5 am and 6 am (UTC). The microservice has a metric which is incremented by 1 every time it receives a file. I want to raise an alert if it does not receive a file in that interval. I have created a query like this:
expr : sum(increase(metric_name[1m]) and on() hour(vector(time()))==5) < 1
for: 1h
My questions:
1) Is this correct, or is there a better way to do it?
2) In case of no update, will it return 0 or "datapoints not found"?
3) Is increase the correct function? It tends to give results in decimals due to extrapolation, but I understand that if the increase is 0, it will show 0.
I can't really play around with scrape_intervals, which is set at 30s.
I have not run this expression, but I expect it will cause an alert to fire at 06:00 only and then resolve at 06:01, since that is the only time the expression would have held true for one hour.
Answering your questions
1) It is correct if what you want is a single firing of the alert (sending a mail, for example) and then no further firing. Even so, the timing is a bit tight and may be affected by Alertmanager delays, causing the alert to be lost.
2) In case of no increase, the expression will evaluate to 0. It will return nothing (no data points) when there is an update.
3) increase is the right function; it even takes counter resets into account.
As for whether there is a better way to do it:
Regarding your expression, you can get the same result, without the for clause, with:
expr: increase(metric_name[1h])==0 and on() hour()==6 and on() minute()<1
It reads as: starting at 6 am and for 1 minute, alert if there was no increase of the metric over the last hour.
Alerting longer
If you want the alert to last longer (say, for the rest of the day, silencing it yourself once it is resolved), you can use subqueries:
expr: increase((metric_name and on() hour()==5)[18h:])==0 and on() hour()>5
It reads as: starting at 6 am (hour()>5), compute the increase over the 5-6 am window for the next 18 hours. If you prefer having a pending state, you can drop the trailing and on() hour()>5 and use a for: 1h clause.
If you want to alert until a file is submitted and thus detect a resolution, simply transform the expression to evaluate the increase until now:
expr: increase((metric_name and on() hour()>5)[18h:])==0 and on() hour()>5
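Putting it together, here is a sketch of a complete rule file using the first variant; the group name, alert name, labels and annotations are placeholders:
groups:
- name: file-ingestion          # placeholder group name
  rules:
  - alert: NoFileReceived       # placeholder alert name
    expr: increase(metric_name[1h])==0 and on() hour()==6 and on() minute()<1
    labels:
      severity: warning
    annotations:
      summary: "No file received during the 05:00-06:00 UTC window"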
Context
I have multiple servers listening to a specific collection (/items). Each of them uses NTS for time calibration and ".info/serverTimeOffset" to measure the expected time difference with Firebase; it is consistently around 20 ms.
I have many clients pushing items to the collection with the specific field:
{
  ...
  created: Firebase.database.ServerValue.TIMESTAMP
}
What is expected:
When the server receives the item from Firebase and subtracts item.created from the expected Firebase time (Date.now() + offset), the result should be positive, probably around 10 ms (the time for the item to travel from Firebase to the server).
What is happening:
When the server receives the items, the item.created field is greater than the expected Firebase time, as if the item was created in the future. Usually the difference is around -5 ms.
Question:
What is Firebase.database.ServerValue.TIMESTAMP set to, and how is it related to ".info/serverTimeOffset"?
On 27 September 2016 at 1 am UTC, that difference jumped from -5 ms to around -5000 ms, as if some kind of re-calibration had happened (it lasted until I reset the .info/serverTimeOffset). Has anyone experienced something similar?
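For reference, here is a minimal sketch of the measurement described above, using the Web/Node Firebase client SDK; the /items path comes from the description, and the rest is illustrative:
// Assumes firebase.initializeApp(...) has already been called.
// Track the estimated offset between the local clock and Firebase's server clock.
var offset = 0;
firebase.database().ref(".info/serverTimeOffset").on("value", function (snap) {
  offset = snap.val();
});

// For each new item, compare its server-side "created" timestamp
// with our estimate of the current server time.
firebase.database().ref("/items").on("child_added", function (snap) {
  var item = snap.val();
  var estimatedServerTime = Date.now() + offset;
  // Expected to be slightly positive (network latency);
  // the question reports it coming out negative instead.
  console.log("delta (ms):", estimatedServerTime - item.created);
});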
The Cube software (https://github.com/square/cube) allows you to retrieve events.
I want to retrieve a lot of events, but it appears that I am capped at 1000. There are well over 9000 in MongoDB in the collection and time range I am querying.
Example http GET queries I issue:
# 1000 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type
# 1000 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type&start=2012-02-02&stop=2013-07-03
# 7 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type&limit=7
# 1000 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type&limit=9999
It appears that the limit is pinned at line 166 of
https://github.com/square/cube/blob/28dad4af27a6680deb46077b16952590f2c21cad/lib/cube/event.js
based on the batchSize = 1000.
Is it possible to 'page' through the data in some way, or is this just a hard limit?
Looks like there is a hard cap on results in three places that need to be updated for large domains:
event.js - line 166
metric.js - line 11
metric.js - line 12
In addition, I was unable to find any query-string APIs for these parameters. Ideally, we can leave the cap at 1000 (to avoid server bloat for people not tuning their queries correctly) and allow the consumer to define override behavior.