I have a Prometheus counter (spring_batch_job_seconds_count{status=~'FAILED'}) that counts job failures. I want to graph job failures over time and alert on job failures.

The increase function gives me what I want except for the first occurrence. The counter is not published until a failure occurs, so there is no increase (or delta or rate) on the first failure event: there is no previous counter value of 0 to compare the first non-zero counter value to.

How can I create a graph that will show the first failure occurrence (as well as subsequent failure occurrences) and a corresponding alert that will trigger on the first failure occurrence (as well as future failure occurrences)? I might be willing to settle for two alerts: one that triggers when the counter increments, and one that triggers on the first occurrence. But I would not want to have to manually shut off the first-occurrence alert after it triggers for the first time.
I managed to do this with falco metrics.
I want to alert on any change, even the first time a metric appears.
(sum(falco_events{k8s_pod_name="runner"} or falco_events{} * 0) by (k8s_pod_name, rule) - sum(falco_events{k8s_pod_name="runner"} offset 5m or falco_events{} * 0) by (k8s_pod_name, rule))
Workaround from here: https://github.com/prometheus/prometheus/issues/1673
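Applied to the counter from the question, the same trick would look roughly like this (a sketch, not tested; job_name is an assumed label, use whatever labels your counter actually carries):

sum(spring_batch_job_seconds_count{status="FAILED"}) by (job_name)
  - sum(spring_batch_job_seconds_count{status="FAILED"} offset 5m
        or spring_batch_job_seconds_count{status="FAILED"} * 0) by (job_name)

The or ... * 0 part substitutes a zero sample when the series did not exist five minutes ago, so the very first increment shows up as a positive delta instead of no data.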
Trying to understand (new to Kafka) how the poll event loop in Kafka works.

Use case: 25 records on the topic, max poll size is set to 5.

max.poll.interval.ms = 5000 // 5 seconds by default
max.poll.records = 5
Sequence of tasks:

1. Poll the records from the topic.
2. Process the records in a for loop.
3. Apply some processing logic where each record either passes or fails.
4. If the logic passes, the record's offset is added to a map.
5. The offsets are then committed using a commitSync call.
6. If it fails, the loop breaks and whatever succeeded before that point is committed. The problem starts after this.
The next poll just keeps moving in batches of 5 even after the error. Is that expected?

What we basically expect is that the loop breaks, the offsets up to the last successfully processed message get committed, and the next poll continues from the failed message.

Example: in the first batch, 5 messages are polled; offsets 1 and 2 are processed successfully and committed, then offset 3 fails. Yet the poll call keeps moving to the next batches (5-10, 10-15, and so on). If there is an error in between, we expect it to stop at that point: the next poll should start from offset 3 in the first case, or, if a failure happens at offset 8 in the second batch, from offset 8, not from the next max-poll batch boundary (which would be 5 messages further on in this case). If it matters, this is a Spring Boot project and enable.auto.commit is false.
I have tried finding this in the documentation, but with no help.

I also tried tweaking max.poll.interval.ms, but that did not help either.

EDIT: I did not accept the answer because there is no direct solution for a custom consumer. Keeping this for informational purposes.
max.poll.interval.ms is in milliseconds, not seconds, so it should be 5000.
Once the records have been returned by the poll (and offsets not committed), they won't be returned again unless you restart the consumer or perform seek() operations on the consumer to reset the offset to the unprocessed ones.
The Spring for Apache Kafka project provides a SeekToCurrentErrorHandler to perform this task for you.
If you are using the consumer yourself (which it sounds like), you must do the seeks.
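For reference, wiring that error handler into a listener container factory looks roughly like this (a sketch against the Spring for Apache Kafka API; the bean wiring and names are illustrative):

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // re-seeks unprocessed records so they are redelivered on the next poll
    factory.setErrorHandler(new SeekToCurrentErrorHandler());
    return factory;
}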
You can manually seek back to the beginning offset of the poll for all assigned partitions on failure. I am not sure how to do this with the Spring consumer.

Sample code for seeking the offset back to the beginning of the poll with a plain consumer. In the code below I get the list of records per partition and then take the offset of the first record as the position to seek to.
import scala.collection.JavaConverters._

def seekBack(records: ConsumerRecords[String, String]): Unit = {
  // For each partition in this poll, rewind to the offset of its first record
  records.partitions().asScala.foreach { partition =>
    val partitionedRecords = records.records(partition)
    val offset = partitionedRecords.get(0).offset()
    consumer.seek(partition, offset)
  }
}
One caveat: doing this unconditionally in production is bad. You only want to seek back when the error is transient; otherwise you will end up retrying infinitely.
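A sketch of how the poll loop might use it, retrying only on Kafka's retriable exceptions (processBatch and handlePermanentFailure are hypothetical placeholders for your own logic):

try {
  // process the batch and commit the offsets that succeeded
  processBatch(records)
  consumer.commitSync()
} catch {
  case _: org.apache.kafka.common.errors.RetriableException =>
    seekBack(records) // transient error: redeliver the same records on the next poll
  case e: Exception =>
    handlePermanentFailure(e, records) // permanent error: skip or dead-letter, don't loop forever
}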
I am fairly new to Prometheus Alertmanager and have a question about firing alerts only during a particular period.

I have a microservice that receives a file and does some processing on it; it is only invoked when it gets a message through a Kafka queue. The file is supposed to come every day between 5 am and 6 am (UTC). The microservice has a metric which is incremented by 1 every time it receives a file. I want to raise an alert if it does not receive a file in that interval. I have created a query like this:
expr : sum(increase(metric_name[1m]) and on() hour(vector(time()))==5) < 1
for: 1h
My questions:
1) Is it correct, or is there a better way to do it?
2) In case of no update, will it return 0 or "datapoints not found"?
3) Is increase the correct function? It tends to give results in decimals due to extrapolation, but I understand that if the increase is 0, it will show 0.
I can't really play around with the scrape_interval, which is set at 30s.
I have not run this expression, but I expect it will cause the alert to fire at 06:00 only and then resolve at 06:01, since that is the only time the expression would have held true for one full hour.
Answering your questions:

It is correct if what you want is a single firing of the alert (sending one mail, for example) and then no further firing. Even then, the schedule is a bit tight and may be hurt by Alertmanager delays, causing the alert to be lost.

In case of no increase, the expression will evaluate to 0. It will be empty when there is an update.

increase is the right function. It even takes counter resets into account.
Answering whether there is a better way to do it:

Regarding your expression, you can get the same result, without the for clause, with:
expr: increase(metric_name[1h])==0 and on() hour()==6 and on() minute()<1
It reads as: starting at 6 am and for 1 minute, alert if there was no increase of the metric over the last hour.
Alerting longer
If you want the alert to last longer (say, for the rest of the day, and you silence it once it is dealt with), you can use subqueries:
expr: increase((metric and on() hour()==5)[18h:])==0 and on() hour()>5
It reads as: starting at 6 am (hour()>5), compute the increase over the 5-6 am window for the next 18 hours. If you like having a pending state, you can drop the trailing and on() hour()>5 and use a for: 1h clause instead.
If you want to alert until a file is submitted, and thus detect a resolution, simply change the expression to evaluate the increase up to now:
expr: increase((metric and on() hour()>5)[18h:])==0 and on() hour()>5
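Put together as a complete alerting rule, it would look something like this (a sketch; the group name, rule name, labels, and annotation text are illustrative):

groups:
  - name: daily-file
    rules:
      - alert: DailyFileMissing
        expr: increase((metric_name and on() hour()==5)[18h:])==0 and on() hour()>5
        labels:
          severity: warning
        annotations:
          summary: No file received in the 5-6 am UTC window today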
I need to collect Windows event logs that were logged within the last 10 seconds. Using a pull subscription, I can collect logs that were saved before the program started as well as logs saved while the program is running. I tried the code available on MSDN:
Subscribing to Events
"I need to start to collect the event logged 10 seconds ago". Here I think I need to set value for LPWSTR pwsQuery to achieve that.
L"*[System/Level= 2]" gives the events with level equal to 2.
L"*[System/EventID= 4624]" gives events with eventID is 4624.
L"*[System/Level < 1]" gives events with level < 2.
Like that I need to set the value for pwsQuery to get event logged near 10 seconds. Can I do in the same way as above? If so how? If not what are the other ways to do it?
EvtSubscribe() gives you new events as they happen. You need to use EvtQuery() to get existing events that have already been logged.
The Consuming Events documentation shows a sample query that retrieves events beginning at a specific time:
// The following query selects all events from the channel or log file where the severity level is
// less than or equal to 3 and the event occurred in the last 24 hour period.
XPath Query: *[System[(Level <= 3) and TimeCreated[timediff(@SystemTime) <= 86400000]]]
So, you can use TimeCreated[timediff(@SystemTime) <= 10000] to get events in the last 10 seconds.
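Putting it together, the full query string passed as pwsQuery would be something like this (untested, but following the documented pattern above):

L"*[System[TimeCreated[timediff(@SystemTime) <= 10000]]]"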
The TimeCreated element is documented here:
TimeCreated (SystemPropertiesType) Element
The timediff() function is described in the Consuming Events documentation:
The timediff function is supported. The function computes the difference between the second argument and the first argument. One of the arguments must be a literal number. The arguments must use FILETIME representation. The result is the number of milliseconds between the two times. The result is positive if the second argument represents a later time; otherwise, it is negative. When the second argument is not provided, the current system time is used.
Can't get the max_failures idea. From the documentation:
This attribute specifies the number of times a job can fail on consecutive scheduled runs before it is automatically disabled.
So, let's suppose I have a schedule. Its running count is 100. Its failure count is 18. Its max failures is 20.
The current run has finished successfully.

What I expect: if I now break it, it will run in state FAILED exactly 20 times, after which it will be changed to BROKEN.

What I get: it runs 2 times, bringing the failure count to 20, and despite there having been just 2 consecutive failed runs, the schedule is changed to state BROKEN.
What have I missed?
I think "consecutive scheduled runs" means exactly that. If it succeeds, the failure count should be reset to 0.
EDIT
Guess I was wrong, sorry.
Reading up: http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/schedadmin004.htm
As per Gary's comment, it looks like you need to reset the failure count manually.
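A heavily hedged sketch of one way to do that, by cycling the job through DBMS_SCHEDULER.DISABLE and DBMS_SCHEDULER.ENABLE (the job name is illustrative, and whether this clears FAILURE_COUNT on your version is an assumption worth checking against the docs linked above):

BEGIN
  -- assumption: a disable/enable cycle resets the consecutive-failure state
  DBMS_SCHEDULER.DISABLE('MY_JOB');
  DBMS_SCHEDULER.ENABLE('MY_JOB');
END;
/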
In general, I am writing rules for events where equal events (equal by attribute values) can occur at any time, even consecutively (every second). I want to fire rules for matched events only on an hourly basis.

In more detail:

I want to fire a rule when an event is inserted for the first time (no equal event exists yet), OR when an event is inserted and equal events already exist in working memory but the newest of them is at least one hour old.

What is a reasonable way of writing a rule of that kind, given that the events' duration will be 24 hours?
rule X
when
    $e : MyEvent() from entry-point "s"
    not( MyEvent( this != $e, id == $e.id, this before[0s,1h] $e ) from entry-point "s" )
then
    // $e arrived and there is no other event with the same
    // id that happened during the last hour
end
Replace "id == $e.id" by whatever constraints you use to decide two events are related to each other.
You could create a global queue like this:
global java.util.List eventQueue;
You also need to access your global queue from Java, so just use:
session.getGlobals();
session.setGlobal(name, value);
In this queue, save each event and its related time. Then check the queue hourly from Java code and execute the rules based on the timestamps. This is not a pure Drools approach, but it is straightforward.
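A minimal sketch of that wiring, assuming an existing KieSession named session with the rules already loaded (MyEvent, the queue contents, and the scheduling details are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// register the global queue with the session
List<Object> eventQueue = new ArrayList<>();
session.setGlobal("eventQueue", eventQueue);

// hourly check driven from Java rather than from Drools
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    synchronized (eventQueue) {
        // insert the queued events into working memory and let the rules match
        eventQueue.forEach(session::insert);
        eventQueue.clear();
    }
    session.fireAllRules();
}, 0, 1, TimeUnit.HOURS);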