I am using codahale metrics (now dropwizard metrics) to monitor a few 'events' happening in my system. I am using counter metrics to keep track of the number of times each 'event' happened.
I checked the values printed by the reporter for my counter metrics, and the value keeps increasing (and never goes down). This seems logical, as I always call the metrics.inc() function whenever my 'event' occurs.
What I really want is the count of my 'event' between two reporting times. For this I would need to reset my counter every time I report my metrics, but I couldn't find any option in the counter metrics to do that. Is there a way, or a general practice followed by codahale users, to produce such metrics?
Current Behavior (reporting time 10 sec):
00:00:00 0
00:00:10 2 // event happened twice
00:00:20 2 // event did not occur
00:00:30 5 // event occurred three times
Expected metrics:
00:00:00 0
00:00:10 2
00:00:20 0
00:00:30 3
To sum up or calculate the count (total) per arbitrary interval:
hitcount(perSecond(your.count), '1day')
As far as I know, it does all the black magic internally, including but not limited to summarize(scaleToSeconds(nonNegativeDerivative(your.count),1), '1day'),
and it should also scale according to carbon's retention periods (one or many) that fall into the chosen aggregation interval.
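The core step behind nonNegativeDerivative can be sketched in plain Java: subtract each cumulative reading from the next one. This is only a minimal illustration, not Graphite's implementation; the class name CounterDeltas is hypothetical, and treating a drop as a restart from 0 is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class CounterDeltas {
    /** Turns cumulative counter readings into per-interval increments. */
    static List<Long> deltas(long[] cumulative) {
        List<Long> result = new ArrayList<>();
        for (int i = 1; i < cumulative.length; i++) {
            long diff = cumulative[i] - cumulative[i - 1];
            // Assumption: a negative difference means the counter restarted from 0.
            result.add(diff >= 0 ? diff : cumulative[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Readings from the question at 00:00:00, 00:00:10, 00:00:20, 00:00:30.
        System.out.println(deltas(new long[] {0, 2, 2, 5})); // prints [2, 0, 3]
    }
}
```

The output matches the "Expected metrics" in the question for the three intervals after the first reading.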
I believe that a counter is not the right metric for your case. Consider using a meter, which will give you a rate per time interval:
while(...) {
    int stuffProcessed = doStuff();
    meter.mark(stuffProcessed);
}
I collected the following metrics using Prometheus and WebClient:
http_client_requests_seconds_count{clientName="aaa.com", ..., uri="/test"} 5
http_client_requests_seconds_max{clientName="aaa.com", ..., uri="/test"} 0
http_client_requests_seconds_sum{clientName="aaa.com", ..., uri="/test"} 10
I want to know what each metric means.
And what is the time unit? Is 'http_client_requests_seconds_sum' in milliseconds, nanoseconds, or seconds?
Does 'http_client_requests_seconds_max' mean the longest time?
Please help!
http_client_requests_seconds_count is the total number of requests your application made to this endpoint (don’t worry about the fact that the name contains the word seconds).
http_client_requests_seconds_sum is the sum of the duration of every request your application made to this endpoint.
http_client_requests_seconds_max is the maximum request duration during a time window. The value resets to 0 when a new time window starts. The default time window is 2 minutes.
Reference: Spring Boot default metrics
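As for the time unit: the _seconds part of the metric name indicates that the durations are reported in seconds. Dividing the sum by the count then gives the mean request duration. A minimal sketch using the sample values from the question (the class and method names are hypothetical):

```java
class TimerStats {
    /** Mean request duration in seconds, or 0 when no requests were recorded. */
    static double meanSeconds(double sumSeconds, long count) {
        return count == 0 ? 0.0 : sumSeconds / count;
    }

    public static void main(String[] args) {
        // http_client_requests_seconds_sum = 10, http_client_requests_seconds_count = 5
        System.out.println(meanSeconds(10.0, 5)); // prints 2.0
    }
}
```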
I am fairly new to Prometheus Alertmanager and have a question about firing alerts only during a particular period.
I have a microservice which receives a file and does some processing on it. It is only invoked when it gets a message through a Kafka queue. The file is supposed to arrive every day between 5 am and 6 am (UTC). The microservice has a metric which is incremented by 1 every time it receives a file. I want to raise an alert if it does not receive a file in that interval. I have created a query like this:
expr : sum(increase(metric_name[1m]) and on() hour(vector(time()))==5) < 1
for: 1h
My questions:
1) Is it correct, or is there a better way to do it?
2) In case of no update, will it return 0 or "datapoints not found"?
3) Is increase the correct function? It tends to give decimal results due to extrapolation, but I understand that if the increase is 0, it will show 0.
I can't really play around with scrape_intervals, which is set at 30s.
I have not run this expression but I expect it will cause an alert to fire at 06:00 only and then go off at 06:01. It is the only time the expression would hold true for one hour.
Answering your questions
1) It is correct if what you want is a single firing of the alert (sending a mail, for example) that then stops. Even so, the schedule is a bit tight, and an Alertmanager delay may cause the alert to be lost.
2) If there is no increase, the expression evaluates to 0. It returns an empty result when there is an update.
3) Increase is the right function. It even takes counter resets into account.
Answering if there is a better way to do it.
Regarding your expression, you can have the same result, without for clause, with:
expr: increase(metric_name[1h])==0 and on() hour()==6 and on() minute()<1
It reads as: starting at 6am and for 1 minute, fire if there was no increase of the metric over the last hour.
Alerting longer
If you want the alert to last longer (say for the day, and you silence it when it is solved), you can use subqueries:
expr: increase((metric and on() hour()==5)[18h:])==0 and on() hour()>5
It reads as: starting at 6am (hour()>5), compute the increase over 5-6am for the next 18 hours. If you want a pending state, you can drop the trailing on() hour()>5 and use a for: 1h clause.
If you want to alert until a file is submitted and thus detect a resolution, simply transform the expression to evaluate the increase until now:
expr: increase((metric and on() hour()>5)[18h:])==0 and on() hour()>5
I have a simple spring boot app with the following config (the project is available here on GitHub):
management:
  metrics:
    export:
      simple:
        mode: step
  endpoints:
    web:
      exposure:
        include: "*"
The above config creates a SimpleMeterRegistry and configures its metrics to be step-based, with a 60-second step. I have one script that sends 50-100 requests per second to the service's dummy endpoint, and another script that polls the data from /actuator/metrics/http.server.requests every X seconds. When I run the latter script every 60 seconds everything works as expected, but when it runs every 120 seconds, the response always contains zeros for the TOTAL_TIME and COUNT metrics.
Can anyone explain this behavior?
I have read the documentation here. A picture in it could indicate that a registry will try to aggregate the data for the previous interval only if pollAsRate is called during the current interval. This would explain why it does not work for a 120-second interval. But this is just my assumption; does anyone know what is really happening here?
Spring boot version: 2.1.7.RELEASE
UPDATE
I did a similar test with management.metrics.export.simple.step=10s. It works fine when the polling interval is 10s and does not work when it is 20s. With a 15s interval it works sporadically. So it's definitely related to the step size and polling frequency.
MAX, TOTAL_TIME and COUNT are values of Statistic.
DistributionStatisticConfig has .expiry(Duration.ofMinutes(2)), which sets some measurements to 0 if no request has been made in the last 2 minutes (120 seconds).
Methods such as public TimeWindowMax(Clock clock,...) and private void rotate() were written for this purpose. You may see the implementation here.
More Detailed Answer
Finally figured out what is happening.
On every request to /actuator/metrics, MetricsEndpoint is going to merge measures (see here). That is done by collecting values for all meters with measurement.getValue(). The StepMeasurement.getValue() will not simply return the value, it will update the current and the previous intervals and counts, and roll the count (see here and here).
StepMeasurement.getValue
public double getValue() {
    double absoluteCount = (Double)this.f.get();
    double inc = Math.max(0.0D, absoluteCount - this.lastCount.sum());
    this.lastCount.add(inc);
    this.value.getCurrent().add(inc);
    return this.value.poll();
}
StepDouble.poll
public double poll() {
    rollCount(clock.wallTime());
    return previous;
}
How is this related to the polling interval? If you do not poll /actuator/metrics endpoint, the current and previous intervals will not be updated, thus resulting in the current interval not being up-to-date and metrics being recorded for the "wrong" interval.
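The effect can be reproduced with a small single-threaded model. This is a simplified sketch, not Micrometer's code: the class StepModel and its fields are hypothetical, and the rule that the accumulated count is dropped when more than one step boundary has passed since the last roll is an assumption about how rollCount behaves in this version.

```java
import java.util.function.DoubleSupplier;

class StepModel {
    private final long stepMillis;
    private final DoubleSupplier absoluteCount; // cumulative total, like this.f above
    private double lastCount;    // portion of the total already accounted for
    private double current;      // increments waiting to be rolled
    private double previous;     // value reported to callers
    private long lastInitStep;   // index of the step in which the last roll happened

    StepModel(long stepMillis, DoubleSupplier absoluteCount) {
        this.stepMillis = stepMillis;
        this.absoluteCount = absoluteCount;
    }

    double getValue(long nowMillis) {
        double inc = Math.max(0.0, absoluteCount.getAsDouble() - lastCount);
        lastCount += inc;
        current += inc;
        long step = nowMillis / stepMillis;
        if (lastInitStep < step) {
            double accumulated = current;
            current = 0.0;
            // Assumption: keep the count only if exactly one step boundary was crossed.
            previous = (lastInitStep == step - 1) ? accumulated : 0.0;
            lastInitStep = step;
        }
        return previous;
    }

    /** Polls the model a number of times and returns the last value seen. */
    static double lastPolledValue(long stepMillis, long pollMillis, int polls) {
        long[] now = {0};
        // 100 events per second, expressed as a cumulative total.
        StepModel model = new StepModel(stepMillis, () -> now[0] / 10.0);
        double value = 0.0;
        for (int i = 1; i <= polls; i++) {
            now[0] = i * pollMillis;
            value = model.getValue(now[0]);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(lastPolledValue(60_000, 60_000, 3));  // prints 6000.0
        System.out.println(lastPolledValue(60_000, 120_000, 3)); // prints 0.0
    }
}
```

Under the same assumption, a 10-second step polled every 15 seconds alternates between a non-zero value and 0 in this model, which is consistent with the sporadic behavior described in the update to the question.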
I have state change duration data for my object states, in milliseconds. I am sending this data to Graphite. I want to create a single stat panel which shows the percentage of durations that are less than 20 seconds. How can I create it? Any idea or similar example would be useful.
myProjectName.FromStateToState.duration 10000ms
myProjectName.FromStateToState.duration 15000ms
myProjectName.FromStateToState.duration 21000ms
myProjectName.FromStateToState.duration 25000ms
myProjectName.FromStateToState.duration 30000ms
For the above scenario I expect the percentage to be 40%, because I have 5 durations and 2 of them are less than 20 seconds. I am using Graphite as the data source and Grafana for visualization.
Temporary Solution
Because I didn't get much attention or any answer, I will add my temporary solution here. If I learn the exact solution in the future, I will post it as an answer too.
Basically I created two counters, counterSuccess and counterFail. If the state change duration is less than 20 seconds, I increase counterSuccess; otherwise I increase counterFail. Then I get the success percentage via the following basic formula: counterSuccess/(counterSuccess + counterFail).
Graphite commands at Grafana Panel:
A : sumSeries(myProjectName.FromStateToState.counterSuccess.count)
B : sumSeries(myProjectName.FromStateToState.counterFail.count)
C : sumSeries(#A, #B)
D : divideSeries(#A,#C)
I defined a single stat in Grafana to show it as a single percentage.
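The arithmetic behind the two counters can be checked with a small Java sketch. It is only an illustration of the formula, not something Graphite or Grafana runs; the class and method names are hypothetical.

```java
class DurationShare {
    /** Percentage of durations strictly below the threshold, or 0 for empty input. */
    static double percentBelow(long[] durationsMillis, long thresholdMillis) {
        if (durationsMillis.length == 0) {
            return 0.0;
        }
        int below = 0; // plays the role of counterSuccess
        for (long d : durationsMillis) {
            if (d < thresholdMillis) {
                below++;
            }
        }
        return 100.0 * below / durationsMillis.length;
    }

    public static void main(String[] args) {
        long[] samples = {10000, 15000, 21000, 25000, 30000}; // durations from the question
        System.out.println(percentBelow(samples, 20000)); // prints 40.0
    }
}
```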
I need to collect Windows event logs that were logged in the last 10 seconds. Using a pull subscription I could collect logs that were already saved before the program started, as well as logs saved while the program is running. I tried the code available on MSDN:
Subscribing to Events
"I need to start collecting from the events logged 10 seconds ago". I think I need to set the value of LPWSTR pwsQuery to achieve that.
L"*[System/Level= 2]" gives the events with level equal to 2.
L"*[System/EventID= 4624]" gives the events with eventID equal to 4624.
L"*[System/Level < 1]" gives the events with level < 1.
In the same way, I need to set the value of pwsQuery to get the events logged within the last 10 seconds. Can I do it in the same way as above? If so, how? If not, what are the other ways to do it?
EvtSubscribe() gives you new events as they happen. You need to use EvtQuery() to get existing events that have already been logged.
The Consuming Events documentation shows a sample query that retrieves events beginning at a specific time:
// The following query selects all events from the channel or log file where the severity level is
// less than or equal to 3 and the event occurred in the last 24 hour period.
XPath Query: *[System[(Level <= 3) and TimeCreated[timediff(@SystemTime) <= 86400000]]]
So, you can use TimeCreated[timediff(@SystemTime) <= 10000] to get events in the last 10 seconds.
The TimeCreated element is documented here:
TimeCreated (SystemPropertiesType) Element
The timediff() function is described on the Consuming Events documentation:
The timediff function is supported. The function computes the difference between the second argument and the first argument. One of the arguments must be a literal number. The arguments must use FILETIME representation. The result is the number of milliseconds between the two times. The result is positive if the second argument represents a later time; otherwise, it is negative. When the second argument is not provided, the current system time is used.
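Since timediff() works in milliseconds, the only part of the query that changes with the window is the number. A small helper can build the query string for a given number of seconds. This is a sketch in Java for illustration only; the class and method names are hypothetical, and the resulting string is what would be passed as the query (pwsQuery in the question).

```java
class EventQueries {
    /** Builds an XPath query selecting events created within the last given seconds. */
    static String recentEvents(long seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        long millis = seconds * 1000; // timediff() returns milliseconds
        return "*[System[TimeCreated[timediff(@SystemTime) <= " + millis + "]]]";
    }

    public static void main(String[] args) {
        // prints *[System[TimeCreated[timediff(@SystemTime) <= 10000]]]
        System.out.println(recentEvents(10));
    }
}
```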