How to determine a windows process has restarted using PromQL and Grafana

How to determine a windows process has restarted using PromQL and Grafana - windows

I am using the 'windows_exporter' metric exporter on a windows server...I am trying to use the metric from that called "windows_process_start_time" to visualize how many times a process has restarted based on the process and process_id.
this is my promQL used within the Grafana dashboard
changes( windows_process_start_time { process="Foo.*", process_id=~".*"[$__interval] )
however, when a process has stopped, and a new one starts with a new process_id, the 'changes' is no detecting it, so in the grafana dashboard it still shows 0 for a 'Foo' Processor.
What would be an effective PromQL do show when a process restarts?
updated with suggested solution in comment, but this does not work:
avg by (windows_process_start_time { process="Foo.*", process_id=~".*"}) unless avg by (process,process_d) ( windows_process_start_time { process="Foo.*", process_id=~".*"} )

Related

Custom gauge for prometheus Go SDK

I have a APP that monitors some external jobs (among other things). This does this monitoring every 5 Mins
I'm trying to create a prometheus gauge to get the count of currently running jobs.
Here is how I declared my gauge
JobStats= promauto.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: "myapi",
Subsystem: "app",
Name: "job_count",
Help: "Current running jobs in the system",
ConstLabels: nil,
},
[]string{"l1", "l2", "l3"},
)
in the code that actually counts the jobs I do
metrics.JobStats.WithLabelValues(l1,l2,l3).add(float64(jobs_cnt))
when I query the /metrics endpoint I get the number
The thing is, this metrics only keeps increasing. If I restart the app this get resets to zer & again keeps increasing
I'm using grafana to graph this in a dashboard.
My question is
Get the graph to show the actual number of jobs (instead of ever increasing line)?
Should this be handled in code (like setting this to zero before every collection?) or in grafana?

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an apache beam pipeline to GCP dataflow in a DEV environment and everything worked well. Then I deployed it to production in Europe environment (to be specific - job region:europe-west1, worker location:europe-west1-d) where we get high data velocity and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds to release events sooner than the end of session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey. In this step there are thousands of events under the droppedDueToLateness counter and the dataFreshness keeps increasing (increasing since when I deployed it). All steps before this one operates good and all steps after are affected by it, but doesn't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K keys to 200K keys per second (depends on time of day), which seems quite a lot to me. The cpu utilization doesn't go over the 70% and I am using streaming engine. Number of workers most of the time is 2. Max worker memory capacity is 32GB while the max worker memory usage currently stands on 23GB. I am using e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicious is the huge amount of keys being processed in the EventToSession/GroupPairsByKey step. But on the other, session is usually related to a single customer so google should expect handle this amount of keys to handle per second, no?
Would like to get suggestions how to solve the dataFreshness and events droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event))
.withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
.apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
.apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
.discardingFiredPanes()
.withAllowedLateness(Duration.standardDays(30)))
.apply("GroupPairsByKey", GroupByKey.create())
.apply("CreateCollectionOfValuesOnly", Values.create())
.apply("FlattenTheValues", Flatten.iterables());

After doing some research I found the following:
regarding constantly increasing data freshness: as long as allowing late data to arrive a session window, that specific window will persist in memory. This means that allowing 30 days late data will keep every session for at least 30 days in memory, which obviously can over load the system. Moreover, I found we had some ever-lasting sessions by bots visiting and taking actions in websites we are monitoring. These bots can hold sessions forever which also can over load the system. The solution was decreasing allowed lateness to 2 days and use bounded sessions (look for "bounded sessions").
regarding events dropped due to lateness: these are events that on time of arrival they belong to an expired window, such window that the watermark has passed it's end (See documentation for the droppedDueToLateness here). These events are being dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data so the solution was to check each event's timestamp before it is going to the sessions part and stream to the session part only events that won't be dropped - events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest will be written to BigQuery without the session data (Apparently apache beam drops an event if the event's timestamp is before event_arrival_time - (gap_duration + allowed_lateness) even if there is a live session this event belongs to...)
p.s - in the bounded sessions part where he demonstrates how to implement a time bounded session I believe he has a bug allowing a session to grow beyond the provided max size. Once a session exceeded the max size, one can send late data that intersects this session and is prior to the session, to make the start time of the session earlier and by that expanding the session. Furthermore, once a session exceeded max size it can't be added events that belong to it but don't extend it.
In order to fix that I switched the order of the current window span and if-statement and edited the if-statement (the one checking for session max size) in the mergeWindows function in the window spanning part, so a session can't pass the max size and can only be added data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
List<IntervalWindow> sortedWindows = new ArrayList<>();
for (IntervalWindow window : c.windows()) {
sortedWindows.add(window);
}
Collections.sort(sortedWindows);
List<MergeCandidate> merges = new ArrayList<>();
MergeCandidate current = new MergeCandidate();
for (IntervalWindow window : sortedWindows) {
MergeCandidate next = new MergeCandidate(window);
if (current.intersects(window)) {
if ((current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis())) {
current.add(window);
continue;
}
}
merges.add(current);
current = next;
}
merges.add(current);
for (MergeCandidate merge : merges) {
merge.apply(c);
}
}

Using grafana counter to visualize weather data

I'm trying to visualize my weather data using grafana. I've already made the prometheus part and now I face an issue that hunts me for quite a while.
I created an counter that adds temperature indoor every five minutes.
var tempIn = prometheus.NewCounter(prometheus.CounterOpts{
Name: "tempin",
Help: "Temperature indoor",
})
for {
tempIn.Add(station.Body.Devices[0].DashboardData.Temperature)
time.Sleep(time.Second*300)
}
How can I now visualize this data that it shows current temperature and stores it for unlimited time so I can look at it even 1 year later like an normal graph?
tempin{instance="localhost:9999"} will only display added up temperature so its useless for me. I need the current temperature not the added up one. I also tried rate(tempin{instance="localhost:9999"}[5m])
How to solve this issue?

Although a counter is not the best solution for this use case, you can use the operator increase.
Increase(tempin{instance="localhost:9999"}[5m])
This will tell you how much the counter increased in the last five minutes

clear prometheus metrics from collector

I'm trying to modify prometheus mesos exporter to expose framework states:
https://github.com/mesos/mesos_exporter/pull/97/files
A bit about mesos exporter - it collects data from both mesos /metrics/snapshot endpoint, and /state endpoint.
The issue with the latter, both with the changes in my PR and with existing metrics reported on slaves, is that metrics created lasts for ever (until exporter is restarted).
So if for example a framework was completed, the metrics reported for this framework will be stale (e.g. it will still show the framework is using CPU).
So I'm trying to figure out how I can clear those stale metrics. If I could just clear the entire mesosStateCollector each time before collect is done it would be awesome.
There is a delete method for the different p8s vectors (e.g. GaugeVec), but in order to delete a metric, I need to not only the label name, but also the label value for the relevant metric.

Ok, so seems it was easier than I thought (if only I was familiar with go-lang before approaching this task).
Just need to cast the collector to GaugeVec and reset it:
prometheus.NewGaugeVec(prometheus.GaugeOpts{
Help: "Total slave CPUs (fractional)",
Namespace: "mesos",
Subsystem: "slave",
Name: "cpus",
}, labels): func(st *state, c prometheus.Collector) {
c.(*prometheus.GaugeVec).Reset() ## <-- added this for each GaugeVec
for _, s := range st.Slaves {
c.(*prometheus.GaugeVec).WithLabelValues(s.PID).Set(s.Total.CPUs)
}
},

spring batch: alert with grafana & prometheus if a job failed in the last xx minutes

I am using spring batch (4.2.2.RELEASE) together with the spring actuator (2.2.6 RELEASE). Since version 4.2, spring batch provides support for batch monitoring and metrics based on micrometer (https://docs.spring.io/spring-batch/docs/4.2.x/reference/html/monitoring-and-metrics.html).
For example i am able to see with the metric name spring_batch_job how often a job was executed, its status and duration.
I want to monitor this metric with grafana & prometheus and alert if a job failed in the last xx minutes.
If the spring batch application runs as a service it seems that it sums up all the metrics until the service is stopped. For example if a job was started 12 times in the last hour the metrics output could be the following:
spring_batch_job_seconds_count{name="mainJob",status="COMPLETED",} 10.0
spring_batch_job_seconds_sum{name="mainJob",status="COMPLETED",} 354.354538083
spring_batch_job_seconds_count{name="mainJob",status="FAILED",} 2.0
spring_batch_job_seconds_sum{name="mainJob",status="FAILED",} 0.880157862
So two instances of the mainJob failed. Assumed in the next hour all 12 jobs will be successful, the metrics output would be:
spring_batch_job_seconds_count{name="mainJob",status="COMPLETED",} 22.0
spring_batch_job_seconds_sum{name="mainJob",status="COMPLETED",} 708.704538083
spring_batch_job_seconds_count{name="mainJob",status="FAILED",} 2.0
spring_batch_job_seconds_sum{name="mainJob",status="FAILED",} 0.880157862
How am i able to check if a job failed in the last xx minutes? Because the following expression would still return the two failed job instances: spring_batch_job_seconds_count{status="FAILED"}[15m]

I'm not familiar with Prometheus QL but I will try to help.
What you can do is to calculate the difference of this counter between the last hour and the hour before. If you see an increase in the number of failed instances, then at least one instance has failed and you can raise an alert. Otherwise, no job has failed in the previous hour.
Prometheus provides the increase function that is designed specifically for that. So you should be able to answer your question and raise an alert when:
increase(spring_batch_job_seconds_count{name="mainJob",status="FAILED"}[15m]) > 0
As I said, I'm not expert at Prometheus, so I will let you check the syntax. But that's the idea.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to determine a windows process has restarted using PromQL and Grafana - windows

Related

Custom gauge for prometheus Go SDK

Dataflow job has high data freshness and events are dropped due to lateness

Using grafana counter to visualize weather data

clear prometheus metrics from collector

spring batch: alert with grafana & prometheus if a job failed in the last xx minutes

Categories

Resources