clear prometheus metrics from collector - go

I'm trying to modify prometheus mesos exporter to expose framework states:
https://github.com/mesos/mesos_exporter/pull/97/files
A bit about mesos exporter - it collects data from both mesos /metrics/snapshot endpoint, and /state endpoint.
The issue with the latter, both with the changes in my PR and with existing metrics reported on slaves, is that metrics created lasts for ever (until exporter is restarted).
So if for example a framework was completed, the metrics reported for this framework will be stale (e.g. it will still show the framework is using CPU).
So I'm trying to figure out how I can clear those stale metrics. If I could just clear the entire mesosStateCollector each time before collect is done it would be awesome.
There is a delete method for the different p8s vectors (e.g. GaugeVec), but in order to delete a metric, I need to not only the label name, but also the label value for the relevant metric.

Ok, so seems it was easier than I thought (if only I was familiar with go-lang before approaching this task).
Just need to cast the collector to GaugeVec and reset it:
prometheus.NewGaugeVec(prometheus.GaugeOpts{
Help: "Total slave CPUs (fractional)",
Namespace: "mesos",
Subsystem: "slave",
Name: "cpus",
}, labels): func(st *state, c prometheus.Collector) {
c.(*prometheus.GaugeVec).Reset() ## <-- added this for each GaugeVec
for _, s := range st.Slaves {
c.(*prometheus.GaugeVec).WithLabelValues(s.PID).Set(s.Total.CPUs)
}
},

Related

Custom gauge for prometheus Go SDK

I have a APP that monitors some external jobs (among other things). This does this monitoring every 5 Mins
I'm trying to create a prometheus gauge to get the count of currently running jobs.
Here is how I declared my gauge
JobStats= promauto.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: "myapi",
Subsystem: "app",
Name: "job_count",
Help: "Current running jobs in the system",
ConstLabels: nil,
},
[]string{"l1", "l2", "l3"},
)
in the code that actually counts the jobs I do
metrics.JobStats.WithLabelValues(l1,l2,l3).add(float64(jobs_cnt))
when I query the /metrics endpoint I get the number
The thing is, this metrics only keeps increasing. If I restart the app this get resets to zer & again keeps increasing
I'm using grafana to graph this in a dashboard.
My question is
Get the graph to show the actual number of jobs (instead of ever increasing line)?
Should this be handled in code (like setting this to zero before every collection?) or in grafana?

NATS KV history larger than specified when creating bucket

We are using nats with KeyValue store feature (nats KV). We develop go microservices and use the nats go client. We try to leverage the history feature of nats KV with no success yet.
Certain times using nats, we retrieve a larger history than the history specified when creating the KV.
We create the KV using :
kv, _ := js.CreateKeyValue(&nats.KeyValueConfig{
Bucket: "some-bucket",
Description: "store for some-service",
MaxValueSize: 0,
History: 10, // should we ever get more than 10 elements when reading history ?
TTL: TTL,
MaxBytes: 5000000,
Storage: nats.MemoryStorage,
Replicas: 0,
Placement: nil,
})
and we retrieve values using
kv.History("someId")
When we get results larger than the specified History, we get several KeyValueEntrys with the same delta value.
We are quite write intensive, and also reuse quite a lot the same key id :
we write values until a certain point,
call kv.Purge("someId")
and then we may reuse "someId" later on in the process.
Writes and read are asynchronous and concurrent.
Here is our client go.mod regarding nats:
github.com/nats-io/nats-server/v2 v2.8.4
github.com/nats-io/nats.go v1.16.0
and we run a nats server version 2.8.4.
note : I did not go far enough in the KV implementation details but I am worried that this is linked with jetstream. It seems like a watcher is created each time and re-reads all previous values regardless of history size. It leads me to another question : is the kv history feature appropriate for read intensive use cases ?
Thanks for your help or pointers on this matter.

How to get Prometheus Node Exporter metrics with JSON format

I deployed Prometheus Node Exporter pod on k8s. It worked fine.
But when I try to get system metrics by calling Node Exporter metric API in my custom Go application
curl -X GET "http://[my Host]:9100/metrics"
The result format was like this
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.7636e-05
go_gc_duration_seconds{quantile="0.25"} 2.466e-05
go_gc_duration_seconds{quantile="0.5"} 5.7992e-05
go_gc_duration_seconds{quantile="0.75"} 9.1109e-05
go_gc_duration_seconds{quantile="1"} 0.004852894
go_gc_duration_seconds_sum 1.291217651
go_gc_duration_seconds_count 11338
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.12.5"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.577128e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 2.0073577064e+10
.
.
.
something like this
Those long texts are hard to parse and I want to get the results in JSON format to parse them easily.
https://github.com/prometheus/node_exporter/issues/1062
I checked Prometheus Node Exporter GitHub Issues and someone recommended prom2json.
But this is not I'm looking for. Because I have to run extra process to execute prom2json to get results. I want to get Node Exporter's system metric by simply calling HTTP request or some kind of Go native packages in my code.
How can I get those Node Exporter metrics in JSON format?
You already mentioned prom2json and you can pull the package into your Go file by importing github.com/prometheus/prom2json.
The sample executable in the repo has the all building blocks you need. First, open the URL and then use the prom2json package to read the data and store the result.
However, you should also have a look at expfmt.TextParser as that is the native way to ingest Prometheus formatted metrics.

Traces not getting sampled in jaeger tracing

I'm new to using Jaeger tracing system and have been trying to implement it for a flask based microservices architecture. Below is my jaeger client config implemented in python:
config = Config(
config = {
'sampler': {
'type': 'const',
'param': 1,
},
'logging': True,
'reporter_batch_size': 1,
},
service_name=service,
)
I read somewhere that Sampling strategy is being used to sample the number of traces especially for the trace which doesn't have any metadata. So as per this config, does it mean that I'm sampling each and every trace or just the few traces randomly? Mysteriously, when I'm passing random inputs to create spans for my microservices, the spans are getting generated only after 4 to 5 minutes. I would like to understand this configuration spec more but not able to.
So as per this config, does it mean that I'm sampling each and every trace or just the few traces randomly?
Using the sampler type as const with 1 as the value means that you are sampling everything.
Mysteriously, when I'm passing random inputs to create spans for my microservices, the spans are getting generated only after 4 to 5 minutes. I would like to understand this configuration spec more but not able to.
There are several things that might be happening. You might not be closing spans, for instance. I recommend reading the following two blog posts to try to understand what might be happening:
Help! Something is wrong with my Jaeger installation!
The life of a span

How to use RabbitMQ http api to see what queue had a messages in a ready state

I have a RabbitMQ server setup with thousands of queues. Of which only about 5 of these are persistent queues. Every now and then there is a back up of a queue that will have about 5-10 messages in a ready state. These messages do not appear to be in the persistent queues. I want to find out which queues had the messages in a ready state, but the only indication that it is happening is on the overview page of the web management console which is for all queues.
Is there a way to query Rabbit to tell me the stat info for messages that were in a ready state for a period of minutes and which queue they were in?
I would use the HTTP API.
http://rabbit-broker:15672/api/queues
This will give you a list of the current queue states in JSON so you'll have to keep polling it. Store the "messages_ready" for given queue "name" for the period you want to monitor. Now you'll be able to see which queues have that backlog spike.
You can use simple curl as well as whichever platform you prefer with an HTTP client.
Please note: the user you'll connect will have to have monitor tag to access all the queue information.
Out of the box there is no easy way AFAIK, you'd have to manually click through the queues and look at their graphs in the UI for the last hour, which is tedious.
I had similar requirements and I found a better way than polling. The docs say that you may get raw samples via api if you use special parameters in the request.
For example in your case, if you are interested in messages with ready state, you may ask your queue for a history of queue lengths, for example last 60 seconds with samples every 1 second (note 15672 is the default port used by rabbitmq_management):
http://rabbitHost:15672/api/queues/vhost/queue?lengths_age=60&lengths_incr=1
For default vhost=/ it will be:
http://rabbitHost:15672/api/queues/%2F/queue?lengths_age=60&lengths_incr=1
Then in the result json there will be some additional _details objects like this:
"messages_ready_details": {
"avg": 8.524590163934427,
"avg_rate": 0.08333333333333333,
"samples": [{
"timestamp": 1532699694000,
"sample": 5
}, {
"timestamp": 1532699693000,
"sample": 11
},
<... more samples ...>
],
"rate": -6.0
},
"messages_ready": 5,
Then on this raw data you may do any stats you need.
Other raw data samples appear if you use differen parameters in
What sampling will appear? What parameters are required for it to appear?
Messages sent and received msg_rates_age / msg_rates_incr
Bytes sent and received data_rates_age / data_rates_incr
Queue lengths lengths_age / lengths_incr
Node statistics (e.g. file descriptors, disk space free) node_stats_age / node_stats_incr

Resources