I'm a Prometheus newbie and have been trying to figure out the right query to get the last continuous uptime for my service.
For example, if the present time is 0:01:20, my service was up at 0:00:00, went down at 0:01:01, and came back up at 0:01:10, I'd like to see an uptime of "10 seconds".
I'm mainly looking at the up{} metric, possibly combined with functions (changes(), rate(), etc.), but no luck so far. I don't see any other Prometheus metric similar to up either.
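To pin down the computation I'm after, here is a rough Python sketch (not a PromQL answer; the sample data is made up for illustration) of "seconds since the service last came back up":

```python
# Sketch: compute "last continuous uptime" from scraped samples of the
# `up` metric, oldest sample first. Each sample is (unix_timestamp, value).

def last_continuous_uptime(samples, now):
    """Seconds since the most recent transition to up, or 0 if down."""
    if not samples or samples[-1][1] != 1:
        return 0  # currently down (or no data)
    start = samples[-1][0]
    # walk backwards while the service stayed up
    for ts, value in reversed(samples):
        if value != 1:
            break
        start = ts
    return now - start

# Example matching the question: up at 0s, down at 61s, up again at 70s.
samples = [(0, 1), (30, 1), (61, 0), (70, 1), (80, 1)]
print(last_continuous_uptime(samples, now=80))  # -> 10
```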
The problem is that you need something which tells you when your service was actually up, as opposed to when the node was up :)
We use the following (I hope one of these, or the general idea behind them, will help):
1. When we look at a host we use node_time{...} - node_boot_time{...}
2. When we look at a specific process / container (Docker via cAdvisor in our case) we use node_time{...} - on(instance) group_right container_start_time_seconds{name=~"..."}
The following PromQL query can be used to calculate application uptime in seconds:
time() - process_start_time_seconds
This query works for all applications written in Go that use either the github.com/prometheus/client_golang or the github.com/VictoriaMetrics/metrics client library, both of which expose the process_start_time_seconds metric by default. This metric contains the Unix timestamp of the application's start time.
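The arithmetic behind the query is just "now minus start timestamp". A tiny Python illustration (the start timestamp here is faked; a real one would come from scraping the target's /metrics endpoint):

```python
import time

# process_start_time_seconds is a Unix timestamp of the process start.
# Here we fake one 300 seconds in the past for illustration.
process_start_time_seconds = time.time() - 300

# This is exactly what `time() - process_start_time_seconds` computes.
uptime = time.time() - process_start_time_seconds
print(round(uptime))  # ~300
```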
Kubernetes exposes the container_start_time_seconds metric for each started container by default, so the following query can be used to track container uptimes in Kubernetes:
time() - container_start_time_seconds{container!~"POD|"}
The container!~"POD|" filter is needed in order to filter out auxiliary time series:
Time series with the container="POD" label reflect e.g. pause containers - see this answer for details.
Time series without a container label correspond to e.g. the cgroups hierarchy - see this answer for details.
If you need to calculate the overall per-target uptime over a given time range, you can estimate it with the up metric. Prometheus automatically generates an up metric for each scrape target, setting it to 1 on each successful scrape and to 0 otherwise. See these docs for details. So the following query can be used to estimate the total uptime in seconds per scrape target during the last 24 hours:
avg_over_time(up[24h]) * (24*3600)
See avg_over_time docs for details.
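The estimate works because averaging 0/1 samples gives the fraction of successful scrapes, and multiplying that fraction by the window length converts it into seconds. A quick Python illustration with made-up samples:

```python
# `up` samples over a window: 1 for a successful scrape, 0 for a failed one.
samples = [1, 1, 0, 1, 1, 1, 0, 1]  # pretend these cover a 24h window

window_seconds = 24 * 3600
fraction_up = sum(samples) / len(samples)        # what avg_over_time returns
estimated_uptime = fraction_up * window_seconds  # the full query's result
print(estimated_uptime)  # 6/8 of a day = 64800.0
```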
Related
I'm trying to set up some graphs using Prometheus with Grafana in a home lab, on a single-node Kubernetes cluster deployed using minikube. I also have some stress tests to run on the cluster. I want to measure the results of the stress tests using Prometheus, so I need help with the following queries:
CPU usage of the node/cluster and of an individual pod by given name, over a period of time (e.g. 5 min).
Memory usage of the node/cluster and of an individual pod by given name, over a period of time (e.g. 5 min).
Disk or file system usage of the node/cluster and of an individual pod by given name, over a period of time (e.g. 5 min).
Latency of an individual pod by given name, over a period of time (e.g. 5 min).
If anyone can help with that, knows a Grafana dashboard for it (I've already tried 737 and 6417), or can give a hint about which metrics I need to consult, I'd appreciate it. (I've tried rate(container_cpu_usage_seconds_total[5m]), which gives me some sort of result for the CPU usage query for the whole node.)
You can use Prometheus's labels to get metrics for a specific pod:
CPU (you don't have to provide all labels; you can select only one if it's unique):
sum(rate(container_cpu_usage_seconds_total{pod=~"<your_pod_name>", container=~"<your_container_name>", kubernetes_io_hostname=~"<your_node_name>"}[5m])) by (pod,kubernetes_io_hostname)
Memory:
sum(container_memory_working_set_bytes{pod=~"<your_pod_name>", container=~"<your_container_name>", kubernetes_io_hostname=~"<your_node_name>"}) by (pod,kubernetes_io_hostname)
Disk:
kubelet_volume_stats_used_bytes{kubernetes_io_hostname=~"<your_node_name>$", persistentvolumeclaim=~".*<your_pod_name>"}
Latency:
You can collect it in your application (web server) via a Prometheus client library, i.e. at the application level.
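As a sketch of the latency point: with the official prometheus_client Python library, a request handler can record its durations into a Histogram. The metric name and handler below are my own choices, not something your cluster already exposes:

```python
import time
from prometheus_client import Histogram

# Histogram of request durations; exposed on /metrics for Prometheus to scrape.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time spent handling a request",
)

def handle_request():
    with REQUEST_LATENCY.time():  # observes elapsed seconds on exit
        time.sleep(0.01)          # stand-in for real work

handle_request()
# With latencies recorded, a query like
#   histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# gives an estimate of the 95th-percentile latency over 5 minutes.
```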
I've been working on some Prometheus templating, and whenever I look at other examples of templating I keep encountering these definitions:
$jobs = label_values(job)
$instance = up{job=~"$jobs"}
I understand that $jobs is a variable being created, but I have next to no clue what up is doing. I've looked online and I can't really narrow down the search enough for a generic word like 'up', haha.
My best guess is that it makes the $instance variable equal only to cases where the job label matches $jobs? I'm really not sure.
Any help would clarify a bunch. Thanks!
According to the Jobs and instances Prometheus documentation:
When Prometheus scrapes a target, it attaches some labels automatically to the scraped time series which serve to identify the scraped target:
job: The configured job name that the target belongs to.
instance: The <host>:<port> part of the target's URL that was scraped.
If either of these labels are already present in the scraped data, the behavior depends on the honor_labels configuration option.
For each instance scrape, Prometheus stores a sample in the following time series:
up{job="<job-name>", instance="<instance-id>"}: 1 if the instance is healthy, i.e. reachable, or 0 if the scrape failed.
The up time series is useful for instance availability monitoring.
I am new to NiFi. I am using GetMongo to extract documents from MongoDB, but the same result keeps coming back again and again, even though the query matches only 2 documents. The query is {"qty":{$gt:10}}.
There is a similar question regarding this. Let me quote what I had said there:
"GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is changing the Run Schedule and/or Scheduling Strategy. You can find them by right clicking on the processor and clicking Configure. By default, Run Schedule will be 0 sec which means running continuously. Changing it to, say, 60 min will make the processor run every one hour. This will still read the same documents from MongoDB again every one hour but since you have mentioned that you just want to run it only once, I'm suggesting this approach."
The question can be found here.
I was following the article at http://www.testautomationguru.com/jmeter-real-time-results-influxdb-grafana/ to get real-time results through Grafana + InfluxDB, but there is one value I don't know how to get: the JMeter TPS (throughput).
I tried "jmeter.all.h.count", but it did not seem to be the value I wanted.
I wrote the blog you were referring to!
We cannot expect the backend-listener metrics to give you results as accurate as the aggregate report (especially percentiles, averages, etc.).
The backend listener basically gives metrics over time, so you should plot a graph using the data over time. If you try to use a Grafana single-stat panel with that data, you will see a complete mismatch.
In the blog I was using a modified apache_core.jar library to get the results you are actually expecting; however, I stopped sharing the modified lib after JMeter 2.13 / 3.0.
You are referring to the wrong component for these values.
For GraphiteBackendListenerClient, sent values are described here:
http://jmeter.apache.org/usermanual/realtime-results.html
For InfluxdbBackendListenerClient, sent values can be found here:
https://github.com/apache/jmeter/blob/trunk/src/components/org/apache/jmeter/visualizers/backend/influxdb/InfluxdbBackendListenerClient.java
Neither component sends throughput, as it can be computed in Grafana from the other metrics.
Found your answer in the InfluxDB Grafana tutorial.
Use jmeter.all.a.count:
If I query "jmeter.all.a.count", which has the number of requests processed each second by the server, I get the output below. [Number of requests processed / unit time = throughput]
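In other words, given the per-second request counts from jmeter.all.a.count, throughput is just total requests divided by elapsed time. A small Python illustration with made-up counts:

```python
# Per-second request counts, e.g. as returned by querying jmeter.all.a.count.
counts_per_second = [50, 52, 48, 50, 50]  # five one-second buckets (made up)

total_requests = sum(counts_per_second)
elapsed_seconds = len(counts_per_second)
throughput = total_requests / elapsed_seconds  # requests per second
print(throughput)  # 50.0
```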
I am a beginner for Nagios and have read the documentation here.
However, I still have some questions and appreciate if you can please help me:
1) What are the different metrics that Nagios captures? I know it captures CPU, network, disk metrics, etc., but I am looking for more detailed information, like CPU idle time, CPU busy time, etc.
2) If, say, Nagios captures 5 metrics for CPU, where can I get the meaning of each captured metric?
3) Can I export the metrics captured by Nagios to a CSV file or to an external database?
4) Can we collect custom metrics?
5) How are these metrics captured by Nagios, i.e. what is the mechanism by which Nagios works?
Any help is appreciated. Thanks!
I feel that your questions are somewhat generic. Nagios captures snapshots of data through frequent sampling. You can always access the Nagios logs and build your own set of detailed information if you are looking to mine historical data like busy time and idle time; the Logstash / Elasticsearch / Kibana stack springs to mind.
As far as your third question is concerned: yes, it is possible to get CSV output for availability. In the nagios/cgi-bin/avail.cgi GET request, pass &csvoutput as a parameter and you will get the CSV-format availability reports for all hosts.
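As a sketch, the request URL might be built like this. The host name and the report parameters other than csvoutput are assumptions; check your avail.cgi for the exact parameters it expects:

```python
from urllib.parse import urlencode

# Hypothetical Nagios host and report parameters; adjust for your setup.
base = "http://nagios.example.com/nagios/cgi-bin/avail.cgi"
params = {
    "host": "all",             # assumed report parameters
    "timeperiod": "last7days",
    "csvoutput": "",           # the switch that requests CSV output
}
url = base + "?" + urlencode(params)
print(url)
# Fetch with e.g. urllib.request.urlopen(url) against your Nagios server.
```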
Please check this link for more info :
Link
and also:
Link 2