I am monitoring a Spring Boot application in Prometheus with metrics generated by Micrometer.
For CPU usage, there is a metric called system_cpu_usage.
I observe that its value is mostly under 1. Is this expected? When the same application is monitored in VisualVM, the CPU graph is always above the 15 percent range.
Do I need to multiply the value by 100?
Yes. Micrometer's system_cpu_usage is reported as a value between 0.0 and 1.0 (1.0 meaning the CPU is completely occupied), so you need to multiply it by 100 to get a percentage.
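For illustration, a minimal sketch of reading that gauge in plain Java and converting it to a percentage. It assumes Micrometer's ProcessorMetrics binder (which Spring Boot registers automatically) and uses a SimpleMeterRegistry only for the demo; the dotted name system.cpu.usage is what Prometheus renders as system_cpu_usage:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.system.ProcessorMetrics;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class CpuUsageDemo {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        new ProcessorMetrics().bindTo(registry); // registers system.cpu.usage among others

        // Gauge value is a fraction between 0.0 and 1.0 (the very first sample may be NaN
        // until the underlying OperatingSystemMXBean has something to report).
        double fraction = registry.get("system.cpu.usage").gauge().value();
        System.out.printf("System CPU usage: %.1f%%%n", fraction * 100);
    }
}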
In our system we need to calculate and visualize a few business metrics, such as:
total number of transactions processed over last configured time interval
average processing time over last configured time interval
max processing time over last configured time interval
min processing time over last configured time interval
I have to expose these metrics somehow from the Spring Boot application.
I checked whether it is possible to calculate this kind of metric at the application level (Spring Boot) using the Micrometer library built into Spring Boot Actuator. Unfortunately, I don't see that its Meters allow calculating the average or minimum value of a particular method's execution time over a configured time interval.
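To make the limitation concrete, here is a minimal sketch (the meter name and the recorded duration are made up) of a Micrometer Timer: count and mean are cumulative since startup, only max honours the sliding window set by distributionStatisticExpiry, and there is no min at all:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class ProcessingTimeMetricsDemo {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        Timer timer = Timer.builder("transaction.processing.time")
                .distributionStatisticExpiry(Duration.ofMinutes(5)) // affects max, not count/mean
                .register(registry);

        timer.record(Duration.ofMillis(120)); // record one processed transaction

        System.out.println("count (cumulative):   " + timer.count());
        System.out.println("mean ms (cumulative): " + timer.mean(TimeUnit.MILLISECONDS));
        System.out.println("max ms (windowed):    " + timer.max(TimeUnit.MILLISECONDS));
    }
}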
Using Prometheus also doesn't seem like the best idea because it works on a pull-based principle. It seems to me that this may make the results inaccurate and delayed because of the scraping intervals.
My last idea is to write each transaction's processing time to InfluxDB or a similar DB and then use queries to get the results I need (the business metrics). However, I am worried about the efficiency of this solution, as it adds extra time to each business transaction.
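For concreteness, this is roughly what that idea could look like with the influxdb-java client (an assumed choice; the database, measurement, and field names are made up). enableBatch() buffers points and flushes them from a background thread, so the write itself does not block the business transaction:

import org.influxdb.BatchOptions;
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;

import java.util.concurrent.TimeUnit;

public class TransactionTimingWriter {
    private final InfluxDB influxDB;

    public TransactionTimingWriter(String url, String user, String password) {
        this.influxDB = InfluxDBFactory.connect(url, user, password);
        influxDB.setDatabase("business_metrics");    // hypothetical database name
        influxDB.enableBatch(BatchOptions.DEFAULTS); // asynchronous, batched writes
    }

    public void recordProcessingTime(String transactionType, long millis) {
        influxDB.write(Point.measurement("transaction_processing")
                .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                .tag("type", transactionType)
                .addField("processing_time_ms", millis)
                .build());
    }
}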
What do you think? Am I right about the limitations of Micrometer? Does the InfluxDB idea sound reasonable? Or is there perhaps another way to approach this problem?
I have run a topology, and I used the Meter type from the Metrics Reporting API v2. In the execute method I mark this metric, so an event is marked whenever execute is called. But when I compare this value with __execute-count, I see huge differences. Does anyone know why this happens?
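For context, this is roughly how the metric is registered and marked (bolt and meter names are made up; it assumes the Storm 2.x Metrics V2 API, where the Meter comes from the TopologyContext):

import com.codahale.metrics.Meter;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

import java.util.Map;

public class TupleRateBolt extends BaseRichBolt {
    private transient Meter tupleInRate;
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.tupleInRate = context.registerMeter("tuple_inRate"); // Metrics V2 meter
    }

    @Override
    public void execute(Tuple input) {
        tupleInRate.mark(); // marked on every execute() call, no sampling involved
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams in this sketch
    }
}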
These are the values from my log, gathered at the same time:
9:v7 __execute-count {v0:v7=44500}
9:v7 tuple_inRate.count 664129
Update:
When I use the mark method on the Meter metric, I get different results than with the Counter metric. But I still do not understand why the values from the Counter metric (the tuple counter) are not the same as __execute-count.
As given in this answer, Storm's internal metrics are only estimated from a percentage of the real data flow. By default it uses 5% of incoming tuples to make those estimations, which may lead to inaccuracies at extremely high or low throughputs.
EDIT: The documentation describes the following:
In general all of these tuple count metrics are randomly sub-sampled unless otherwise stated. This means that the counts you see both on the UI and from the built in metrics are not necessarily exact. In fact by default we sample only 5% of the events and estimate the total number of events from that. The sampling percentage is configurable per topology through the topology.stats.sample.rate config. Setting it to 1.0 will make the counts exact, but be aware that the more events we sample the slower your topology will run (as the metrics are counted in the same code path as tuples are processed). This is why we have a 5% sample rate as the default.
EDIT 2: In this post there is more information about the estimation:
The way it works is that if you choose a sampling rate of 0.05, it will pick a random element of the next 20 events in which to increase the count by 20. So if you have 20 tasks for that bolt, your stats could be off by +-380.
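A toy simulation (plain Java, not Storm code) of that sub-sampling scheme may make the estimation error easier to see: with a 0.05 sample rate, one random event out of every 20 bumps the count by 20, so each task's reported count can be off by up to +-19 from its true count:

import java.util.Random;

public class SampleRateSimulation {
    public static void main(String[] args) {
        Random random = new Random();
        int window = 20;                            // 1 / 0.05
        int sampledOffset = random.nextInt(window); // which event in the current window is counted
        int trueCount = 0;
        int estimatedCount = 0;

        for (int event = 0; event < 664_129; event++) {
            trueCount++;
            if (event % window == sampledOffset) {
                estimatedCount += window;           // one sampled event stands in for 20
            }
            if (event % window == window - 1) {
                sampledOffset = random.nextInt(window); // pick a new random event for the next window
            }
        }
        System.out.println("true count:      " + trueCount);
        System.out.println("estimated count: " + estimatedCount);
    }
}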
By the way, __execute-count is just an increasing number, while your tuple_inRate.count is a rate, isn't it?
We have a cluster of instances, where each instance has a DropWizard metrics gatherer.
We're also trying to leverage AppDynamics custom metrics: a custom script hits the endpoint exposed by DropWizard (/metrics) and sends the metrics of interest to the AppDynamics Controller.
AppDynamics has two cluster rollup strategies for how a metric is displayed in the whole application (tier) view: SUM and AVG.
While this works well for things like counts (SUM is used) and average processing times (AVG is used), we currently have no idea how to aggregate the per-instance percentiles exposed by DropWizard; neither SUM nor AVG looks correct.
Example:
instance1: p75=400
instance2: p75=600
instance3: p75=800
SUM will give 1800, which of course isn't useful at all.
AVG will give 600, which isn't correct either; we lose track of the higher bound.
If AppDynamics had a MAX cluster rollup, that would be more or less fair, though still not correct. But AppDynamics doesn't have that.
We also understand that the only fully correct way of gathering cluster-wide percentiles is to aggregate the raw data from all nodes in one place (e.g. Logstash) rather than on each instance. But for now this is what we have: custom metrics sent periodically from each instance.
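To illustrate why averaging is misleading, here is a toy example with made-up latencies: a quiet but slow instance pulls the averaged p75 far away from the p75 computed over the merged raw samples, which is the only number that is actually correct:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PercentileRollupDemo {

    // Nearest-rank percentile over a list of samples
    static double percentile(List<Integer> samples, double p) {
        List<Integer> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size());
        return sorted.get(rank - 1);
    }

    public static void main(String[] args) {
        List<Integer> instance1 = Collections.nCopies(9, 100);  // busy instance, fast responses
        List<Integer> instance2 = Collections.nCopies(3, 1000); // quiet instance, slow responses

        double averagedP75 = (percentile(instance1, 0.75) + percentile(instance2, 0.75)) / 2; // 550

        List<Integer> merged = new ArrayList<>(instance1);
        merged.addAll(instance2);
        double globalP75 = percentile(merged, 0.75); // 100

        System.out.println("avg of per-instance p75s: " + averagedP75);
        System.out.println("p75 over merged samples:  " + globalP75);
    }
}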
It would be great if anyone could suggest something regarding this.
Thanks in advance,
I'm monitoring Docker containers via Prometheus.io. My problem is that I'm only getting cpu_user_seconds_total or cpu_system_seconds_total.
How do I convert this ever-increasing value to a CPU percentage?
Currently I'm querying:
rate(container_cpu_user_seconds_total[30s])
But I don't think it is quite correct (compared to top).
How do I convert cpu_user_seconds_total to a CPU percentage (like in top)?
rate() returns a per-second value, so multiplying by 100 will give a percentage:
rate(container_cpu_user_seconds_total[30s]) * 100
I also found this way to get CPU Usage to be accurate:
100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100)
From: http://www.robustperception.io/understanding-machine-cpu-usage/
Note that container_cpu_user_seconds_total and container_cpu_system_seconds_total are per-container counters, which show the CPU time used by a particular container in user space and in kernel space respectively (see these docs for more details). cAdvisor exposes an additional metric, container_cpu_usage_seconds_total, which equals the sum of container_cpu_user_seconds_total and container_cpu_system_seconds_total, i.e. it shows the overall CPU time used by each container. See these docs for more details.
container_cpu_usage_seconds_total is a counter, i.e. it increases over time, so by itself it isn't very informative for determining CPU usage at a particular moment. Prometheus provides the rate() function, which returns the average per-second increase rate of a counter. For example, the following query returns the average per-second increase of the per-container container_cpu_usage_seconds_total metrics over the last 5 minutes (note the 5m lookbehind window in square brackets):
rate(container_cpu_usage_seconds_total[5m])
This is basically the average number of CPU cores used during the last 5 minutes. Just multiply it by 100 to get CPU usage in %. Note that the resulting value may exceed 100% if the container used more than a single CPU core over the last 5 minutes.
In production Kubernetes, rate(container_cpu_usage_seconds_total[5m]) usually returns a ton of time series with many long labels, so it is better to use the following queries:
The average number of CPU cores used during the last 5 minutes per each pod:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
The average number of CPU cores used during the last 5 minutes per each node:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)
The average number of CPU cores used during the last 5 minutes per each namespace:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
The container!="" filter removes superfluous metrics related to cgroups hierarchy - see this answer for more details.
For Windows Users - wmi_exporter
100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100)
What is the performance impact of setting topology.stats.sample.rate: 1.0 in the YAML config?
How does this work?
topology.stats.sample.rate configures the rate at which Storm topology statistics are calculated.
The default value in defaults.yaml is 0.05, which means only five out of every 100 events are taken into account.
A value of 1 means statistics are calculated for every tuple.
Is this going to decrease performance? Many will likely say yes, but since every environment is different, it is better to measure it yourself: increase and decrease the value and measure the throughput of your topology.
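For what it's worth, a minimal (untested) sketch of overriding the rate for a single topology through Storm's Java Config API while experimenting, instead of editing defaults.yaml:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SampleRateConfigDemo {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... set spouts and bolts here ...

        Config conf = new Config();
        conf.put(Config.TOPOLOGY_STATS_SAMPLE_RATE, 1.0); // exact counts, higher overhead

        StormSubmitter.submitTopology("sample-rate-test", conf, builder.createTopology());
    }
}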