What are the units for OpenTSDB's Rate function? - metrics

Let's say I have a byte counter metric that increments once per second. If I plot it, I will get a monotonically increasing plot. The Y-axis is labeled 'bytes'.
I want to plot the rate of change of my counter, so I click the "Rate" checkbox. Rate is change per unit time, but what is that unit? What label should the Y-axis have?

The Rate feature is an OpenTSDB query feature. According to the OpenTSDB docs:
The rate is the first derivative of the values. It's defined as (v2 -
v1) / (t2 - t1). Therefore you will get the rate of change per
second. Currently the rate of change between millisecond values
defaults to a per second calculation.
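
To make the unit concrete, here is a minimal sketch of the quoted formula applied to a byte counter. It is only an illustration, not OpenTSDB code; the function name and the sample data are made up, and timestamps are assumed to be in seconds.

```python
def rates_per_second(samples):
    """Return (timestamp, rate) pairs from (timestamp_seconds, value) samples.

    Each rate is (v2 - v1) / (t2 - t1), so a byte counter yields bytes per second.
    """
    rates = []
    for (t1, v1), (t2, v2) in zip(samples, samples[1:]):
        rates.append((t2, (v2 - v1) / (t2 - t1)))
    return rates


# Hypothetical byte counter sampled once per second.
samples = [(0, 0), (1, 1500), (2, 3000), (3, 5000)]
print(rates_per_second(samples))  # [(1, 1500.0), (2, 1500.0), (3, 2000.0)]
```

So for a counter in bytes, the Y-axis label after ticking "Rate" would be bytes per second.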

Related

Algorithm / data structure for rate of change calculation with limited memory

Certain sensors are to trigger a signal based on the rate of change of the value rather than a threshold.
For instance, heat detectors in fire alarms are supposed to trigger an alarm quicker if the rate of temperature rise is higher: A temperature rise of 1K/min should trigger an alarm after 30 minutes, a rise of 5K/min after 5 minutes and a rise of 30K/min after 30 seconds.
 
I am wondering how this is implemented in embedded systems, where resources are scarce. Is there a clever data structure to minimize the data stored?
 
The naive approach would be to measure the temperature every 5 seconds or so and keep the data for 30 minutes. On these data one can calculate change rates over arbitrary time windows. But this requires a lot of memory.
 
I thought about small windows (e.g. 10 seconds) for which min and max are stored, but this would not save much memory.
 
From a mathematical point of view, the examples you have described can be greatly simplified:
1K/min for 30 mins equals a total change of 30K
5K/min for 5 mins equals a total change of 25K
Obviously there is some adjustment to be made because you have picked round numbers for the example, but it sounds like what you care about is having a single threshold for the total change. This makes sense because taking the integral of a differential results in just a delta.
However, if we disregard the numeric example and just focus on your original question then here are some answers:
First, it has already been mentioned in the comments that one byte every five seconds for half an hour is really not very much memory at all for almost any modern microcontroller, as long as you are able to keep your main RAM turned on between samples, which you usually can.
If however you need to discard the contents of RAM between samples to preserve battery life, then a simpler method is just to calculate one differential at a time.
In your example you want a much higher sample rate (every 5 seconds) than the period over which you wish to calculate the delta (e.g. 30 minutes). You can reduce your storage needs to a single data point if you make your sample period equal to your delta period. The single previous value could be stored in a small battery-retained memory (e.g. backup registers on STM32).
Obviously if you choose this approach you will have to compromise between accuracy and latency, but maybe 30 seconds would be a suitable timebase for your temperature alarm example.
You can also set several thresholds of K/sec, and then allocate counters to count how many consecutive times each threshold has been exceeded. This requires only one extra integer per threshold.
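The single-previous-value idea and the per-threshold counters can be sketched as follows. This is only an illustration in Python rather than firmware code; the class name, the 30-second sample period, and the threshold table (taken from the numbers in the question) are assumptions.

```python
class RateOfRiseDetector:
    """Keeps one previous reading plus one counter per threshold."""

    def __init__(self, thresholds):
        # thresholds: list of (rate_k_per_min, consecutive_samples_needed)
        self.thresholds = thresholds
        self.counters = [0] * len(thresholds)
        self.previous = None

    def update(self, temperature_k, interval_min):
        """Feed one reading; return True if any threshold's counter is full."""
        if self.previous is None:
            self.previous = temperature_k
            return False
        rate = (temperature_k - self.previous) / interval_min
        self.previous = temperature_k
        alarm = False
        for i, (limit, needed) in enumerate(self.thresholds):
            # Count consecutive exceedances; reset on the first miss.
            self.counters[i] = self.counters[i] + 1 if rate >= limit else 0
            if self.counters[i] >= needed:
                alarm = True
        return alarm


# Assumed 30 s sample period: 1 K/min for 30 min = 60 samples,
# 5 K/min for 5 min = 10 samples, 30 K/min for 30 s = 1 sample.
detector = RateOfRiseDetector([(1.0, 60), (5.0, 10), (30.0, 1)])
detector.update(20.0, 0.5)
print(detector.update(35.0, 0.5))  # 15 K in 30 s is 30 K/min, so this alarms
```

The state is one stored temperature and three small integers, regardless of how long the longest window is.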
In signal processing terms, the procedure you want to perform is:
Apply a low-pass filter to smooth quick variations in the temperature
Take the derivative of its output
The cut-off frequency of the filter would be set according to the time frame. There are 2 ways to do this.
You could apply a FIR (finite impulse response) filter, which is a weighted moving average over the time frame of interest. Naively, this requires a lot of memory, but it's not bad if you do a multi-stage decimation first to reduce your sample rate. It ends up being a little complicated, but you have fine control over the response.
You could apply an IIR (infinite impulse response) filter, which utilizes feedback of the output. The exponential moving average is the simplest example of this. These filters require far less memory (only a few samples' worth), but your control over the precise shape of the response is limited. A classic example like the Butterworth filter would probably be great for your application, though.
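As a rough sketch of the IIR route, here is the exponential moving average followed by a first difference. It uses the simple EMA rather than a Butterworth design, and the class name and the `alpha` parameter are illustrative assumptions; the smoothing factor would have to be tuned to the time frame of interest.

```python
class SmoothedRate:
    """Exponential moving average followed by a first difference."""

    def __init__(self, alpha):
        self.alpha = alpha      # smoothing factor in (0, 1]; smaller = smoother
        self.smoothed = None    # the only state carried between samples

    def update(self, value, dt):
        """Return the rate of change of the smoothed signal per unit of dt."""
        if self.smoothed is None:
            self.smoothed = value
            return 0.0
        previous = self.smoothed
        self.smoothed = previous + self.alpha * (value - previous)
        return (self.smoothed - previous) / dt


f = SmoothedRate(alpha=0.5)
f.update(0.0, 1.0)
print(f.update(10.0, 1.0))  # smoothed value moves from 0 to 5, so the rate is 5.0
```

Only one number is stored between samples, which is the memory advantage over keeping a full window for a FIR filter.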

Differences between __execute-count value and values gathered by the Metrics Reporting API v2

I have run a topology, and I used the Meter type in the Metrics Reporting API v2. In the execute method I mark this metric, so it marks an event whenever the execute method is called. But when I compare this value with the __execute-count, I see huge differences. Does anyone know why this happens?
These are the values from my log which are gathered at the same time:
9:v7 __execute-count {v0:v7=44500}
9:v7 tuple_inRate.count 664129
Update:
When I use the mark method on the Meter metric, I will get different results in comparison with the Counter metric. But still, I do not understand why the values from the counter metric (tuple counter) are not the same as the __execute-count.
As given in this answer, Storm's internal metrics are only estimates based on a sample of the real data flow. By default, Storm uses 5% of incoming tuples to make those estimates. This may lead to inaccuracies for extremely high or low throughputs.
EDIT: The documentation describes the following:
In general all of these tuple count metrics are randomly sub-sampled unless otherwise stated. This means that the counts you see both on the UI and from the built in metrics are not necessarily exact. In fact by default we sample only 5% of the events and estimate the total number of events from that. The sampling percentage is configurable per topology through the topology.stats.sample.rate config. Setting it to 1.0 will make the counts exact, but be aware that the more events we sample the slower your topology will run (as the metrics are counted in the same code path as tuples are processed). This is why we have a 5% sample rate as the default.
EDIT 2 In this post, there is more information about the estimation:
The way it works is that if you choose a sampling rate of 0.05, it will pick a random element of the next 20 events in which to increase the count by 20. So if you have 20 tasks for that bolt, your stats could be off by +-380.
By the way, execute_count is just an increasing number, while your tuple_inRate.count is a rate, isn't it?
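To see where the +-380 figure comes from, here is a small simulation of the scheme as the quoted post describes it. This is not Storm's actual code; the function name and structure are made up for illustration.

```python
import random


def sampled_count(n_events, sample_rate=0.05, rng=random):
    """Estimate n_events by counting one random event per block of 1/sample_rate."""
    block = round(1 / sample_rate)          # 20 events per block at 5%
    estimate = 0
    position, target = 0, rng.randrange(block)
    for _ in range(n_events):
        if position == target:
            estimate += block               # one sampled event stands for 20
        position += 1
        if position == block:               # start the next block
            position, target = 0, rng.randrange(block)
    return estimate


true_count = 44510
print(true_count, sampled_count(true_count))
```

Every complete block of 20 contributes exactly 20, so only the last, partial block can be wrong, by at most 19 events per task. With 20 tasks that adds up to 20 x 19 = 380. The estimate is also always a multiple of 20.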

Why Std. Dev. total in Jmeter has value of '8596.41' while all transactions are showing '0.00'

Why Std. Dev. total in Jmeter has the value of '8596.41' while all transactions are showing '0.00'?
The standard deviation for each individual sampler is 0.00 because there is only one request per sampler, and a single data point has no spread. That is also why Average, Min, and Max all show the same number, "4038", in the first row.
The 6th row holds the Total values. Its Avg, Min, and Max fields cover all five requests, and the average is calculated from the five values above. The same applies to the Standard Deviation column: the value in the last row is calculated from the individual average values in the five rows above. The std. dev. of the five values 4038, 10054, 12793, 26361, 2002 is 8596.408939, which is ~8596.41.
Please refer to this link for step-by-step calculations to work out the Standard Deviation
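If you want to check the number yourself, the figure above is reproduced by the population standard deviation of the five values (a quick check with Python's standard library, not how JMeter computes it internally):

```python
import statistics

response_times = [4038, 10054, 12793, 26361, 2002]

print(round(statistics.mean(response_times), 1))    # 11049.6
print(round(statistics.pstdev(response_times), 2))  # 8596.41

# A single value has no spread, which is why each individual row shows 0.00.
print(statistics.pstdev([4038]))
```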

How do I weight my rate by sample size (in Datadog)?

So I have an ongoing metric of events. They are either tagged as success or fail. So I have 3 numbers; failed, completed, total. This is easily illustrated (in Datadog) using a stacked bar graph like so:
So the dark part is the failures. Looking at the y scale and the dashed red line for scale, a human can easily tell whether the rate is a problem and significant. To me that means a failure rate in excess of 60%, sustained over at least some time (10 minutes?), with enough events in that period to consider the rate exceptional.
So I am looking for some sort of formula that starts with: failures divided by total (giving me a score between 0 and 1) and then multiplies this somehow again with the total and some thresholds that I decide means that the total is high enough for me to get an automated alert.
For extra credit, here is the actual Datadog metric that I am trying to get to work:
(sum:event{status:fail}.rollup(sum, 300) / sum:event{}.rollup(sum, 300))
And I am watching for 15 minutes and alerting on a score above 0.75. But I am not sure about sum, count, avg, or rollup. And of course this alert will send me mail during the night, when the total number of events drops low enough that a high failure rate isn't proof of any problem.
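To pin down the logic being asked for, independent of Datadog syntax, one possible shape is a rate check gated by a minimum sample size. This is only a sketch of the idea in Python, not a Datadog query; the function name and the `min_total` value of 50 are made-up assumptions.

```python
def should_alert(failed, total, min_rate=0.75, min_total=50):
    """Alert only when the failure rate is high and the sample is big enough."""
    if total < min_total:
        return False            # too few events for the rate to mean much
    return failed / total > min_rate


print(should_alert(failed=3, total=4))     # False: 75% of only 4 events
print(should_alert(failed=80, total=100))  # True: 80% of 100 events
```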

Metrics doesn't decay when no values are reported

I am using Codahale Metrics for monitoring purposes. Let's say there is a spike in latency at some point, and later no values are reported because there is no traffic. The value in the graph then stays as is (I am using a histogram). At times this gives the impression that the spike persists and needs to be addressed, when it actually means that no values have been reported since and hence the graph doesn't decay. Am I missing a config parameter, or is this behaviour expected?
The way we update the metrics is
metrics.processingTime.update(processingTime);
So, when there is no traffic, we don't update this metric.
I know that the histogram takes into consideration datapoints from the past (for an irregular period of time) in order to display a statistical image of the data.
When there are no new datapoints, only the outlier is taken into consideration and averaged on and on.
The meters have the same behavior, displaying the data through moving averages of 1, 5, and 15 minutes.
The solution in the histogram case is to use HDRhistogram and flush it periodically.
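The "flush it periodically" idea can be illustrated with a toy recorder. This is neither the Metrics nor the HdrHistogram API; the class and method names are invented for the sketch. Each flush reports only the values recorded since the previous flush, so an interval without traffic reports nothing instead of repeating the old spike.

```python
class IntervalRecorder:
    """Collects values and hands them over, then starts empty again."""

    def __init__(self):
        self._values = []

    def update(self, value):
        self._values.append(value)

    def flush(self):
        """Return a summary of the values since the last flush, or None if none."""
        if not self._values:
            return None
        summary = {"count": len(self._values), "max": max(self._values)}
        self._values = []
        return summary


recorder = IntervalRecorder()
recorder.update(900)     # latency spike
print(recorder.flush())  # {'count': 1, 'max': 900}
print(recorder.flush())  # None: no traffic, so the spike is not reported again
```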
