Ops unit in Grafana - spring-boot

I have put the io.micrometer.core.annotation.Timed annotation on a Spring REST endpoint and configured Prometheus. It provides me with three metrics in Grafana:
myMetricsName_seconds_count
myMetricsName_seconds_sum
myMetricsName_seconds_max
Going by the names, I assume count tells the total number of times the endpoint was called, sum gives the total time all these calls took, and max tells the maximum time taken among all these calls.
In Grafana, the graphs for these three metrics all have the same Y-axis unit: ops.
Shouldn't the units be different?

You are correct that count is the total number of calls and sum is the total duration of those calls. max is the maximum over a moving window (as in "max in the last minute").
The units should indeed be different in Grafana. However, if you are asking whether count should be named something other than myMetricsName_seconds_count, the reason it has "seconds" in the name is to keep it parallel to its matching sum.
That way you can infer that they are correlated and divide the sum by the count to get the average duration.
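For reference, here is a minimal sketch of the setup described in the question, assuming Spring Boot with micrometer-registry-prometheus on the classpath (the controller class and path are illustrative):

    import io.micrometer.core.annotation.Timed;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class DemoController {

        // Publishes myMetricsName_seconds_count, myMetricsName_seconds_sum
        // and myMetricsName_seconds_max via the Prometheus scrape endpoint.
        @Timed("myMetricsName")
        @GetMapping("/demo")
        public String demo() {
            return "ok";
        }
    }

In Grafana you can then chart the average duration with a PromQL expression like rate(myMetricsName_seconds_sum[5m]) / rate(myMetricsName_seconds_count[5m]), whose unit is seconds, whereas a rate over the count series alone is genuinely an operations-per-second ("ops") quantity.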

Related

Differences between __execute-count value and values gathered by the Metrics Reporting API v2

I have run a topology, and I used the Meter type in the Metrics Reporting API v2. In the execute method I mark this metric, so it marks an event whenever the execute method is called. But when I compare this value with __execute-count, I see huge differences. Does anyone know why this happens?
These are the values from my log which are gathered at the same time:
9:v7 __execute-count {v0:v7=44500}
9:v7 tuple_inRate.count 664129
Update:
When I use the mark method on the Meter metric, I get different results than with the Counter metric. But I still do not understand why the values from the Counter metric (the tuple counter) are not the same as __execute-count.
As given in this answer, Storm's internal metrics are only estimates based on a percentage of the real data flow. By default it samples 5% of incoming tuples to make those estimations, which can lead to inaccuracies at extremely high or low throughputs.
EDIT: The documentation describes the following:
In general all of these tuple count metrics are randomly sub-sampled unless otherwise stated. This means that the counts you see both on the UI and from the built in metrics are not necessarily exact. In fact by default we sample only 5% of the events and estimate the total number of events from that. The sampling percentage is configurable per topology through the topology.stats.sample.rate config. Setting it to 1.0 will make the counts exact, but be aware that the more events we sample the slower your topology will run (as the metrics are counted in the same code path as tuples are processed). This is why we have a 5% sample rate as the default.
EDIT 2: In this post there is more information about the estimation:
The way it works is that if you choose a sampling rate of 0.05, it will pick a random element of the next 20 events in which to increase the count by 20. So if you have 20 tasks for that bolt, your stats could be off by +-380.
By the way, __execute-count is just an increasing number, while your tuple_inRate.count is a rate, isn't it?
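To make the quoted mechanism concrete, here is a toy sketch of that kind of sub-sampled counting (not Storm's actual implementation): with a 0.05 sample rate, one randomly chosen event in each block of 20 bumps the count by the whole block size, so each task's count is an estimate that can be off by up to one block.

    import java.util.Random;

    public class SampledCounter {
        private static final int BLOCK = 20; // 1 / 0.05 sample rate
        private final Random random = new Random();
        private long estimate = 0;
        private int position = 0;
        private int chosen = random.nextInt(BLOCK);

        public void onEvent() {
            if (position == chosen) {
                estimate += BLOCK; // credit the whole block to the sampled event
            }
            position++;
            if (position == BLOCK) { // block finished, pick a new sample point
                position = 0;
                chosen = random.nextInt(BLOCK);
            }
        }

        public long estimate() {
            return estimate;
        }
    }

With 20 tasks each running such a counter, each task can be off by up to 19 events per block, which is where the +-380 figure in the quoted post comes from.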

How to accommodate minutely and hourly data in the same visualisation?

Current scenario:
The current dashboard is set to a Sum aggregation at the minutely level, and it works only when the interval is set to minutely. If I change the interval, the graph shows incorrect values. This happens because more than one document is generated per minute, and the correct value for a minute is the sum of the field values over that minute.
So even today we are obliged to use the minute interval, but I'm fine with this.
The hourly documents, however, are designed to ingest data after doing all the math (and we have validated the ingestion logic), so there is one document per hour. This is the reason the visualisation is not able to accommodate both types of data.
If I had one document per minute and one document per hour, then I could have used an average or perhaps a max metric. But at present I have to sum the document values for each minute (mandatory), so whatever aggregation logic applies to the minutely data also gets applied to the hourly data.
Is there a way I can show both types of data in the same graph?
Mathematically, the approach is wrong.
Having n documents per minute (where n depends on the number of hosts in that cluster) and then one document per hour per type is illogical from a visualisation perspective: the value actually needed is the sum of all n documents generated per minute, so the Sum metric applied at the minutely level also gets applied to the hourly data. To accommodate both types of data in the same graph, the data needs to be uniform, so aggregate it at the minutely level on the producing end and send the pre-aggregated data to Elasticsearch.
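A minimal sketch of that pre-aggregation step, assuming the producer can buffer per-host values and flush a single summed document per minute (all names are illustrative, not part of the original setup):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class MinuteAggregator {
        // Running sums keyed by the epoch minute they belong to.
        private final Map<Long, Double> sums = new ConcurrentHashMap<>();

        public void record(long timestampMillis, double value) {
            long minute = timestampMillis / 60_000; // bucket by epoch minute
            sums.merge(minute, value, Double::sum);
        }

        // Called once a minute has ended; the returned sum becomes the one
        // document indexed into Elasticsearch for that minute.
        public Double flush(long minute) {
            return sums.remove(minute);
        }
    }

With one pre-summed document per minute and one per hour, the asker's own observation applies: an average or max metric then works for both series in the same graph.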

How is the total throughput value calculated in the Aggregate Report?

I discovered that in the Aggregate Report the TOTAL throughput value depends on the thread count. If we run tests with only one thread, total throughput is calculated as 1 / Total Average (multiplied by 1000 to convert milliseconds to seconds; see the screenshot below).
But when we set the thread count to 2 or more, total throughput is calculated in some other way. What I want to know is which formula is used for total throughput in this case (thread count > 1): it does not seem to be an average of the per-request throughputs, and it is not calculated as 1 / Total Average as in the first case. So how exactly does this work? (Screenshot for 2 threads attached below.)
Thanks.
Screenshot for 1 thread used:
aggregate_1_thread.png
Screenshot for 2 threads used:
aggregate_2_threads.png
As per the documentation:
http://jmeter.apache.org/usermanual/component_reference.html#Aggregate_Report
Throughput - the Throughput is measured in requests per second/minute/hour. The time unit is chosen so that the displayed rate is at least 1.0. When the throughput is saved to a CSV file, it is expressed in requests/second, i.e. 30.0 requests/minute is saved as 0.5.
So the result depends both on response times and on the number of threads, which influences those response times.
The total number of requests is divided by the time taken to run them; see:
https://github.com/apache/jmeter/blob/trunk/src/core/org/apache/jmeter/visualizers/SamplingStatCalculator.java#L198
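In other words, with more than one thread requests overlap in time, so throughput cannot be derived from the average response time alone. Here is a small worked sketch of the calculation the linked class performs (all numbers are made up for illustration):

    public class ThroughputExample {
        public static void main(String[] args) {
            // Say 2 threads together sent 200 requests over a 50-second test.
            long sampleCount = 200;              // total samples, all threads
            long firstSampleStartMillis = 0;
            long lastSampleEndMillis = 50_000;
            double elapsedSeconds =
                    (lastSampleEndMillis - firstSampleStartMillis) / 1000.0;

            // Throughput = total samples / elapsed wall-clock time.
            double throughput = sampleCount / elapsedSeconds;
            System.out.printf("Throughput: %.1f requests/second%n", throughput);

            // With 1 thread and no think time, elapsed time is roughly
            // sampleCount * averageResponseTime, so this reduces to
            // 1000 / Total Average, matching the single-thread observation.
        }
    }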

Bin packing parts of a dynamic set, considering last update

There is a large set of objects. The set is dynamic: objects can be added or deleted at any time. Let's call the total number of objects N.
Each object has two properties: mass (M) and time (T) of last update.
Every X minutes a small batch of those objects should be selected for processing, which updates their T to the current time. The total M of all objects in a batch is limited: not more than L.
I am looking to solve three tasks here:
find an algorithm for picking the next batch of objects;
introduce object classes: simple, priority (guaranteed to fit into at least every n-th batch) and frequent (fits into every batch);
forecast when system capacity will be exhausted (the time to add the next server, i.e. to increase L).
What kind of model best describes such a system?
The whole thing is about a service that processes the "objects" at time intervals. Each object should be "measured" every N hours, where N can vary within a range. X is fixed.
Objects are added/deleted by humans. N grows exponentially but rather slowly, with some spikes caused by publications. Of course the forecast can't be precise, just an estimate. M varies from 0 to 1E7 with an exponential distribution; most values are close to 0.
I see there can be several strategies here:
A. Full throttle: pack each batch as close to 100% as possible. As N grows, the average interval between hits on a particular object will grow.
B. Equal temperament :) : try to keep the average interval around some value. The batch fill level starts low and grows over time; when it gets close to 100%, it's time to get more servers.
C. ?
Here is a pretty complete design for your problem.
Your question does not quite match your description of the system this is for, so I'll assume that the description is accurate.
When you schedule a measurement, you should pass an object, the first time it can be measured, and the time you want the measurement to happen by. The object should have a weight attribute and a measured method. When the measurement happens, the measured method is called, and the difference between your classes is whether, and with what parameters, they reschedule themselves.
Internally you will need a couple of priority queues. See http://en.wikipedia.org/wiki/Heap_(data_structure) for details on how to implement one.
The first queue is ordered by the time a measurement can happen and holds all of the objects that can't be measured yet. Every time you schedule a batch, you use it to find the measurements that have become available.
The second queue holds the measurements that are ready to go now, ordered by the scheduling period they should happen by and then by weight; I would make both orderings ascending. You schedule a batch by pulling items off of that queue until you've got enough to send off.
Now you need to know how much to put in each batch. Given the system that you have described, a spike of events can be put in manually, but over time you'd like those spikes to smooth out. Therefore I would recommend option B, equal temperament. To do this, as you put each object into the "ready now" queue, calculate its "average work weight" as its weight divided by the number of periods until it is supposed to happen. Store that with the object, and keep a running total of the run rate you should be at. Every period, keep adding to the batch until one of three conditions has been met (see the sketch after this list):
You run out of objects.
You hit your maximum batch capacity.
You exceed 1.1 times your running total of your average work weight. The extra 10% is because it is better to use a bit more capacity now than to run out of capacity later.
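A minimal sketch of this scheduler, under one reading of the conditions above (the Measurement class and every name here are illustrative):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class BatchScheduler {

        static class Measurement {
            final double weight;
            final long earliestPeriod; // first period it may be measured in
            final long duePeriod;      // period it should be measured by
            double averageWorkWeight;  // weight / periods until due

            Measurement(double weight, long earliestPeriod, long duePeriod) {
                this.weight = weight;
                this.earliestPeriod = earliestPeriod;
                this.duePeriod = duePeriod;
            }
        }

        // Queue 1: objects that can't be measured yet, by earliest time.
        private final PriorityQueue<Measurement> notReady =
                new PriorityQueue<>(Comparator.comparingLong(m -> m.earliestPeriod));

        // Queue 2: ready objects, by due period, then weight (both ascending).
        private final PriorityQueue<Measurement> ready =
                new PriorityQueue<>(Comparator
                        .comparingLong((Measurement m) -> m.duePeriod)
                        .thenComparingDouble(m -> m.weight));

        private double runningWorkWeight = 0; // target run rate per period

        public void schedule(Measurement m) {
            notReady.add(m);
        }

        public List<Measurement> nextBatch(long currentPeriod, double capacityL) {
            // Move newly available measurements into the ready queue.
            while (!notReady.isEmpty()
                    && notReady.peek().earliestPeriod <= currentPeriod) {
                Measurement m = notReady.poll();
                long periodsLeft = Math.max(1, m.duePeriod - currentPeriod);
                m.averageWorkWeight = m.weight / periodsLeft;
                runningWorkWeight += m.averageWorkWeight;
                ready.add(m);
            }

            List<Measurement> batch = new ArrayList<>();
            double batchWeight = 0, batchWorkWeight = 0;
            while (!ready.isEmpty()) {                            // condition 1
                Measurement m = ready.peek();
                if (batchWeight + m.weight > capacityL) break;    // condition 2
                if (batchWorkWeight + m.averageWorkWeight
                        > 1.1 * runningWorkWeight) break;         // condition 3
                ready.poll();
                batch.add(m);
                batchWeight += m.weight;
                batchWorkWeight += m.averageWorkWeight;
            }
            runningWorkWeight -= batchWorkWeight; // this work is now done
            return batch;
        }
    }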
And finally, capacity planning.
For this you need some heuristic. Here is a reasonable one, which may need tweaking for your system. Maintain an array of your past 10 measurements of the running total of average work weight, and maintain an "exponentially damped average of your high water mark" by updating it each period according to the formula (the damping weights must sum to 1, hence 0.95 and 0.05):
average_high_water_mark
    = 0.95 * average_high_water_mark
    + 0.05 * max(last 10 running work weights)
If average_high_water_mark ever gets within, say, two servers' worth of capacity of your maximum, then add more servers. (The idea is that a server should be able to die without leaving you hosed.)
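A small sketch of that capacity heuristic, using the damping factors above (all names are illustrative):

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class CapacityPlanner {
        private final Deque<Double> last10 = new ArrayDeque<>();
        private double averageHighWaterMark = 0;

        // Call once per period with the current running total of work weight.
        public void update(double runningWorkWeight) {
            last10.addLast(runningWorkWeight);
            if (last10.size() > 10) last10.removeFirst();
            double recentMax = last10.stream()
                    .mapToDouble(Double::doubleValue).max().orElse(0);
            averageHighWaterMark = 0.95 * averageHighWaterMark + 0.05 * recentMax;
        }

        // True when the damped high-water mark is within two servers' worth
        // of capacity of the maximum, i.e. it is time to add servers.
        public boolean needMoreServers(double maxCapacity, double perServerCapacity) {
            return averageHighWaterMark >= maxCapacity - 2 * perServerCapacity;
        }
    }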
I think strategy A is good. Bin packing is about maximizing or minimizing an objective across bins, and here you only have one batch at a time. Sort the objects by m and n.

Understanding JMeter terms and results

I am using JMeter to test my web application on Tomcat. I just wanted to know the meaning of these terms in the simplest words: Deviation, Throughput, Average, Median, No. of Samples.
I have tested with:
Number of Threads (Users): 1000
Ramp-up Period: 1
Loop Count: 1
No extra settings.
I am attaching the pics for reference. Can anyone tell whether the result is good or not?
No. of Samples: the total number of requests sent to the server during the test.
Average: the mathematical average of the response times. This is the number quoted as the average response time of your HTTP service.
Median: the mathematical median of the response times. Arrange the response times in order and take the middle value. It should be as close to the average as possible.
Deviation: the mathematical standard deviation of the response times. This shows how much the response time varies; higher values mean trouble.
Throughput: simply the number of requests processed per second or minute.
Ideally, your average, max and min response times would all be the same. Of course that is not practical, so you aim to keep the deviation as low as possible. High deviation generally indicates system stress, unless you are doing some kind of exponential backoff. Your min and max values show a very large difference and your deviation is far too high; a simple HTTP service should have similar min and max response times.
In summary, your JMeter test result really looks scary to me, and it leads me to believe you either ran the test and the server on the same machine, overloading it, or the code is really buggy and gets bogged down under load.
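For intuition, here is a small sketch computing these statistics from a list of response times (the sample values and test duration are made up):

    import java.util.Arrays;

    public class JMeterTermsExample {
        public static void main(String[] args) {
            long[] responseTimesMs = {120, 150, 130, 900, 140}; // made-up samples
            long testDurationMs = 2_000;

            int samples = responseTimesMs.length;
            double average = Arrays.stream(responseTimesMs).average().orElse(0);

            long[] sorted = responseTimesMs.clone();
            Arrays.sort(sorted);
            long median = sorted[samples / 2]; // middle value (odd count here)

            double variance = Arrays.stream(responseTimesMs)
                    .mapToDouble(rt -> (rt - average) * (rt - average))
                    .average().orElse(0);
            double deviation = Math.sqrt(variance);

            double throughput = samples / (testDurationMs / 1000.0);

            System.out.printf(
                    "Samples=%d Average=%.1f Median=%d Deviation=%.1f Throughput=%.1f/s%n",
                    samples, average, median, deviation, throughput);
        }
    }

Note how the single 900 ms outlier barely moves the median but inflates both the average and the deviation, which is why a deviation on the order of the average response time is a warning sign.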
