Spring Boot business metrics: Micrometer & Prometheus or InfluxDB

In our system we need to calculate and visualize a couple of business metrics, such as:
total number of transactions processed over the last configured time interval
average processing time over the last configured time interval
max processing time over the last configured time interval
min processing time over the last configured time interval
I have to expose these metrics somehow from a Spring Boot application.
I checked whether it is possible to calculate this type of metric at the application level (Spring Boot) using the Micrometer library built into Spring Boot Actuator. Unfortunately, I don't see a Meter that can calculate the average or minimum value of a particular method's execution time over a configured time interval.
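For context, the closest thing Micrometer offers out of the box is a Timer with a sliding-window max: count and total time are cumulative, so a windowed average or minimum is indeed not directly available. A minimal sketch (the meter name and window length are assumptions):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class TransactionTimer {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // max() decays over the configured expiry window;
        // count() and totalTime() are cumulative since startup
        Timer timer = Timer.builder("transaction.processing") // hypothetical name
                .distributionStatisticExpiry(Duration.ofMinutes(5))
                .register(registry);

        timer.record(() -> { /* business transaction goes here */ });

        System.out.println("count (cumulative): " + timer.count());
        System.out.println("max over window (ms): " + timer.max(TimeUnit.MILLISECONDS));
    }
}

A backend like Prometheus can then derive the interval average from the cumulative count and total time (e.g. with rate()), which is one way around the missing windowed mean.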
Using Prometheus also doesn't seem to be the best idea, because it works on a pull-based model. It seems to me that this may render the results inaccurate and delayed because of the scraping intervals.
My last idea is to write each transaction's processing time to InfluxDB or a similar DB and then use queries to get the results I need (the business metrics). However, I am worried about the efficiency of this solution, as it adds extra time to each business transaction.
What do you think? Am I right about the limitations of Micrometer? Does the InfluxDB idea sound reasonable? Is there perhaps another way to approach this problem?

Related

How to calculate Latency and Bandwidth in Java Microbenchmark tests

We are planning to implement microbenchmark tests for our Java-based Spring Boot project, for which we plan to leverage JMH.
We need to validate latency and bandwidth metrics and are not sure how to fetch these from JMH.
Can someone suggest sample code or a reference?
First of all, JMH measures throughput, not bandwidth.
Here's the example of measuring costs of Spring Boot-based application start-up: https://github.com/stsypanov/spring-boot-benchmark
And here are various examples of Spring-related measurements with JMH: https://github.com/stsypanov/spring-benchmark (e.g. one benchmark there measures the cost of getting a new prototype-scoped bean from the application context). As for your question: if you want to measure latency (i.e. the average execution time of the benchmarked method), use the mentioned benchmark as is. If you want to measure throughput, just change
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
to
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
The second annotation set tells JMH to print how many times the method can be called within a second.
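For reference, a minimal self-contained sketch putting both modes side by side; doWork is a hypothetical stand-in for the code you actually want to measure:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class LatencyAndThroughputBenchmark {

    // Mode.AverageTime reports the mean time per call, i.e. latency
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public double latency() {
        return doWork();
    }

    // Mode.Throughput reports how many calls complete per time unit
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public double throughput() {
        return doWork();
    }

    // returning the result prevents JMH from dead-code-eliminating the work
    private double doWork() {
        double acc = 0;
        for (int i = 0; i < 1_000; i++) {
            acc += Math.sqrt(i);
        }
        return acc;
    }
}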

How to determine accurate request count in a time range with Spring Boot + Prometheus + Grafana

I just started trying to integrate Micrometer, Prometheus and Grafana into my microservices. At first glance it is very easy to use, and there are many existing dashboards you can rely on. But the more I test, the more confusing it gets. Maybe I don't understand the main idea behind this technology stack.
I would like to start my custom Grafana dashboard by showing the number of requests per endpoint for the selected time range (as a single stat), but I am not able to find the right query for that (and I am not sure it exists).
I tried different queries:
http_server_requests_seconds_count{uri="/users"}
Which always shows the current value. For example, if I sent 10 requests 30 minutes ago, this query will still return the value 10 when I change the time range to the last 5 minutes (even though no request entered the system during the last 5 minutes).
When I use
increase(http_server_requests_seconds_count{uri="/users"}[$__range])
the query does not return an accurate value, but only something close to the actual request count. At least it behaves correctly for a time range that doesn't include any new incoming requests; in that case the query returns 0.
So my question is: is there a way to use this technology stack to get the number of new requests for a selected period of time?
For the sake of performance when operating with millions of time series, many Prometheus functions return approximate and/or interpolated values. For example, the increase() function is basically a per-second rate() multiplied by the number of seconds in the interval, extrapolated to the window boundaries. With such a formula and possibly missing data points, an exact result is the exception rather than the norm: a counter that grew by exactly 10 inside the window may well be reported as 9.7 or 10.3.
The reason is that Prometheus trades accuracy for performance and reliability. It doesn't really matter if your server's actual CPU usage is 86.3% instead of 86.4%, but it does matter that you can get this information instantly. Prometheus even has this statement in its docs:
Prometheus values reliability. You can always view what statistics are available about your system, even under failure conditions. If you need 100% accuracy, such as for per-request billing, Prometheus is not a good choice as the collected data will likely not be detailed and complete enough. In such a case you would be best off using some other system to collect and analyze the data for billing, and Prometheus for the rest of your monitoring.
That being said, if you really need accurate values, consider using something else. For example, you can store logs and count lines (Grafana Loki, the Elastic Stack), or write each request to a traditional database and retrieve the counts with your own solution.

Microservices - Connection Pooling when connecting to a single legacy database

I am working on developing micro services for a monolithic application using spring boot + spring cloud + spring JDBC.
Currently, the application connects to a single database through a Tomcat JNDI connection pool.
We have a constraint here: we cannot change the database architecture at this point in time for various reasons, such as the large number of DB objects, tight dependencies with other systems, etc.
So we have isolated the microservices based on application features. My concern is that if each microservice has its own connection pool, the number of connections to the database can increase exponentially.
Currently, I am thinking of two solutions:
To calculate the number of connections currently used by each application feature and arrive at max/min connection params per service, which is a very tedious process; besides, we don't have any mechanism to get the connection count per app feature.
To develop a data-microservice with a single connection pool, which gets a query object from the other microservices, runs the query against the database and returns the result set to the caller.
I am not sure whether the second approach is a best practice in a microservices architecture.
Can you please suggest any other standard approaches that could help in the current situation?
It's all about the tradeoffs.
To calculate the number of connections currently used by each application feature and arrive at max/min connection params per service.
Cons: As you said, some profiling and guesswork is needed to reach the sweet spot of connections per app feature.
Pros: Unlike the second approach, you avoid the extra performance overhead.
To develop a data-microservice with a single connection pool, which gets a query object from the other microservices, runs the query against the database and returns the result set to the caller.
Pros: Minimal work upfront.
Cons: One more layer and, in turn, one more point of failure. Performance will degrade, as you have to deal with serialization -> HTTP(S) network latency -> deserialization -> (the JDBC fun stuff, which is part of either approach) -> serialization -> HTTP(S) network latency -> deserialization. (In your case this performance cost may be negligible, but if every millisecond counts in your service, then this is a huge deciding factor.)
In my opinion, I wouldn't split the application layer alone until I have analyzed my domains and my datastores.
This is a good read: http://blog.christianposta.com/microservices/the-hardest-part-about-microservices-data/
I am facing a similar dilemma at my work and I can share the conclusions we have reached so far.
There is no silver bullet at the moment, so:
1 - Calculating the number of connections by dividing the total desired number of connections among the instances of a microservice works well if your microservices don't need to scale elastically.
2 - Not having a pool at all and letting connections be opened on demand. This is what serverless platforms (like AWS Lambda) use. It reduces the total number of open connections, but the downside is that you lose performance, since opening connections on the fly is expensive.
You could implement some sort of topic that lets your services know that the number of instances has changed, with a listener that updates the total connection count, but it is a complex solution and goes against the microservice principle that you should not change a service's configuration after it has started running.
Conclusion: I would calculate the number if the microservice does not tend to grow in scale, and go without a pool if it needs to scale elastically and exponentially; in the latter case, make sure a retry is in place for when a connection is not obtained on the first attempt.
There is an interesting grey area here, awaiting a better way of controlling connection pools in microservices.
In the meantime, and to make the problem even more interesting, I recommend reading the About Pool Sizing article from HikariCP: https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing
The ideal number of concurrent connections to a database is actually smaller than most people think.
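The rule of thumb from that article is roughly pool size = (core count * 2) + effective spindle count, measured on the database server. A sketch of applying it with HikariCP; the JDBC URL, core count and spindle count are assumptions:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class LegacyDbPool {
    public static void main(String[] args) {
        int dbCores = 8;           // core count of the database server (assumed)
        int effectiveSpindles = 1; // a single SSD/disk (assumed)
        // rule of thumb from the About Pool Sizing article
        int poolSize = dbCores * 2 + effectiveSpindles;

        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://legacy-db:5432/app"); // hypothetical URL
        config.setMaximumPoolSize(poolSize);
        config.setMinimumIdle(poolSize); // fixed-size pool avoids resizing churn

        try (HikariDataSource dataSource = new HikariDataSource(config)) {
            // each instance of a microservice shares this one small pool
        }
    }
}

Note that with several microservices sharing one legacy database, that budget has to be split across all pools, so each service gets only a fraction of the number.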

AppDynamics Custom DropWizard Percentiles Metrics Rollout

We have a cluster of instances, where each instance has a DropWizard metrics gatherer.
We are also trying to leverage AppDynamics custom metrics: a custom script hits the endpoint exposed by DropWizard (/metrics) and sends the metrics of interest to the AppDynamics Controller.
AppDynamics has 2 cluster rollout strategies for how a metric is displayed in the whole-application (tier) view: SUM and AVG.
While this works well for things like counts (sum is used) and average processing times (avg is used), we so far have no idea how to aggregate the per-instance percentiles exposed by DropWizard: neither sum nor avg looks correct.
Example:
instance1: p75=400
instance2: p75=600
instance3: p75=800
sum will give 1700, which of course isn't useful at all.
avg will give 600, which isn't correct either: we lose track of the upper bound.
If AppDynamics had a MAX cluster rollout, that would be more or less fair, though still not correct. But AppDynamics doesn't have that.
We also understand that the only fully correct way of gathering cluster-wide percentiles is to perform the aggregation from all nodes in one place (e.g. Logstash, etc.) and not on each instance. But for now, what we have is just sending custom metrics periodically; a sketch of the "aggregate in one place" idea is shown below.
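To illustrate why central aggregation is the only correct route, here is a sketch using HdrHistogram with made-up data: merging the raw per-instance histograms and then reading p75 yields the true cluster percentile, while averaging per-instance p75 values drops the upper tail.

import org.HdrHistogram.Histogram;

public class PercentileMerge {
    public static void main(String[] args) {
        // per-instance histograms of raw latencies (illustrative values, ms)
        Histogram instance1 = new Histogram(3);
        Histogram instance2 = new Histogram(3);
        for (int i = 0; i < 1000; i++) {
            instance1.recordValue(300 + (i % 200));     // roughly 300-499 ms
            instance2.recordValue(500 + (i % 200) * 2); // roughly 500-898 ms
        }

        // correct cluster-wide percentile: merge the raw distributions first
        Histogram cluster = new Histogram(3);
        cluster.add(instance1);
        cluster.add(instance2);
        System.out.println("cluster p75 = " + cluster.getValueAtPercentile(75.0));

        // contrast: averaging per-instance p75s understates the upper tail
        double avgOfP75 = (instance1.getValueAtPercentile(75.0)
                + instance2.getValueAtPercentile(75.0)) / 2.0;
        System.out.println("avg of p75s = " + avgOfP75);
    }
}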
It would be great if anyone could suggest something regarding this.
Thanks in advance.

Why use statsd when graphite's Carbon aggregator can do the same job?

I have been exploring the Graphite graphing tool for showing metrics from multiple servers, and it seems that the 'recommended' way is to send all metrics data to StatsD first. StatsD aggregates the data and sends it to graphite (or rather, Carbon).
In my case, I want to do simple aggregations like sum and average on metrics across servers and plot that in graphite. Graphite comes with a Carbon aggregator which can do this.
StatsD does not even provide aggregation of the kind I am talking about.
My question is - should I use statsd at all for my use case? Anything I am missing here?
StatsD operates over UDP, which removes the risk of carbon-aggregator.py being slow to respond and introducing latency in your application. In other words, loose coupling.
StatsD supports sampling of inbound metrics, which is useful when you don't want your aggregator to take 100% of all data points to compute descriptive statistics. For high-volume code sections, it is common to use 0.5%-1% sample rates so as to not overload StatsD.
StatsD has broad client-side support.
tldr: you will probably want statsd (or carbon-c-relay) if you ever want to look at the server-specific sums or averages.
Carbon aggregator is designed to aggregate values from multiple metrics together into a single output metric, typically to increase graph rendering performance. StatsD is designed to aggregate multiple data points in a single metric, because otherwise Graphite only stores the last value reported within the minimum storage resolution.
StatsD example:
Assume that your Graphite storage-schemas.conf file has a minimum retention of 10 seconds (the default) and your application is sending approximately 100 data points every 10 seconds to services.login.server1.count with a value of 1. Without StatsD, Graphite would only store the last count received in each 10-second bucket; after the 100th message is received, the other 99 data points would have been thrown out. However, if you put StatsD between your application and Graphite, it will sum all 100 data points together before sending the total to Graphite. So, without StatsD your graph only indicates whether a login occurred during the 10-second interval; with StatsD, it indicates how many logins occurred during that interval.
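To make the sending side concrete, a sketch assuming the com.timgroup java-statsd-client; the metric names mirror the example above:

import com.timgroup.statsd.NonBlockingStatsDClient;
import com.timgroup.statsd.StatsDClient;

public class LoginMetrics {
    public static void main(String[] args) {
        // fire-and-forget UDP to the local StatsD daemon
        StatsDClient statsd = new NonBlockingStatsDClient(
                "services.login.server1", "localhost", 8125);

        // each login sends one data point; StatsD sums them per flush interval
        statsd.incrementCounter("count");

        // timers get mean/upper/lower/percentiles computed by StatsD
        statsd.recordExecutionTime("response.time", 42);

        statsd.stop();
    }
}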
Carbon aggregator example: assume you have 200 different servers reporting 200 separate metrics (services.login.server1.response.time, services.login.server2.response.time, etcetera). On your operations dashboard you show a graph of the average across all servers using this Graphite query: weightedAverage(services.login.server*.response.time, services.login.server*.response.count, 2). Unfortunately, rendering this graph takes 10 seconds. To solve this problem, you can add a carbon aggregator rule to pre-calculate the average across all your servers and store the value in a new metric. Now you can update your dashboard to simply pull a single metric (e.g. services.login.response.time). The new metric renders almost instantly.
Side notes:
The aggregation rules in storage-aggregation.conf apply to all storage intervals in storage-schemas.conf except the first (smallest) retention period for each retention string. It is possible to use carbon-aggregator to aggregate data points within a metric for that first retention period. Unfortunately, aggregation-rules.conf uses "glob" patterns rather than regex patterns, so you need to add a separate aggregation-rules.conf entry for every path depth and aggregation type. The advantage of StatsD is that the client sending the metric can specify the aggregation type rather than encoding it in the metric path. That gives you the flexibility to add a new metric on the fly regardless of metric path depth. If you wanted to configure carbon-aggregator to do StatsD-like aggregation automatically when you add a new metric, your aggregation-rules.conf file would look something like this:
<n1>.avg (10)= avg <n1>.avg$
<n1>.count (10)= sum <n1>.count$
<n1>.<n2>.avg (10)= avg <n1>.<n2>.avg$
<n1>.<n2>.count (10)= sum <n1>.<n2>.count$
<n1>.<n2>.<n3>.avg (10)= avg <n1>.<n2>.<n3>.avg$
<n1>.<n2>.<n3>.count (10)= sum <n1>.<n2>.<n3>.count$
...
<n1>.<n2>.<n3> ... <n99>.count (10)= sum <n1>.<n2>.<n3> ... <n99>.count$
Notes: the trailing "$" is not needed in Graphite 0.10+ (currently pre-release); see the relevant patch on GitHub and the standard documentation on aggregation rules.
The weightedAverage function is new in Graphite 0.10, but generally the averageSeries function will give a very similar number as long as your load is evenly balanced. If you have some servers that are both slower and service fewer requests, or you are just a stickler for precision, then you can still calculate a weighted average with Graphite 0.9; you just need to build a more complex query like this:
divideSeries(sumSeries(multiplySeries(a.time,a.count), multiplySeries(b.time,b.count)),sumSeries(a.count, b.count))
If StatsD is run on the client box, this also reduces network load. In theory, you could run carbon-aggregator on the client side too. However, if you use one of the StatsD client libraries, you can also use sampling to reduce the load on your application machine's CPU (e.g. from creating loopback UDP packets). Furthermore, StatsD can automatically perform multiple different aggregations on a single input metric (sum, mean, min, max, etcetera).
If you use StatsD on each app server to aggregate response times, and then re-aggregate those values on the Graphite server using carbon aggregator, you end up with an average response time weighted by app server rather than by request. Obviously, this only matters when aggregating with a mean or top_90 aggregation rule, and not min, max or sum. Also, for the mean it only matters if your load is unbalanced. As an example: assume you have a cluster of 100 servers, and suddenly 1 server is sent 99% of the traffic. Consequently, the response times quadruple on that 1 server but remain steady on the other 99 servers. If you use client-side aggregation, your overall metric would only go up by about 3% ((99 * 1 + 1 * 4) / 100 = 1.03). But if you do all your aggregation in a single server-side carbon aggregator, then your overall metric would go up by about 300% (0.99 * 4 + 0.01 * 1 = 3.97).
carbon-c-relay is essentially a drop-in replacement for carbon-aggregator, written in C. It has improved performance and regex-based matching rules. The upshot is that you can do both StatsD-style data point aggregation and carbon-relay-style metric aggregation, plus other neat stuff like multi-layered aggregation, all in the same simple regex-based config file.
If you use the Cyanite back-end instead of carbon-cache, then Cyanite will do the intra-metric averaging for you in memory (as of version 0.5.1) or at read time (in the version <0.1.3 architecture).
If the Carbon aggregator offers everything you need, there is no reason not to use it. It has two basic aggregation functions (sum and average), and indeed these are not covered by StatsD. (I'm not sure about the history, but maybe the Carbon aggregator already existed and the StatsD authors did not want to duplicate features?) Receiving data via UDP is also supported by Carbon, so the only thing you would miss would be the sampling, which does not matter if you aggregate by averaging.
StatsD supports different metric types by adding extra aggregate values (e.g. for timers: mean, lower, upper and upper Xth percentile, ...). I like them, but if you don't need them, the Carbon aggregator is a good way to go too.
I have been looking at the source code of the Carbon aggregator and StatsD (and Bucky, a StatsD implementation in Python), and they are all so simple, that I would not worry about resource usage or performance for either choice.
It looks like carbon aggregator and StatsD support disjoint sets of features:
StatsD supports rate calculation and summation, but does not support averaging values.
Carbon aggregator supports averaging, but does not support rate calculation.
Because Graphite has a minimum resolution, you cannot save two different values for the same metric within the defined interval. StatsD solves this problem by pre-aggregating them: instead of saying "1 user registered now" and "1 user registered now", it says "2 users registered".
The other reason is performance:
You send data to StatsD via UDP, which is a fire-and-forget, stateless and much faster protocol.
Etsy's StatsD implementation is in Node.js, which also increases performance a lot.
