How to determine accurate request count in a time range with Spring Boot + Prometheus + Grafana - spring-boot

I just started integrating Micrometer, Prometheus and Grafana into my microservices. At first glance it is very easy to use, and there are many existing dashboards you can rely on. But the more I test, the more confusing it gets. Maybe I don't understand the main idea behind this technology stack.
I would like to start my custom Grafana dashboard by showing the number of requests per endpoint for the selected time range (as a single stat), but I am not able to find the right query for that (and I am not sure it exists).
I tried different queries:
http_server_requests_seconds_count{uri="/users"}
This always shows the current value. For example, if I sent 10 requests 30 minutes ago, this query still returns 10 when I change the time range to the last 5 minutes (even though no requests entered the system during the last 5 minutes).
When I use
increase(http_server_requests_seconds_count{uri="/users"}[$__range])
the query does not return the exact value but something close to the actual number of requests. At least it works for a time range that doesn't include any new incoming requests: in that case the query returns 0.
So my question is: is there a way to use this technology stack to get the number of new requests for the selected period of time?

For the sake of performance when operating on millions of time series, many Prometheus functions return approximate and/or extrapolated values. For example, the increase() function is basically the per-second rate() multiplied by the number of seconds in the interval. With such a formula and possibly missing data points, an exact result is the exception rather than the rule.
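To see why the result is rarely a whole number, here is a simplified sketch of that formula. It is not Prometheus's actual implementation (which applies extra heuristics at the window boundaries and handles counter resets); the function name `approx_increase` and the sample data are made up for illustration.

```python
def approx_increase(samples, window_seconds):
    """Simplified model of increase(): the per-second rate between the first
    and last sample in the window, multiplied by the window length.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    """
    if len(samples) < 2:
        return 0.0  # not enough points to compute a rate
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    per_second_rate = (v_last - v_first) / (t_last - t_first)
    return per_second_rate * window_seconds


# Hypothetical scrapes every 10 s inside a 60 s window. The counter grew by
# 10 between the first and last scrape, but those scrapes only span 50 s.
scrapes = [(5, 100), (15, 102), (25, 104), (35, 106), (45, 108), (55, 110)]
estimate = approx_increase(scrapes, window_seconds=60)
print(estimate)  # about 12, although only 10 requests were observed
```

The observed growth is scaled from the 50 seconds covered by samples up to the full 60 second window, which is where the fractional or slightly-off values come from.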
The reason is that Prometheus trades accuracy for performance and reliability. It doesn't really matter whether your server's actual CPU usage is 86.3% or 86.4%, but it does matter whether you can get this information instantly. Prometheus even has this statement in its docs:
Prometheus values reliability. You can always view what statistics are available about your system, even under failure conditions. If you need 100% accuracy, such as for per-request billing, Prometheus is not a good choice as the collected data will likely not be detailed and complete enough. In such a case you would be best off using some other system to collect and analyze the data for billing, and Prometheus for the rest of your monitoring.
That being said, if you really need accurate values, consider using something else. You can, for example, store logs and count lines (Grafana Loki, the Elastic Stack), or write and retrieve this information from a traditional database with your own solution.
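As a rough illustration of the "count log lines" idea, here is a minimal sketch. The log format (ISO timestamp, method, URI, status separated by spaces) and the function name `count_requests` are assumptions for the example, not something Loki or the Elastic Stack prescribes.

```python
from datetime import datetime


def count_requests(log_lines, uri, start, end):
    """Count log entries for `uri` whose timestamp falls in [start, end)."""
    count = 0
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip malformed lines
        timestamp = datetime.fromisoformat(parts[0])
        if parts[2] == uri and start <= timestamp < end:
            count += 1
    return count


# Hypothetical access log: "<ISO timestamp> <method> <uri> <status>"
log = [
    "2024-01-01T10:00:05 GET /users 200",
    "2024-01-01T10:00:07 GET /orders 200",
    "2024-01-01T10:20:00 GET /users 200",
    "2024-01-01T10:31:00 GET /users 500",
]
total = count_requests(
    log, "/users", datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)
)
print(total)  # 2
```

Because every request produces exactly one line, the count is exact for any time range, at the cost of storing and scanning the individual events.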

Related

Grafana singlestat dashboard based on previous average number

I have a dashboard that shows a sum of requests/sec from a Windows performance monitor, collected by Prometheus.
sum(Total_Query_Received_persec)
I would like to see any issues right away if the requests/sec drop (which would indicate an issue).
So the singlestat panel should change color: if the number of requests/sec is 50% less than the same number collected 10 minutes ago (for example), turn the panel yellow, and if it is 80% less than 10 minutes ago, turn it red.
I know that you can configure this based on thresholds, but I am not sure if there is a way to query that info from the metric.
Is this possible at all?
Thanks
I'm not familiar enough with Grafana to provide all the details of handling the color change with that tool, but within Prometheus the query you are interested in can likely be handled with the irate() function. It's only recommended for fast-moving counters, and the documentation mentions that you should apply irate() inside a sum() (rate first, then aggregate) to keep the aggregation from hiding the volatility.
You might also get perfectly acceptable performance and results from aggregating the detail with rate() directly, such as rate(Total_Query_Received_persec[10m]). Note that the range selector goes inside the parentheses.
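The thresholds from the question (yellow at a 50% drop, red at an 80% drop compared with 10 minutes earlier) can be sketched as a small function. This only illustrates the comparison logic; the function name `panel_color` is hypothetical and this is not how Grafana is configured internally.

```python
def panel_color(current, ten_minutes_ago):
    """Map the drop relative to the earlier value to a panel color."""
    if ten_minutes_ago <= 0:
        return "green"  # no baseline to compare against
    drop = 1 - current / ten_minutes_ago  # 0.5 means "50% less than before"
    if drop >= 0.8:
        return "red"
    if drop >= 0.5:
        return "yellow"
    return "green"


print(panel_color(90, 100))  # green
print(panel_color(40, 100))  # yellow
print(panel_color(10, 100))  # red
```

In PromQL the same ratio can be obtained with the offset modifier, for example sum(Total_Query_Received_persec) / sum(Total_Query_Received_persec offset 10m), and the panel thresholds can then be set on that ratio.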

How many concurrent users does your appliance successfully serve? - GSA QPS

Hi fellow GSA developers,
Just wanted to know, in your experience: what model of GSA are you using, how much concurrent search request load does your appliance serve successfully, and how many documents do you have in total?
I know each and every environment is different, but one can scale the figures proportionally and get a sense of the capability of the GSA black box.
I'm calling the GSA a black box since you can never find out the physical memory or any other hardware spec, nor can you change it. The only way to scale is to buy more boxes :)
Note: The question is about the GSA as a search engine and not from the portal perspective. That is, I'm only concerned about the GSA's QPS rather than a custom portal's QPS, since custom portals are custom and only as good as their design.
We use two GSAs with software version 7.2, arranged in a GSA^n "cluster". The index holds ca. 600,000 documents, and as all of them are protected, the GSA has to spend quite a lot of effort on determining which user is allowed to see which document.
Each of the two GSAs is guaranteed to perform 50 queries per second. We once did a load test, and as some of the queries completed in less than a second and thereby freed up the "slot" for incoming queries, we were able to process 140 queries per second for a noticeably long time.
99% of our queries complete in less than a second, and as we have a rather complex permission structure (users with lots of group memberships), I would say this is a good result.
As @BigMikeW already said: to get your own figures you should do a load test. Google Support provides a script which can exhaust the GSA and tell you at which QPS rate it started failing (it will simply return an HTTP status code in the 500 range).
And speaking of "black box": you are able to find out the hardware specs. All of the GSAs I have seen so far (T3 and T4) have a Dell service tag. When you enter that tag at Dell, you will find out what is inside the box. But that's pointless, because you can't modify any of it ;-) It only becomes interesting if you use a GSA model that can be repurposed.
This depends on a lot of factors beyond just what model/version you have.
Are the requests part of an already authenticated session?
Are you using early or late binding?
How many authentication mechanisms are you using?
What's the flex authz rule order?
What's the permit/deny ratio for the results?
Any numbers you get in response to this question will have no real meaning to any other environment. My advice would be load test your own environment and use those results for capacity planning.
With the latest software, the GSA has 50 threads dedicated to search responses. This means that it can be responding to 50 requests at any given time. If the searches take 0.5 seconds on average, you can average about 100 qps.
If they take longer, you'll see this reduced. The GSA will also queue up a few requests before responding with the appropriate HTTP response saying the server is overloaded.
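The arithmetic behind those numbers is simply the thread count divided by the average time a query occupies a thread. A minimal sketch (the function name `max_sustained_qps` is made up for the example, and queueing effects are ignored):

```python
def max_sustained_qps(worker_threads, avg_seconds_per_query):
    """Upper bound on throughput when each query holds one thread."""
    return worker_threads / avg_seconds_per_query


print(max_sustained_qps(50, 0.5))  # 100.0
print(max_sustained_qps(50, 1.0))  # 50.0
print(max_sustained_qps(50, 2.0))  # 25.0
```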

Bulk insert vs Single insert

The primary dev managing our ES cluster has stated that single document loads to ES will only give us roughly 30 to 40 creations a second, whereas bulk operations will give us more in the range of 1,000+. I realize that bulk is generally faster and that there are hardware / environment constraints to any process. However, with other technologies you do not pay such a heavy price for single insertions. I am obviously ignorant when it comes to ES. Why do you pay such a heavy price for document writes in ES? Or are we just not properly informed?
Environment:
Apache Storm writes to our ES cluster
Currently all of the writes are processed in bulk operations.
What you have to take into account is the round-trip time between your loader and your cluster. Setting up an HTTP connection, transferring the data, and then waiting for a response can take a while -- in this case it seems to be taking you about 30 ms. Elasticsearch has to set up a parser for your request, hand it off to the node that is really going to do the work, and then generate the response back to you.
By using the bulk API, you remove a lot of back and forth -- ES can group together inserts going to the same node, doesn't have to instantiate a new parser for every request, etc.
HTTP Connection pooling for single requests would help, but doing bulk inserts/updates/deletes is always going to be faster in the long run.
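A back-of-the-envelope model shows how a 30 ms round trip alone explains the 30 to 40 documents per second figure. This is only a sketch for a single sequential client; the per-document processing cost and the batch size of 500 are made-up values, not measurements.

```python
def docs_per_second(round_trip_seconds, docs_per_request, per_doc_seconds=0.0):
    """Throughput of one client that sends requests one after another."""
    request_time = round_trip_seconds + docs_per_request * per_doc_seconds
    return docs_per_request / request_time


# One document per request: the round trip dominates.
single = docs_per_second(0.030, 1)
# Hypothetical bulk request of 500 documents at 0.5 ms of work each.
bulk = docs_per_second(0.030, 500, per_doc_seconds=0.0005)

print(round(single))  # 33
print(round(bulk))    # 1786
```

With bulk requests the fixed round-trip cost is paid once per batch instead of once per document, so throughput is limited mostly by the per-document work.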
Bulk indexing is indeed way faster, but it is not as bad as your system admin suggests. Elasticsearch has gotten a lot better at this stuff over the past two years.
We're able to do hundreds of inserts/updates per second without bulking requests. Most inserts take around 1 ms (including sending the HTTP request and receiving the response). If insert speed becomes an issue, you can back off on the index refresh interval (default 1s). Also, you can use multiple threads to insert. Bulk inserts can get into the tens of thousands, depending on how complex your mappings are.
You definitely want http connection pooling (true when using any kind of webservice in anger) or even better, run an embedded elasticsearch node. Another alternative is to run an elasticsearch node on localhost if you don't want to do an embedded node. That way, all http traffic is on localhost.
Finally, if you need to support more concurrent writes, you can always increase the number of shards and nodes. These numbers are not set in stone. If you need tens of thousands of writes per second, it should be possible to engineer a cluster that can do it. It will probably require a lot of tuning and hardware, and you should probably not do this unless you have a really good reason to do so. However, the whole point of Elasticsearch is horizontal scalability.

Metrics don't decay when no values are reported

I am using Codahale Metrics for monitoring purposes. Let's say there is a spike in latency at some point, and later no values are reported because there is no traffic: the value in the graph stays as is (I am using a histogram). At times this gives the impression that the spike persists and needs to be addressed, when it actually means that no values have been reported since and hence the graph doesn't decay. Am I missing a config parameter here, or is this behaviour expected?
The way we update the metrics is
metrics.processingTime.update(processingTime);
So, when there is no traffic, we don't update this metric.
I know that the histogram takes data points from the past (over an irregular period of time) into consideration in order to present a statistical picture of the data.
When there are no new data points, only the outlier is taken into consideration, and it keeps being reported over and over.
The meters behave the same way, displaying the data through moving averages over 1, 5, and 15 minutes.
The solution in the histogram case is to use HdrHistogram and flush it periodically.
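The effect of flushing periodically can be illustrated with a toy histogram that is emptied every time it is reported. This is a hypothetical stand-in written for this example, not the Codahale or HdrHistogram API, and it is not thread-safe.

```python
class ResettingHistogram:
    """Toy histogram that forgets its values each time it is reported."""

    def __init__(self):
        self._values = []

    def update(self, value):
        self._values.append(value)

    def snapshot_and_reset(self):
        """Return the max of the values seen since the last report, or None
        if nothing was recorded, and start a fresh reporting interval."""
        values, self._values = self._values, []
        return max(values) if values else None


histogram = ResettingHistogram()
histogram.update(5)
histogram.update(900)  # latency spike
first_report = histogram.snapshot_and_reset()
second_report = histogram.snapshot_and_reset()  # no traffic in between

print(first_report)   # 900
print(second_report)  # None
```

Because the second interval has no data, the reporter can publish an empty value (or nothing at all) instead of repeating the old spike.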

Spreading out data from bursts

I am trying to spread out data that is received in bursts. This means I have data that is received by some other application in large bursts. For each data entry I need to do some additional requests on some server, and I should limit the traffic to that server. Hence I try to spread out the requests over the time that I have until the next data burst arrives.
Currently I am using a token bucket to spread out the data. However, because the data I receive is already badly shaped, I am still either filling up the queue of pending requests, or I get spikes whenever a burst comes in. So this algorithm does not seem to do the kind of shaping I need.
What other algorithms are available to limit the requests? I know I have times of high load and times of low load, so both should be handled well by the application.
I am not sure if I was really able to explain the problem I am currently having. If you need any clarifications, just let me know.
EDIT:
I'll try to clarify the problem some more and explain why a simple rate limiter does not work.
The problem lies in the bursty nature of the traffic and the fact that bursts have different sizes at different times. What is mostly constant is the delay between bursts. Thus we get a bunch of data records for processing, and we need to spread them out as evenly as possible before the next bunch comes in. However, we are not 100% sure when the next bunch will come in, just approximately, so simply dividing the time by the number of records does not work as it should.
Rate limiting does not work because it does not spread the data out sufficiently. If we are close to saturating the rate, everything is fine and we spread out evenly (although this should not happen too frequently). If we are below the threshold, though, the spreading gets much worse.
Here is an example to make the problem clearer:
Let's say we limit our traffic to 10 requests per second and new data comes in about every 10 seconds.
When we get 100 records at the beginning of a time frame, we will query 10 records each second and we have a perfectly even spread. However, if we get only 15 records, we'll have one second where we query 10 records, one second where we query 5 records and 8 seconds where we query 0 records, so we have very unequal levels of traffic over time. Instead it would be better if we just queried 1.5 records each second. However, setting this rate would also cause problems, since new data might arrive earlier, so we would not have the full 10 seconds and 1.5 queries per second would not be enough. If we use a token bucket, the problem actually gets even worse, because token buckets allow bursts to get through at the beginning of the time frame.
However, this example oversimplifies, because we actually cannot tell the exact number of pending requests at any given moment, only an upper limit. So we would have to throttle each time based on this number.
This sounds like a problem within the domain of control theory. Specifically, I'm thinking a PID controller might work.
A first crack at the problem might be dividing the number of records by the estimated time until next batch. This would be like a P controller - proportional only. But then you run the risk of overestimating the time, and building up some unsent records. So try adding in an I term - integral - to account for built up error.
I'm not sure you even need a derivative term, if the variation in batch size is random. So try using a PI loop - you might build up some backlog between bursts, but it will be handled by the I term.
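One loose way to read this suggestion is sketched below. The proportional part is the backlog divided by the time left until the next burst; the integral part accumulates how far off the planned gap was and shortens or stretches the plan accordingly. The class name `BurstPacer`, the gain `ki` and the cap `max_rate` are all hypothetical choices for illustration, not a tuned controller.

```python
class BurstPacer:
    """Spread a burst over the planned gap to the next burst (P term) and
    correct that plan with accumulated timing error (I term)."""

    def __init__(self, expected_gap, ki=0.2, max_rate=10.0):
        self.expected_gap = expected_gap  # initial guess, in seconds
        self.ki = ki                      # integral gain (assumed value)
        self.max_rate = max_rate          # hard cap, requests per second
        self.integral = 0.0               # sum of (planned - actual) gaps

    def planned_gap(self):
        return max(self.expected_gap - self.ki * self.integral, 1e-3)

    def on_burst(self, actual_gap):
        # Positive error: the burst came earlier than planned (too slow).
        self.integral += self.planned_gap() - actual_gap

    def rate(self, backlog, elapsed):
        remaining = max(self.planned_gap() - elapsed, 1e-3)
        return min(self.max_rate, backlog / remaining)


pacer = BurstPacer(expected_gap=10.0)
print(pacer.rate(backlog=15, elapsed=0.0))  # 1.5 requests per second

# Bursts keep arriving after 8 s instead of the assumed 10 s.
for _ in range(50):
    pacer.on_burst(actual_gap=8.0)
print(round(pacer.planned_gap(), 2))  # 8.0
```

With the assumed gain, the planned gap converges towards the observed 8 seconds, so later bursts are paced a little faster and the backlog from overestimating the time stops building up.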
If it's unacceptable to have a backlog, then the solution might be more complicated...
If there are no other constraints, what you should do is figure out the maximum rate at which you are comfortable sending additional requests, and limit your processing speed accordingly. Then monitor what is happening. If that gets through all of your requests quickly, then there is no harm. If the sustained level of processing is not fast enough, then you need more capacity.
