Resilience4j behavior for permittedNumberOfCallsInHalfOpenState - circuit-breaker

We are using the Resilience4j circuit breaker with the following configuration:
slidingWindowSize: 60
slidingWindowType: TIME_BASED
minimumNumberOfCalls: 100
waitDurationInOpenState: 5s
failureRateThreshold: 50
permittedNumberOfCallsInHalfOpenState: 6000
According to this GitHub issue answer, the circuit breaker allows permittedNumberOfCallsInHalfOpenState calls in the HALF_OPEN state and then calculates the failure rate. But our sliding window size is 60s.
In order to change state, does the circuit breaker wait until all 6000 calls have completed, irrespective of the sliding window size, or does it calculate within the next sliding window?
For example, suppose we allow only 300 calls (using a rate limiter) per sliding window, which is 60s. If the circuit breaker waits for all 6000 calls to complete before deciding the state, it must wait for the next 20 minutes. But if the circuit breaker gives preference to the sliding window, it must decide the state within the next 60s.
What is the behaviour in this case?
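The 20-minute figure in the question can be reproduced with a small calculation. This is only a sketch of the asker's arithmetic, not a model of what Resilience4j does in HALF_OPEN; the class and method names are hypothetical.

```java
// Hypothetical helper: reproduces only the arithmetic from the question.
// It does not call or model the Resilience4j library.
final class HalfOpenEstimate {

    /** Seconds needed to admit permittedCalls through a limiter that allows callsPerPeriod every periodSeconds. */
    static long secondsToAdmit(long permittedCalls, long callsPerPeriod, long periodSeconds) {
        // Round up to whole limiter periods.
        long periods = (permittedCalls + callsPerPeriod - 1) / callsPerPeriod;
        return periods * periodSeconds;
    }

    public static void main(String[] args) {
        // 6000 permitted calls at 300 calls per 60 s window.
        long seconds = secondsToAdmit(6000, 300, 60);
        System.out.println(seconds / 60 + " minutes");
    }
}
```

With the numbers from the question, 6000 calls at 300 calls per 60 seconds take 20 windows, i.e. 1200 seconds.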

Related

How to configure Spring SimpleMessageListenerContainer receiveTimeout in order to scale up to a reasonable number of consumers

Use case
A backend consuming messages at various rate and inserting the messages in a DB.
Today in production my SimpleMessageListenerContainer scales up to maxConcurrentConsumers even when that is not necessary to handle the traffic rate.
Problem
I am trying to find the proper configuration of the Spring SimpleMessageListenerContainer so that Spring scales the number of consumers up and down to the number adequate for the incoming traffic.
With a fixed injection rate on a single-node RabbitMQ, I have noticed that the scaling process stabilizes at
numberOfConsumers = (injectionRate * receiveTimeoutInMilliseconds) / 1000
For example :
injection rate : 100 msg/s
container.setReceiveTimeout(100L); // 100 ms
--> consumers 11
--> Consumer capacity 100%
injection rate : 100 msg/s
container.setReceiveTimeout(1000L); // 1 s - default
--> consumers 101
--> Consumer capacity 100%
Knowing that more consumers means more threads and more AMQP channels, I am wondering why the scaling algorithm is not linked to the consumerCapacity metric, and why the default receive timeout is set to 1 second.
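The observed relationship can be written as a small function. This is a sketch of the asker's empirical formula only, with hypothetical names; it is not part of Spring AMQP. Note that the observed consumer counts (11 and 101) are one higher than the formula gives.

```java
// Hypothetical helper: encodes the empirical formula from the question,
// numberOfConsumers = (injectionRate * receiveTimeoutInMilliseconds) / 1000.
final class ConsumerEstimate {

    static long estimatedConsumers(long messagesPerSecond, long receiveTimeoutMillis) {
        return messagesPerSecond * receiveTimeoutMillis / 1000;
    }

    public static void main(String[] args) {
        // 100 msg/s with a 100 ms receive timeout, then with the 1 s default.
        System.out.println(estimatedConsumers(100, 100));
        System.out.println(estimatedConsumers(100, 1000));
    }
}
```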
See the documentation https://docs.spring.io/spring-amqp/docs/current/reference/html/#listener-concurrency
In addition, a new property called maxConcurrentConsumers has been added and the container dynamically adjusts the concurrency based on workload. This works in conjunction with four additional properties: consecutiveActiveTrigger, startConsumerMinInterval, consecutiveIdleTrigger, and stopConsumerMinInterval. With the default settings, the algorithm to increase consumers works as follows:
If the maxConcurrentConsumers has not been reached and an existing consumer is active for ten consecutive cycles AND at least 10 seconds has elapsed since the last consumer was started, a new consumer is started. A consumer is considered active if it received at least one message in batchSize * receiveTimeout milliseconds.
With the default settings, the algorithm to decrease consumers works as follows:
If there are more than concurrentConsumers running and a consumer detects ten consecutive timeouts (idle) AND the last consumer was stopped at least 60 seconds ago, a consumer is stopped. The timeout depends on the receiveTimeout and the batchSize properties. A consumer is considered idle if it receives no messages in batchSize * receiveTimeout milliseconds. So, with the default timeout (one second) and a batchSize of four, stopping a consumer is considered after 40 seconds of idle time (four timeouts correspond to one idle detection).
Practically, consumers can be stopped only if the whole container is idle for some time. This is because the broker shares its work across all the active consumers.
So, when you reduce the receiveTimeout you would need a corresponding increase in the idle/active triggers.
The default is 1 second to provide a reasonable compromise between spinning an idle consumer while retaining responsive behavior to a container stop() operation (idle consumers are blocked for the timeout). Increasing it will cause a less responsive container (for stop()).
It is generally unnecessary to set it lower than 1 second.
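The idle-time arithmetic from the quoted documentation, and the advice to raise the triggers when lowering receiveTimeout, can be sketched as follows. The helper is hypothetical and only multiplies the three settings; it does not touch the container itself.

```java
// Hypothetical helper: idle time that must elapse before stopping a consumer
// is considered, per the quoted documentation:
// consecutiveIdleTrigger * batchSize * receiveTimeout.
final class IdleWindow {

    static long idleMillisBeforeStopConsidered(int consecutiveIdleTrigger, int batchSize, long receiveTimeoutMillis) {
        return (long) consecutiveIdleTrigger * batchSize * receiveTimeoutMillis;
    }

    public static void main(String[] args) {
        // Documentation example: trigger 10, batchSize 4, receiveTimeout 1 s.
        System.out.println(idleMillisBeforeStopConsidered(10, 4, 1000));
        // Lowering receiveTimeout to 100 ms shrinks the idle window tenfold...
        System.out.println(idleMillisBeforeStopConsidered(10, 4, 100));
        // ...unless the idle trigger is raised correspondingly.
        System.out.println(idleMillisBeforeStopConsidered(100, 4, 100));
    }
}
```

Under these assumptions, a 100 ms receiveTimeout needs an idle trigger of 100 to keep the same 40-second window as the documented example.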

Constant Throughput Timer seems to only gauge for 1 min

I am trying to run a POST request in JMeter. I want 10 requests to fire per second over a period of 1 hour. How can I achieve this?
Looking around, Constant Throughput Timer seems to be the popular option.
But for some reason, no matter what I switch around, I end up with only 500 requests. Can I please get some guidance as to why? It feels like such a basic option yet I simply can't figure it out. Been at it for hours and just not going anywhere.
My settings (For testing just trying with 2 mins, so I expect to end up with 1200 requests).
Thread Group:
Number of threads: 20
Ramp Up Period: 1
Scheduler checked.
Duration Set for 120 seconds (2 mins).
I then add the Constant Throughput Timer and set its value to 600 (thus 10 requests per second).
As mentioned above, running this gives me 500 requests. I was expecting 1200 requests. Why? Even if I extend my duration to 3 mins, it is still 500. Please help.
Constant Throughput Timer can only pause threads to limit them to the desired throughput; it cannot make them go faster. So if you want to achieve 10 requests per second with 20 users, your application's response time must be at most 2 seconds (20 threads / 10 requests per second). If it is higher, the number of requests per unit of time will be proportionally lower.
So first of all, try increasing the number of threads.
Make sure to follow JMeter Best Practices (just in case JMeter is not capable of sending requests fast enough).
You may find Concurrency Thread Group and Throughput Shaping Timer more convenient and precise. Moreover, this combination can kick off extra threads if the current number is not enough to reach or maintain the defined throughput.
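The relationship between thread count, response time and achievable throughput can be sketched as below. It assumes each thread sends one request, waits for the response, and immediately sends the next, with no other delays; the names are hypothetical and this is not JMeter code.

```java
// Hypothetical helper: upper bound on throughput when each thread waits
// for its response before sending the next request.
final class ThroughputCeiling {

    /** Highest request rate the given number of threads can sustain. */
    static double maxRequestsPerSecond(int threads, long responseTimeMillis) {
        return threads * 1000.0 / responseTimeMillis;
    }

    /** Smallest number of threads needed for the target rate (rounded up). */
    static long threadsNeeded(long targetPerSecond, long responseTimeMillis) {
        return (targetPerSecond * responseTimeMillis + 999) / 1000;
    }

    public static void main(String[] args) {
        // 20 threads with a 2 s response time.
        System.out.println(maxRequestsPerSecond(20, 2000));
        // Threads needed for 10 requests/s if responses take 5 s.
        System.out.println(threadsNeeded(10, 5000));
    }
}
```

Under this model, 20 threads reach 10 requests per second only while responses take 2 seconds or less; slower responses call for more threads.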

The correct use of timers in a thread group (Until now my timers get ignored)

My goal is to simulate 500 users that perform certain requests on the website over a period of five minutes.
To make the test come as close as possible to reality, I want to add a think time between requests (here: two seconds). The problem is that no matter what I do, the timers get ignored. To give you an example, I would like to perform a login request every 2 seconds. Here is the data of the thread group:
Number of Threads: 500
Ramp-Up Period: 300
Loop Count: 1
So what I did do till now to achieve this:
I used the Constant Timer and put it as a child of my request. That didn't work; the timer just gets ignored, no matter what value I use.
I tried the Constant Throughput Timer, but that didn't work either; the values get ignored.
What am I doing wrong? I added a screenshot so you can see where I put the Constant Timer in my test plan.
Screenshots of my testplan:
In your case you can work without timers: set the ramp-up period to Number of Threads * 2 (seconds) so that a thread starts approximately every 2 seconds.
So in your case just set Ramp-Up Period: 1000 (and remove the timer).
You are using the wrong timer. Constant Timer just adds a delay of 5 seconds before each request. If you want JMeter to perform a login every 2 seconds, you should consider switching to Constant Throughput Timer.
Remember that Constant Throughput Timer is precise only at the minute level, so you might need to adjust the ramp-up period at the Thread Group level in order to limit the thread execution rate during the first 60 seconds. Alternatively, you can consider using the Throughput Shaping Timer plugin.
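The ramp-up suggestion rests on simple arithmetic: threads are started evenly across the ramp-up period. A minimal sketch, assuming even spacing and using hypothetical names (this is not JMeter code):

```java
// Hypothetical helper: interval between thread starts when threads are
// spread evenly over the ramp-up period.
final class RampUp {

    static long startIntervalMillis(long rampUpSeconds, int threads) {
        return rampUpSeconds * 1000 / threads;
    }

    public static void main(String[] args) {
        // Original setting: 500 threads over 300 s.
        System.out.println(startIntervalMillis(300, 500));
        // Suggested setting: 500 threads over 1000 s.
        System.out.println(startIntervalMillis(1000, 500));
    }
}
```

With Loop Count 1, each thread performs its requests once, so a 1000-second ramp-up spaces the 500 logins about 2 seconds apart, while the original 300 seconds spaces them 0.6 seconds apart.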

How are server hits/second more than active thread count? | Jmeter

I'm running a load test to test the throughput of a server by making HTTP requests through JMeter.
I'm using the Thread Stepper plugin that allows me to increase the number of threads I'm using to make the requests after a particular time period.
The following graphs show the number of active threads with time and another one shows the corresponding hits per second I was able to make.
The third graph shows the latencies of the requests. The fourth one shows the response per second.
I'm not able to correlate the four graphs together.
In the server hits per second, I'm able to make a maximum of around 240 requests per second with only 50 active threads. However, the latency of the request is around 1 second.
My understanding is that a single thread would make a request, and then wait for the response to return before making the second request.
Since the minimum latency in my case is around 1 second, how is JMeter able to hit 240 requests per second with only 50 threads?
Server hits per second, max of 240 with only 50 threads. How?
Response latencies (minimum latency of 1 sec)
Active threads with time (50 threads when server hits are 240/sec)
Response per second (max of 300/sec, how?)
My expectation is that the reason is one of the following:
The response time is less than 1 second, so JMeter is able to send more than one request per second with every thread.
It might also be connected with HTTP redirections and/or embedded resources processing, as per the plugin's documentation:
Hits includes child samples from transactions and embedded resources hits.
For example this single HTTP Request with 1 single user results in 20 sub-samples which are being counted by the "Server Hits Per Second" plugin.
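A rough model shows how sub-samples could lift the hit rate above the thread count. It assumes every top-level sample takes the same time and produces a fixed number of counted hits; both are simplifying assumptions, and the names are hypothetical.

```java
// Hypothetical model: hits per second when each top-level sample produces
// several counted hits (for example embedded resources or redirects).
final class HitsModel {

    static double hitsPerSecond(int threads, int hitsPerSample, long sampleTimeMillis) {
        return threads * (double) hitsPerSample * 1000.0 / sampleTimeMillis;
    }

    public static void main(String[] args) {
        // 50 threads, 1 s per sample, one hit each.
        System.out.println(hitsPerSecond(50, 1, 1000));
        // Same load, but each sample counts as 5 hits.
        System.out.println(hitsPerSecond(50, 5, 1000));
    }
}
```

Under this model, 50 threads at 1 second per sample produce 50 hits per second with one hit per sample, but 250 if each sample is counted as 5 hits, which is in the range of the 240 reported in the question.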
I took some time to analyze the four graphs you provided, and they appear to be plotted correctly. Since you feel JMeter is plotting incorrectly, I will try to explain why the graphs look normal to me. Taking a cue from point 1 of the answer that @Dmitri T provided, here is my analysis:
1. As pointed out by @Dmitri T, responses are coming in faster than hits (requests) are sent to the server. This can be seen in the responses per second graph: the first batch of hits is sent at between 50 and 70 per second during the first five minutes, while the responses for this set of requests arrive at a higher rate, 60 to 90 per second, over the same five minutes. The same trend is observed for the hits fired from five to ten minutes: 100 to 150 responses compared to 85 to 130 hits. Because of this continuing trend, the load generator is able to send more and more hits with the 50 active threads, which, coupled with the Thread Stepper plugin's stepping, gives the upward slope.
Hence the hits and responses graphs move in lockstep, with the responses graph having a steeper slope than the hits per second graph.
This upward trend continues until the queuing effect, caused by the entire processing capacity being in use, sets in at 23 minutes. From this point on, all the graphs do the opposite of what they had been doing up to then.
The response latency (the time taken to get the response) increases from the 23rd minute on. At the same time there is a drop in hits per second, perhaps because the load generator does not have enough free threads to fire the next request: the threads (users) are queued and have not finished their current request. This drop in requests lowers the rate of receiving responses, as seen in the responses per second graph. Still, you can see that the "service center" keeps processing the requests efficiently, sending back responses faster than requests arrive. In queuing theory terms, the service rate is higher than the arrival rate, which reinforces point 1 of the analysis.
At a load of 60 users, queuing happens. (Confirm this by checking whether the response time graph and the throughput graph change at the same time. If so, requests were piling up at the server, i.e. queued.) This is the point where all the service centers are busy, so response times suffer, which keeps the user threads from generating new hits and lowers the hits per second.
The error codes observed in the responses per second graph, namely 400, 403, 500 and 504, appear from the 10th user load onwards, which may indicate a time-bound or data issue (for example, the first 10 users in your CSV have proper data in the database and the rest don't).
Or it could be related to the "credit" or "debit" transactions, since the two may conflict or be deadlocked on a bank account, etc.
Notice that the error codes are most numerous where the volume of responses is highest, i.e. up to the 23rd minute, and decrease afterwards because fewer responses arrive once queuing starts. The number of errors is therefore directly proportional to the number of responses. The 504 (Gateway Timeout) error is a sure sign that processing takes so long that the web server times out, which means the load is high. So we can consider the load up to 80 users, i.e. at the 40th minute, a reasonable load-bearing capacity of the system. (Obviously, if more 504 errors are observed, we can fix that point as the unstressed load the system can handle.)
Important: check your Hits per Second graph configuration. Another possibility is that the interval used to plot the graph is not in sync with the expected scale, i.e. per second. You are expecting hits per second, but the graph could be configured to plot at 500 ms (half a second) intervals, which could push the plotted values higher than the 50 hits you expect for 50 users.

Hystrix Configuration

I am trying to implement hystrix for my application using hystrix-javanica.
I have configured hystrix-configuration.properties as below
hystrix.command.default.execution.isolation.strategy=SEMAPHORE
hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds=10000 
hystrix.command.default.fallback.enabled=true
hystrix.command.default.circuitBreaker.enabled=true
hystrix.command.default.circuitBreaker.requestVolumeThreshold=3 
hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds=50000
hystrix.command.default.circuitBreaker.errorThresholdPercentage=50
The short-circuit pattern is working fine, but I have a doubt about hystrix.command.default.circuitBreaker.requestVolumeThreshold=3.
Does it mean: open the circuit after 3 failures,
or
open the circuit after 3 concurrent failures?
I have gone through the documentation link.
Can anybody answer?
 
 
How Hystrix Circuit-Breaker operates: Hystrix does not offer a circuit breaker which breaks after a given number of failures. The Hystrix circuit will break if:
within a timespan of duration metrics.rollingStats.timeInMilliseconds, the percentage of actions resulting in a handled exception exceeds errorThresholdPercentage, provided also that the number of actions through the circuit in the timespan is at least requestVolumeThreshold
What is requestVolumeThreshold?
requestVolumeThreshold is a minimum threshold for the volume (number) of calls through the circuit that must be met (within the rolling window), before the circuit calculates a percentage failure rate at all. Only when this minimum volume (in each time window) has been met, will the circuit compare the failure proportion of your calls against the errorThresholdPercentage you have configured.
Imagine there was no such minimum-volume-through-the-circuit threshold. Imagine the first call in a time window errors. You would have 1 of 1 calls being an error, = 100% failure rate, which is higher than the 50% threshold you have set. So the circuit would break immediately.
The requestVolumeThreshold exists so that this does not happen. It's effectively saying, the error rate through your circuit isn't statistically significant (and won't be compared against errorThresholdPercentage) until at least requestVolumeThreshold calls have been received in each time window.
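The decision described above can be sketched as a small function. This is a simplified model with hypothetical names, not Hystrix source code. It treats a failure rate at or above the threshold as tripping, which is an assumption about the boundary case; the description above says "exceeds".

```java
// Hypothetical, simplified model of the trip decision for one rolling window.
// Not Hystrix code: the real breaker reads these counts from its metrics.
final class TripDecision {

    static boolean shouldTrip(long totalRequests, long failedRequests,
                              int requestVolumeThreshold, int errorThresholdPercentage) {
        // Below the minimum volume the failure rate is not evaluated at all.
        if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
            return false;
        }
        long errorPercentage = failedRequests * 100 / totalRequests;
        // Assumption: "at or above" the threshold trips the circuit.
        return errorPercentage >= errorThresholdPercentage;
    }

    public static void main(String[] args) {
        // One failing call out of one: 100% failures, but volume 1 < 3.
        System.out.println(shouldTrip(1, 1, 3, 50));
        // Two failures out of three calls: volume met, 66% failures.
        System.out.println(shouldTrip(3, 2, 3, 50));
    }
}
```

With requestVolumeThreshold=3, a single failing call never opens the circuit; only once three calls have passed through in the window is the failure percentage compared against errorThresholdPercentage.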
I am rather new to Hystrix, but I guess I can help you.
In general, hystrix.command.default.circuitBreaker.requestVolumeThreshold is a property that sets the minimum number of requests in a rolling window required before the circuit can trip. Its default value is 20, and it can be changed in the properties file or on our @HystrixCommand annotated method.
For example, if the property value is 20 and only 19 requests are received in the rolling window (say a window of 10 seconds), the circuit will not trip open even if all 19 failed. Once at least 20 requests have been received in the window and the failure percentage reaches errorThresholdPercentage, the circuit is opened and subsequent calls are sent to the fallback, even if they would succeed, until the sleep window has elapsed.
The sleep window sets the amount of time, after tripping the circuit, to reject requests before allowing attempts again to determine whether the circuit should be closed again. It defaults to 5000 milliseconds. This can be changed by overriding the circuitBreaker.sleepWindowInMilliseconds property.
You can find all the properties and their descriptions here.

Resources