Circuit Breaker for an orchestrator - circuit-breaker

I have a facade which calls 3 different services for some type of requests and finally orchestrates the responses before sending the response back to the client. Here, it is mandatory that all 3 services are up and serving as expected. The client request can not be served even one of them is down.
I am looking for a circuit breaker to solve this problem. The circuit breaker should respond with error code even one of the service is down. I was checking the resilence4j circuit breaker and it doesnt fit for my problem.
https://resilience4j.readme.io/docs/circuitbreaker
Is there any other open source available?

Why doesn't it fit to you problem?
You can protect every service with a CircuitBreaker. As soon one of the CircuitBreakers is open, you can short circuit and directly return an error response to your client.

CircuitBreaker Works on protected function as below –
Thread <—> CircuitBreaker <—> Protected_Function
So a Protected_Function can call 1 or more microservices, Mostly we use 1 Protected_Function for 1 external micro service call because we have can tune resilience based on the profile or behavior of that particular micro-service. But as your requirement is different so we can have 3 calls under 1 Protected_Function.
So as per your explanation above, you Façade is calling 3 micro-services (assume in series). What you can do is to call you Façade or all 3 services through or inside a Protected Function –
#CircuitBreaker(name = "OVERALL_PROTECTION")
public Your_Response Protected_Function (Your_Request) {
Call_To_Service_1;
Call_To_Service_2;
Call_To_Service_3;
return Orchestrate_Your_Response;
}
Further you can add resilience for OVERALL_PROTECTION in your YAML property file as below (I have used Count based Sliding Window) –
resilience4j.circuitbreaker:
backends:
OVERALL_PROTECTION:
registerHealthIndicator: true
slidingWindowSize: 100 # start rate calc after 100 calls
minimumNumberOfCalls: 100 # minimum calls before the CircuitBreaker can calculate the error rate.
permittedNumberOfCallsInHalfOpenState: 10 # number of permitted calls when the CircuitBreaker is half open
waitDurationInOpenState: 10s # time that the CircuitBreaker should wait before transitioning from open to half-open
failureRateThreshold: 50 # failure rate threshold in percentage
slowCallRateThreshold: 100 # consider all transactions under interceptor for slow call rate
slowCallDurationThreshold: 2s # if a call is taking more than 2s then increase the error rate
recordExceptions: # increment error rate if following exception occurs
- org.springframework.web.client.HttpServerErrorException
- java.io.IOException
- org.springframework.web.client.ResourceAccessException
You can also use time based slidingWindow instead of count based if you wish, Rest I have mentioned #Comment for self explanation in front of each parameter in configuration.
resilience4j.retry:
instances:
OVERALL_PROTECTION:
maxRetryAttempts: 5
waitDuration: 100
retryExceptions:
- org.springframework.web.client.HttpServerErrorException
- java.io.IOException
- org.springframework.web.client.ResourceAccessException
Above configuration will perform a retry for 5 times if Exceptions under retryExceptions occurs.
resilience4j.ratelimiter:
instances:
OVERALL_PROTECTION:
timeoutDuration: 100ms #The default wait time a thread waits for a permission
limitRefreshPeriod: 1000 #The period of a limit refresh. After each period the rate limiter sets its permissions count back to the limitForPeriod value
limitForPeriod: 25 #The number of permissions available during one limit refresh period
Above configuration will allow maximum up to 25 transactions in 1 second.

Related

Resilience4j circuit breaker doesn't open when slowCallRateThreshold is reached?

I've decorated the following method with a Resilience4j CircuitBreaker:
#CircuitBreaker(name = "getSwaggerFileName")
private Optional<String> getSwaggerFileName(String url) {
return Optional.ofNullable(restTemplate.getForObject(url, SwaggerQueryResponse.class))
.flatMap(SwaggerQueryResponse::getFileName);
}
My circuit breaker config is stored in application.yaml:
resilience4j:
circuitbreaker:
configs:
default:
slowCallRateThreshold: 50
slowCallDurationThreshold: 1
slidingWindowSize: 20
waitDurationInOpenState: 60000
instances:
getSwaggerFileName:
baseConfig: default
This file is definitely being found, and a similar setup works fine with the #RateLimiter annotation. I've lowered the slowCallDurationThreshold from 2000ms to 1ms, and even tried introducing a Thread.sleep call to the method to verify it's taking longer than the duration threshold. The method runs around 100 times (edit: and the problem still occurs when slidingWindowSize and minimumNumberOfCalls are set well below this number).
But my circuit breaker doesn't open, after around 100 calls to this method. As I see it, it should open if more than 50% of those calls take more than 1ms. Am I missing something?
It's because the method is private. The method must be public so that it can be proxied.

How Kafka retries works with request.timeout.?

I have configured my Producer with request.timeout.ms = 70,0000ms and retries=5. I have doubt how this actually works,
After this "request.timeout.ms=70,000" expires it retries for 5 times or within given "request.timeout.ms=70,000" it retries for 5 time with retry.backoff.ms value.?
There are 3 important configs to be aware of:
"request.timeout.ms" - time to retry a single request
"delivery.timeout.ms" - time to complete the entire send operation
"retries" - how many times to retry when the broker responds with retriable errors.
The Apache Kafka recommendation is to set "delivery.timeout.ms" and leave the other two configurations with their default value. The idea is that the main thing you as a user should worry about is how long you want to way for Kafka to figure things out before giving up on it. It doesn't really matter what is taking Kafka so long - the connection, getting metadata, long queues, etc, the only thing that matters is how long you are willing to wait.
Now to your question - request.timeout.ms applies on each retry. So Producer will send the recordbatch to Kafka, and if there's no response after 70,000ms it will consider this a failure and retry. Note that most errors (say, NoLeaderForPartition) will return from the broker much faster (which is why retry backoffs are needed).
Reasoning about delivery times with retries + request.timeout.ms turned out to be near impossible even for those who wrote the producer. Hence, the introduction of delivery.time.ms with a very clear contract.

Gatling - problem with users creation with rampUsers when other scenarios are working

I have spotted some strange behaviour of gatling.
I have 5 scenarios which have such "setup":
scn[00-04].inject(constantUsersPerSec(simulationConfig.UsersPerSec) during
(simulationConfig.maxDurationSeconds seconds).max(simulationConfig.maxDurationSeconds)).protocols(protocol),
UsersPerSec = ~0.8
maxDurationSeconds = 240
and additionaly one scenario with:
scn05.inject(nothingFor(120 seconds), rampUsers(100) during(2 seconds))
.protocols(protocol)
So whole load last 240 seconds and after 120 seconds additional load should be created during 2 seconds. What I have observed:
after starting creation of additional 100 new "users", old "users" from first 5 scenarios did not execute requests / stay inactive some time (why?)
amount of 100 "users" is not reached after start ( internally only one blocking request to soap service is sent during one action - it can take from 5 seconds to ~30seconds ), it stays on about level of 70 (dark blue line) but it should reach amount of 100 "users" (why?)
OK, I have created another test-project with:
scn00.inject(rampUsers(100) during(2 seconds))
.protocols(protocol),
scn01.inject(nothingFor(20 seconds),rampUsers(100) during(2 seconds))
.protocols(protocol)
In those scenarios I had left only
.feed(customFeeder)
.pause(1 seconds)
.exec(customAction("Action X")((testData) => {
val y = testData.getConfig -- sometimes reads shared file during initialization
Thread.sleep(20000)
}))
and results looked:
why in this chart we can see that:
amount of 100 users was not reached? rampUsers(100) during (2 seconds)
in that point this scenario should ends ( it contains almost only Thread.sleep(20000) )
why: second scenario could reach amount of 100 users even if it differs only with nothingFor(20 seconds), additionaly: why this scenario did not start after 20seconds but 40 seconds?
why: second scenario was frozen for 60 seconds and started to finishing after first scenario was completed by all users?
why it did not looked more or less like this:
I guess you're confusing injection profile for something that would drive virtual users lifespan.
Injection profiles only drive when users are injected/started.
Lifespan is driven by your scenarios, once a user reach the end of its scenario, it terminates.

Socket Exception while running load test with Self Provisioned test rig

I am getting the Socket Exception while running load test on self-provisioned test rig.
I am trigger those load tests in agent machine(self-provisioned test rig) from my local machine.
Note : For first 2 to 3 minutes test iterations are passing , after that we are getting the Socket Exception.
Below is the error message :
A connection attempt failed because the connected party did not
properly respond after a period of time, or established connection
failed because connected host has failed to respond.
Below are the stack trace details :
at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult) at
System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure,
Socket s4, Socket s6, Socket& socket, IPAddress& address,
ConnectSocketState state, IAsyncResult asyncResult, Exception&
exception)
Run Time - 20min
Sample rate - 10sec
warm up duration 10sec
number of agents used - 2
Load pattern :
initial load - 10user
max user count - 300
step duration - 10sec
step user count - 10
Although, by Changing above values I am still getting the exception in the same way.
I am using Visual studio 2015 enterprise.
The question states: start with 10 users, every 10 seconds add 10 users to a maximum of 300. Thus after 29 increments there will be 300 users and that will take 29*10 seconds which is 4m50s. The test will thus (attempt to) run with the maximum load of 300 users for the remaining 15m10s.
Given that all tests pass for the first 2 or 3 minutes plus the the error message, that suggests that you are overloading some part of the network. It might be the agents, it might be the servers or it might be on the connections between them. Some network components have a maximum number of active connections and the 300 users might be too many.
Increasing the load so rapidly means you do not clearly know what the limiting value. The sampling rate (at 10 seconds) seems high. At each sampling interval a lot of data is transferred (i.e. the sample data) and that can swamp parts of the network. You should look at the network counters for the agents and controller, also the servers if available.
I recommend changing the load test steps to add 10 users every 30 seconds, so it takes about 15 minutes to reach 300 users. It may also be worth reducing the sample rate to every 20 seconds.

Jmeter - I have run 2 test cases but result seems odd

I have run load testing for website but when I have increased no. of users , I can see throughput time seems increasing instead of decrease.
Test Case 1 :
No. of Threads : 15
Ramp up time : 450 [As I want to put delay of 30 seconds between 2 users]
Loop count : Forever
Scheduler : 1800 Seconds [As I want to run test for 30 minutes]
In Http requests I have added 10 pages and each request has constant timer with 30000 miliseconds as I need to put delay of 30 seconds between 2 requests.
Now When I see result of Aggregate Report , it shows me Throughput 3/min for each request.
Test Case 2 :
No. of Threads : 30
Ramp up time : 900 [As I want to put delay of 30 seconds between 2 users]
Loop count : Forever
Scheduler : 1800 Seconds [As I want to run test for 30 minutes]
In Http requests I have added 10 requests/pages and each request has constant timer with 30000 miliseconds as I need to put delay of 30 seconds between 2 requests.
Now When I see result of Aggregate Report , it shows me Throughput 6/min for each request.
I am confuse that how it is possible? If my users are increased from 15 to 30 then it should have more load on server and throughtput should decrease like 1/min or 2/min.
Please let me know what I am doing wrong here.
Throughput is no. of completions per unit time. (A completion can be a http request/db request in short anything that needs to be executed and needs >0 execution time.)
Ex. req per sec or req per min etc.
By definition of throughput in JMeter, it is calculated as total no. of requests/total time.
In your first case, no. of requests generated in 1800 seconds with 3 second delay in every request by 15 users are x. Thus throughput is x/30 i.e. 3 it means ~90 requests were generated (verify this from aggregate report or other reporter.)
In your second case, everything else is same but no. of users are doubled which creates ~double no. of requests in given time which is (1800 seconds)
Thus according to formula, no. of requests generated/total time.
Throughput in 2nd case = 2x/30 = 2*throughput in 1st case
Which is 6/min. (Correctly shown by JMeter.)
Key here is to check no. of requests generated in both cases.
I hope this clears your confusion. Let me know if you need further clarification. BTW "when I have increased no. of users , I can see throughput time seems increasing instead of decrease." is not always true.
Throughput increased by factor of 2.
Test Case 1: - 3 requests per minute - 1 request each 20 seconds
Test Case 2: - 6 requests per minute - 1 request each 10 seconds
As per JMeter Glossary:
Throughput is calculated as requests/unit of time. The time is calculated from the start of the first sample to the end of the last sample. This includes any intervals between samples, as it is supposed to represent the load on the server.
The formula is: Throughput = (number of requests) / (total time).
You may also be interested in the following plugins:
Server Hits Per Second
Transactions Per Second
or alternatively Loadosophia.org service which can convert your JMeter .jtl results files into easy-understandable professional load report

Resources