Resilience4J Circuit Breaker does not switch back to CLOSED state - spring-boot

I have implemented a circuit breaker for a service that is called once every 5 minutes.
Looking at the screenshots below:
[Screenshot: captures when the outage occurred in the service]
[Screenshot: captures when the circuit breaker opens]
All calls to the service after the ones that triggered the circuit breaker to OPEN succeed. (Notice the timeout count = 2 for the third bar from the start.)
I have the following circuit breaker configurations:
circuit-breaker:
  automatic-transition-from-open-to-half-open-enabled: ${AMA_AUTOMATIC_TRANSITION_FROM_OPEN_TO_HALF_OPEN_ENABLED:true}
  failure-rate-threshold: ${AMA_FAILURE_RATE_THRESHOLD:40}
  max-wait-duration-in-half-open-state: ${AMA_MAX_WAIT_DURATION_IN_HALF_OPEN_STATE:5s}
  permitted-number-of-calls-in-half-open-state: ${AMA_PERMITTED_NUM_OF_CALLS_IN_HALF_OPEN_STATE:10}
  register-health-indicator: ${AMA_REGISTER_HEALTH_INDICATOR:true}
  sliding-window-size: ${AMA_SLIDING_WINDOW_SIZE:5}
  sliding-window-type: ${AMA_SLIDING_WINDOW_TYPE:COUNT_BASED}
  slow-call-duration-threshold: ${AMA_SLOW_CALL_DURATION_THRESHOLD:5s}
  slow-call-rate-threshold: ${AMA_SLOW_CALL_RATE_THRESHOLD:40}
  wait-duration-in-open-state: ${AMA_WAIT_DURATION_IN_OPEN_STATE:5s}
  writable-stack-trace-enabled: ${AMA_WRITABLE_STACK_TRACE_ENABLED:true}
  minimum-number-of-calls: ${AMA_MINIMUM_NUM_OF_CALL:5}
  timeout-duration: ${AMA_TIMEOUT_DURATION:5s}
Firstly, according to the configuration, at least 5 calls should fail for the circuit breaker to switch to OPEN, but the graph shows that only 2 failures switched it to OPEN, which does not seem correct.
Secondly, why is it still in the OPEN state after 2 hours? It should have switched back to CLOSED by now.
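On the first point, a count-based window compares a failure rate, not an absolute number of failures, against failure-rate-threshold. The following is a minimal sketch of that arithmetic, assuming (as the Resilience4j documentation describes) that the breaker opens when the rate is greater than or equal to the threshold once minimum-number-of-calls have been recorded. The class and method names are hypothetical; this is not library code.

```java
class FailureRateSketch {

    /** Failure rate in percent over the calls recorded in the sliding window. */
    static float failureRatePercent(int failedCalls, int recordedCalls) {
        return failedCalls * 100.0f / recordedCalls;
    }

    /** Assumed rule: open once enough calls are recorded and rate >= threshold. */
    static boolean shouldOpen(int failedCalls, int recordedCalls,
                              int minimumNumberOfCalls, float failureRateThreshold) {
        if (recordedCalls < minimumNumberOfCalls) {
            return false; // too few calls recorded to evaluate the rate
        }
        return failureRatePercent(failedCalls, recordedCalls) >= failureRateThreshold;
    }

    public static void main(String[] args) {
        // Values from the question: window 5, minimum 5 calls, threshold 40%.
        System.out.println(shouldOpen(2, 5, 5, 40.0f)); // prints "true"
        System.out.println(shouldOpen(1, 5, 5, 40.0f)); // prints "false"
    }
}
```

With the defaults from the question, two failed calls among five recorded calls give exactly 40%, which under this assumption is enough to open the breaker.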

Related

Resilience Circuit breaker config update dynamically

I have configured a circuit breaker (C.B) for Client_Service, which calls Remote_Service. In the fallback situation it calls Health_Service, which returns a Future_Timestamp: the time at which Remote_Service is expected to be up and running again.
Now I want to configure my C.B according to this Future_Timestamp. If it is 2 hours away, I want to keep the C.B in the Open state (so no calls go to Remote_Service for 2 hours). After 2 hours, a request may call Remote_Service; if it succeeds, the C.B can change to Half-Open and then to Closed, and if it fails again, the fallback method calls Health_Service and the same pattern repeats.
Is there any way to solve this, or any alternative solution?
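One possible shape for this, shown only as a hand-rolled sketch and not as Resilience4j API: keep a "blocked until" timestamp next to the remote call and refresh it from Health_Service whenever the call fails. The class `TimestampGate`, its methods, and the `now` supplier are all hypothetical names introduced for illustration.

```java
import java.time.Instant;
import java.util.function.Supplier;

class TimestampGate {
    private final Supplier<Instant> now;            // injectable time source
    private volatile Instant blockedUntil = Instant.MIN;

    TimestampGate(Supplier<Instant> now) {
        this.now = now;
    }

    /**
     * Calls Remote_Service unless we are still inside the outage window.
     * On failure, asks Health_Service for the Future_Timestamp and blocks until then.
     */
    <T> T call(Supplier<T> remoteCall, Supplier<Instant> healthService, T fallbackValue) {
        if (now.get().isBefore(blockedUntil)) {
            return fallbackValue;                    // skip Remote_Service entirely
        }
        try {
            return remoteCall.get();
        } catch (RuntimeException e) {
            blockedUntil = healthService.get();      // when Remote_Service should be back
            return fallbackValue;
        }
    }
}
```

If you prefer to stay with a Resilience4j CircuitBreaker, its manual transition methods (for example `transitionToForcedOpenState()` and `transitionToClosedState()`) could play the same role, driven by a scheduler; how that fits your setup is an assumption to verify.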

Circuit Breaker for an orchestrator

I have a facade which calls 3 different services for some types of requests and finally orchestrates the responses before sending the response back to the client. Here, it is mandatory that all 3 services are up and serving as expected. The client request cannot be served if even one of them is down.
I am looking for a circuit breaker to solve this problem. The circuit breaker should respond with an error code if even one of the services is down. I was checking the resilience4j circuit breaker and it doesn't fit my problem.
https://resilience4j.readme.io/docs/circuitbreaker
Is there any other open source available?
Why doesn't it fit your problem?
You can protect every service with a CircuitBreaker. As soon as one of the CircuitBreakers is open, you can short-circuit and directly return an error response to your client.
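A minimal sketch of that short-circuit idea, assuming each service's breaker can report whether it is currently open. `OrchestratorGuard` and its `BooleanSupplier` hooks are hypothetical names; with Resilience4j a hook could be something like `() -> breaker.getState() == CircuitBreaker.State.OPEN`.

```java
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

class OrchestratorGuard {
    /** One hook per protected service; each returns true while that breaker is OPEN. */
    private final List<BooleanSupplier> breakerIsOpen;

    OrchestratorGuard(List<BooleanSupplier> breakerIsOpen) {
        this.breakerIsOpen = breakerIsOpen;
    }

    /** Runs the orchestration only if no breaker is open; otherwise returns the error response. */
    <T> T handle(Supplier<T> orchestration, T errorResponse) {
        for (BooleanSupplier open : breakerIsOpen) {
            if (open.getAsBoolean()) {
                return errorResponse;   // short-circuit: at least one service is considered down
            }
        }
        return orchestration.get();
    }
}
```

The individual service calls inside the orchestration would still go through their own breakers, so that failures keep being recorded per service.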
A CircuitBreaker works on a protected function as shown below:
Thread <—> CircuitBreaker <—> Protected_Function
So a Protected_Function can call one or more microservices. Mostly we use one Protected_Function per external microservice call, because we can then tune resilience to the profile or behavior of that particular microservice. But since your requirement is different, we can put all 3 calls under one Protected_Function.
So, per your explanation above, your Façade calls 3 microservices (assume in series). What you can do is call your Façade, or all 3 services, inside a Protected_Function:
@CircuitBreaker(name = "OVERALL_PROTECTION")
public Your_Response Protected_Function(Your_Request request) {
    // Placeholders for the three service calls, made in series
    Call_To_Service_1;
    Call_To_Service_2;
    Call_To_Service_3;
    return Orchestrate_Your_Response;
}
Further you can add resilience for OVERALL_PROTECTION in your YAML property file as below (I have used Count based Sliding Window) –
resilience4j.circuitbreaker:
  backends:
    OVERALL_PROTECTION:
      registerHealthIndicator: true
      slidingWindowSize: 100 # start rate calc after 100 calls
      minimumNumberOfCalls: 100 # minimum calls before the CircuitBreaker can calculate the error rate.
      permittedNumberOfCallsInHalfOpenState: 10 # number of permitted calls when the CircuitBreaker is half open
      waitDurationInOpenState: 10s # time that the CircuitBreaker should wait before transitioning from open to half-open
      failureRateThreshold: 50 # failure rate threshold in percentage
      slowCallRateThreshold: 100 # consider all transactions under interceptor for slow call rate
      slowCallDurationThreshold: 2s # if a call is taking more than 2s then increase the error rate
      recordExceptions: # increment error rate if following exception occurs
        - org.springframework.web.client.HttpServerErrorException
        - java.io.IOException
        - org.springframework.web.client.ResourceAccessException
You can also use a time-based sliding window instead of a count-based one if you wish. For the rest, I have added a # comment next to each parameter in the configuration to explain it.
resilience4j.retry:
  instances:
    OVERALL_PROTECTION:
      maxRetryAttempts: 5
      waitDuration: 100
      retryExceptions:
        - org.springframework.web.client.HttpServerErrorException
        - java.io.IOException
        - org.springframework.web.client.ResourceAccessException
The above configuration makes up to 5 attempts if one of the exceptions listed under retryExceptions occurs.
resilience4j.ratelimiter:
  instances:
    OVERALL_PROTECTION:
      timeoutDuration: 100ms # The default wait time a thread waits for a permission
      limitRefreshPeriod: 1000 # The period of a limit refresh. After each period the rate limiter sets its permissions count back to the limitForPeriod value
      limitForPeriod: 25 # The number of permissions available during one limit refresh period
The above configuration allows at most 25 transactions per second.

Spring-retry restarts after maxAttempts reached when no recover method is provided

I am playing with spring-retry on spring-boot 1.5.21 and noticing that spring-retry restarts once maxAttempts is reached when there is no recover method implemented.
It works as expected if a proper recover method is implemented. Without a recover method, retry doesn't stop at maxAttempts but starts over. The number of restarts equals the configured maxAttempts. E.g., with max attempts = 3, the call executes 9 times (3 attempts * 3 restarts).
Using annotations to setup the retry block
@Retryable(include = {ResourceAccessException.class}, maxAttemptsExpression = "${retry.maxAttempts}", backoff = @Backoff(delayExpression = "${retry.delay}", multiplierExpression = "${retry.delay-multiplier}"))
Expected result with maxAttempts = 3: retry stops after 3 attempts.
Actual result: retry restarts the 3 attempts 3 times, for a total of 9 calls.
The above occurs ONLY when no recover method is provided. Based on the documentation, the recover method is optional, and I have no need for one, since there is no valid recovery in my case for a failed REST service call (no redundant service is available).
If there is no recoverer, the final exception is thrown.
If the source of the call is a listener container (e.g. RabbitMQ, JMS) then the delivery will be retried.
That's the whole point of a recoverer.
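To make the 3 * 3 = 9 arithmetic concrete, here is a small plain-Java simulation (not spring-retry code). It assumes a container that delivers a failing message up to 3 times, matching the count observed in the question; a real container's redelivery behavior depends on its configuration. All names are hypothetical.

```java
class RedeliverySketch {

    /** Stand-in for a @Retryable method without a recoverer: rethrows after maxAttempts. */
    static void retryable(int maxAttempts, Runnable target) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                target.run();
                return;                 // success: stop retrying
            } catch (RuntimeException e) {
                last = e;               // remember the failure and try again
            }
        }
        throw last;                     // no recoverer: the final exception propagates
    }

    /** Stand-in for a listener container that redelivers a message whose handler threw. */
    static int deliver(int maxDeliveries, int maxAttempts) {
        int[] invocations = { 0 };
        for (int delivery = 1; delivery <= maxDeliveries; delivery++) {
            try {
                retryable(maxAttempts, () -> {
                    invocations[0]++;
                    throw new IllegalStateException("REST call failed");
                });
                break;                  // handled successfully, no redelivery
            } catch (RuntimeException e) {
                // the container would requeue the message here
            }
        }
        return invocations[0];
    }

    public static void main(String[] args) {
        System.out.println(deliver(3, 3)); // prints "9"
        System.out.println(deliver(1, 3)); // prints "3"
    }
}
```

A recoverer swallows the final exception, so the container sees a successful delivery and the outer loop runs only once.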

XBee - XBee-API and multiple endpoints

Using Andrew Rapp's XBee-API, how can I sample I/O data via a coordinator from more than two endpoints?
I have 17 Series 1 XBees. I have programmed one to be a coordinator (API mode = 2) and the rest to be endpoints. Using XBee-API I am sending a Force I/O Sample ("IS") remote AT command, unicast to each endpoint. This works perfectly well when there are up to two endpoints, but as soon as a third is added, one of the three always becomes non-responsive (times out with XBeeTimeoutException). It's not always the same physical unit that stops responding, but it is always the third one (for example, if I send Force I/O Sample to Device1, Device2, and Device3, Device3 will time out, and if I change the order to Device3, Device1, Device2, Device2 will time out).
If I set up more than three XBees, about 1 out of 3 will time out - but not every third one.
I've verified that the XBees themselves are fine. I've searched the Internet and Stack Overflow in particular to no avail. I've tried using a simple ZNetRemoteAtRequest. I've tried opening and closing the XBee coordinator serial connection once for all three devices, once per device, and once per program run. I've tried varying the distance between the coordinator and endpoints (never more than five feet apart). I've tried different coordinator configuration parameters (from the Digi documentation). I've tried changing out the XBee for the coordinator.
This is the code I'm using to send the Force I/O Sample request to each endpoint and read the response:
xbee = new XBee(); // Coordinator
xbee.open("/dev/ttyUSB0", 115200); // Happens before any of the endpoints are contacted
... // Loop through known endpoint addresses
    XBeeRequest request = new ZBForceSampleRequest(new XBeeAddress64(endpointAddress));
    ZNetRemoteAtResponse response = null;
    response = (ZNetRemoteAtResponse) xbee.sendSynchronous(request, remoteXBeeTimeout);
    if (response.isOk()) {
        // Process response payload
    }
... // End loop and finally close coordinator connection
What might help polling I/O samples from more than two endpoints?
EDIT: I found that Andrew Rapp's XBee-API library fakes multithreaded behavior, which causes the synchronization issues described in this question. I wrote a replacement library that is actually multithreaded and correctly maps responses from multiple XBee endpoints: https://github.com/steveperkins/xbee-api-for-java-1-4. When I wrote it Java 1.4 was necessary for use on the BeagleBone, Plug, and Zotac single-board PCs but it's an easy conversion to 1.7+.
Are you using hardware flow control on your serial port? Is it possible that you're sending requests out when the local XBee has deasserted CTS (e.g., asking you to stop sending)? I assume you're running at 115200 bps, so the XBee serial port can keep up with the network data rate.
Can you turn on debugging information, or connect some port monitoring hardware/software to display the data going over the serial port to the local XBee?

Failed TWI transaction after sleep on Xmega

We've had some trouble with TWI/I2C after waking up from sleep on the Atmel Xmega256A3. Instead of digging into the details of TWI/I2C, we decided to use the twi_master_driver that Atmel supplies with the AVR1308 application note.
The problem is one or a few failed TWI transactions just after waking up from sleep. On the I2C-bus connected to the XMega we have a few potentiometers, a thermometer and an RTC. The XMega acts as the only master on the bus.
We use the sleep functions found in AVRLIBC:
{code for turning off VCC to all I2C connected devices}
set_sleep_mode(SLEEP_MODE_PWR_DOWN);
sleep_enable();
sleep_cpu();
{code for turning on VCC to all I2C connected devices}
The XMega is woken from sleep by the RTC, which sets a pin high. After the XMega wakes up, we want to set a value on one of the potentiometers, but this fails. For some reason, the result of the first TWI transaction is TWIM_RESULT_NACK_RECEIVED instead of TWIM_RESULT_OK. After that everything seems to work again.
Have we missed anything here? Are there any known issues with the XMega, sleep and TWI? Do we need to reset the TWI or clear any flags after waking from sleep?
Best regards
Fredrik
There is a common problem on I2C/TWI where a slave's internal state machine gets stuck in an intermediate state if a transaction is not completed fully. The slave then does not respond correctly when addressed on the next transaction. This commonly happens when the master is reset or stops driving the clock (SCL) signal partway through a read or write. A solution is to toggle the SCL line manually 8 or 9 times before starting any data transactions, so that the internal state machines in the slaves are all reset to the start-of-transfer point and are all looking for their address byte.

Resources