How does the circuit breaker behave in the HALF_OPEN state? (resilience4j)

Sometimes I see a CallNotPermittedException with a message saying the circuit breaker is in the HALF_OPEN state.
But I don't understand how it works in that state.
I've written a test against a mock server where I have permittedNumberOfCallsInHalfOpenState=2.
Then I enqueue 3 responses, each with a 3-second delay, and make the calls; the next call fails with a CallNotPermittedException and the HALF_OPEN message.
But if I wait the 3 seconds (enough for the calls to finish) and make the next call, the circuit breaker is now in the CLOSED state.
How does the transition from HALF_OPEN to another state work? Does it wait for some period of time, or just for the permittedNumberOfCallsInHalfOpenState calls to finish?
And then why do I have to make 3 calls and not 2?
I'm using version 1.5

The CircuitBreaker rejects calls with a CallNotPermittedException when it is OPEN. After a wait time duration has elapsed, the CircuitBreaker state changes from OPEN to HALF_OPEN and permits a configurable number of calls to see if the backend is still unavailable or has become available again. Further calls are rejected with a CallNotPermittedException, until all permitted calls have completed.
If the failure rate or slow call rate is then greater than or equal to the configured threshold, the state changes back to OPEN. If both the failure rate and the slow call rate are below the threshold, the state changes back to CLOSED.
That means if you have 3 concurrent calls in the HALF_OPEN state, two are permitted and one is rejected.
But if 2 calls are successful before the third call is executed, the CircuitBreaker transitions to CLOSED and does permit the third call.
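For reference, here is a minimal sketch of a breaker configured as in the test above (resilience4j 1.5 API; the wait duration value is an assumption, since the question does not state it):

import java.time.Duration;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

public class HalfOpenDemo {
    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                // only 2 trial calls are permitted while HALF_OPEN; a 3rd concurrent
                // call is rejected with CallNotPermittedException
                .permittedNumberOfCallsInHalfOpenState(2)
                // how long the breaker stays OPEN before moving to HALF_OPEN (assumed value)
                .waitDurationInOpenState(Duration.ofSeconds(5))
                .build();

        CircuitBreaker breaker = CircuitBreakerRegistry.of(config).circuitBreaker("backend");
        // once the 2 permitted calls have completed, the breaker is CLOSED or OPEN again
        System.out.println(breaker.getState());
    }
}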

Related

Proper way to trigger a flow retry?

Consider this flow:
It's a simple flow that authenticates to an HTTP API and handles success/failure. On the failure path, you can see I added a ControlRate processor, and there are 2 FlowFiles in the queue for it. It is set to pass only one FlowFile every 30 seconds (Time Duration = 30 sec, Maximum Rate = 1), so the queue will continue to fill as long as the authentication process keeps failing.
What I want is essentially to drop all but the first FlowFile in this queue, because I don't want it to keep re-triggering the authentication processor after we get a successful authentication.
I believe I can accomplish this by setting the FlowFile Expiration (on the highlighted queue) to be just longer than the 30-second Time Duration of the ControlRate processor. But this seems a bit arbitrary and not quite correct to me.
Is there a way to say "take first, drop rest" for the highlighted queue?

Kafka consumer not committing offset correctly

I had a Kafka consumer defined with the following properties:
session.timeout.ms = 60000
heartbeat.interval.ms = 6000
We noticed a lag of ~2000 messages and saw (via our app logs) that the same message was being consumed multiple times by the consumer. We also noticed that some messages were taking ~10 seconds to be completely processed. Our suspicion was that the consumer was not committing the offset properly (or was repeatedly committing the same old offset), which is why the same message was being picked up again.
To fix this, we introduced a few more properties:
auto.commit.interval.ms=20000 // to ensure the commit happens only after message processing is complete
max.poll.records=10 // to make the consumer pick up only 10 messages in one go
And we set the concurrency to 1.
This fixed our issue. The lag started to reduce and ultimately came to 0.
But I am still unclear about why the problem occurred in the first place.
As I understand it, by default:
enable.auto.commit = true
auto.commit.interval.ms=5000
So ideally the consumer should have been committing every 5 seconds. If a message was not completely processed within this timeframe, what happens? What offset is committed by the consumer? Did the problem occur due to the large poll record size (500 by default)?
Also, about the poll() method, I read that:
The poll() call is issued in the background at the set auto.commit.interval.ms.
So if poll() was originally taking place every 5 seconds (the default auto.commit.interval), why was it not committing the latest offset? Because the consumer was still not done processing it? Then it should have committed that offset at the next 5-second interval.
Can someone please answer these queries and explain why the original problem occurred?
If you are using Spring for Apache Kafka, we recommend setting enable.auto.commit to false so that the container commits the offsets in a more deterministic fashion (either after each record or after each batch of records; the latter is the default).
Most likely, the problem was max.poll.interval.ms, which is 5 minutes by default. If your batch of messages takes longer than this to process, you would see exactly that behavior. You can either increase max.poll.interval.ms or, as you have done, reduce max.poll.records.
The key is that you MUST process the records returned by the poll in less than max.poll.interval.ms.
Also, about the poll() method, I read that :
The poll() call is issued in the background at the set auto.commit.interval.ms.
That is incorrect; poll() is NOT called in the background; heartbeats are sent in the background since KIP-62.
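For illustration, a minimal sketch of a plain consumer with the settings discussed above (the bootstrap server, group id, and topic name are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SlowConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // note: with auto-commit, offsets are committed inside poll(), not by a background thread
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "20000");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "10");         // fewer records per poll
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000"); // default: 5 minutes

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder
            while (true) {
                // everything returned by this poll MUST be processed in less than
                // max.poll.interval.ms, or the group coordinator assumes the consumer
                // died and triggers a rebalance (causing redelivery of the records)
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // application-specific processing; if this is slow, keep max.poll.records small
    }
}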

How to alert on Slack about an HTTP request if it takes more than a specific time to complete

I have an HTTP request: https://server:port//get_somthing?x=10.
It is expected to respond within 1 second, but I notice in New Relic that it sometimes takes 1.5 or 2 seconds. I would like to look into the logs and investigate whenever it takes more than 1 second, so I want to set up alerts on a Slack channel whenever the request takes longer than the prescribed time.
How can I achieve this? I am using New Relic's Java agent.
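One possible approach (a sketch, not a definitive answer): time the request in the application, record a custom metric via the agent API when it exceeds the threshold, and then attach a New Relic alert condition with a Slack notification channel to that metric. The metric name and helper method below are assumptions:

import com.newrelic.api.agent.NewRelic;

public class SlowRequestReporter {
    // hypothetical helper: call with the measured duration of the /get_somthing request
    public static void reportIfSlow(long elapsedMs) {
        if (elapsedMs > 1000) {
            // custom metric names must start with "Custom/"; an alert condition
            // on this metric can then notify a Slack channel
            NewRelic.recordMetric("Custom/GetSomething/SlowResponseMs", elapsedMs);
        }
    }
}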

Azure Web Api - Waiting Sql Connection every 4 minutes and 30 minutes

Within a request on an ApiController, I'm tracking the duration of awaiting the Sql Connection to open.
await t.TrackDependencyAsync(async() => { await sqlConnection.OpenAsync(); return true; }, "WaitingSqlConnection");
If my request is not called for at least 5 minutes, then any new call will see a huge OpenAsync duration (c. 3 s) instead of an immediate one.
I'd like to understand the reason to eradicate that crazy slowness.
UPDATE
I created an endpoint that just opens the SqlConnection. If I wait more than 5 minutes, then call that OpenConnection endpoint, then call any other request, the OpenConnection call incurs the waiting cost mentioned above but the subsequent request does not.
Hence, I scheduled a job on Azure to run every minute and call the OpenConnection endpoint. However, when I make requests from my HTTP client, I still incur the waiting time. As if opening the SqlConnection were somehow linked to the HTTP client IP...
Also, that 5-minute window is typical of a DNS TTL... However, 3 s for a DNS lookup of the database endpoint is too long. It can't be that.
UPDATE 2
The time observed at the HTTP client level seems to be the result of both awaiting the connection and some other latencies (DNS lookup?).
Here is a table summarizing what I observe:
UPDATE 3
The difference between rows 3 and 4 of my table is the time spent in TCP/IP Connect and HTTPS Handshake, according to Fiddler. Let's not focus on that in this post, but only on the time spent waiting for the SqlConnection to open.
UPDATE 4
Actually I think both waiting times have the same cause:
the server needs to "keep alive" its connection to the database, and the client needs to "keep alive" its connection to the server.
UPDATE 5
I had a job running every 4 minutes to open the SqlConnection, but once in a while it was still incurring the waiting cost. So I think the inactivity window is 4 minutes, not 5 (hence I updated the post title).
So I updated my scheduled job to run every minute. Then I realised it was still incurring the waiting cost, but regularly, every 30 minutes (hence I updated the post title again).
These two times correlate strangely with those of the Azure Load Balancer idle timeout.
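The application here is .NET, but the scheduled keep-alive idea the updates describe looks roughly like this (a Java sketch for illustration only; the warm-up URL is a placeholder). As UPDATE 5 notes, such a pinger did not fully eliminate the cost:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class KeepAlivePinger {
    public static void main(String[] args) {
        HttpClient http = HttpClient.newHttpClient();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // hit the warm-up endpoint every minute, well under the ~4-minute idle timeout
        scheduler.scheduleAtFixedRate(() -> {
            try {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("https://myapp.example.net/OpenConnection")) // placeholder
                        .GET()
                        .build();
                http.send(request, HttpResponse.BodyHandlers.discarding());
            } catch (Exception e) {
                // swallow and retry on the next tick so the schedule keeps running
                e.printStackTrace();
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}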

Read and Write timeouts behavior

What is the behavior of read and write timeouts in OkHttp?
Is the timeout exception triggered when the whole request exceeds the timeout duration, or when the socket doesn't receive (read) or send (write) any data for that duration?
I think it is the second behavior, but could someone clarify this?
Thanks in advance.
The timeouts are triggered when you block for too long. On read that occurs if the server doesn't send you response data. On write it occurs if the server doesn't read the request you sent. Or if the network makes it seem like that's what's happening!
Timeouts are continuous: if the timeout is 3 seconds and the response is 5 bytes, an extreme case might take 15 seconds and still succeed, as long as the server sends something every 3 seconds. In other words, the timeout is reset after every successful I/O.
Okio’s Timeout class also offers a deadline abstraction that is concerned with the total time spent.
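For illustration, a minimal sketch of configuring both kinds of limits (OkHttp 3.12+ API; callTimeout is the per-call total-time analogue of Okio's deadline, and the values are arbitrary):

import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;

public class TimeoutDemo {
    static final OkHttpClient CLIENT = new OkHttpClient.Builder()
            // per-read inactivity limit: fires if no response bytes arrive for 3 seconds
            .readTimeout(3, TimeUnit.SECONDS)
            // per-write inactivity limit: fires if request bytes cannot be sent for 3 seconds
            .writeTimeout(3, TimeUnit.SECONDS)
            // overall deadline for the whole call (DNS, connect, write, read)
            .callTimeout(15, TimeUnit.SECONDS)
            .build();
}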
