Hystrix Configuration - short-circuiting

I am trying to implement hystrix for my application using hystrix-javanica.
I have configured hystrix-configuration.properties as below:
hystrix.command.default.execution.isolation.strategy=SEMAPHORE
hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds=10000
hystrix.command.default.fallback.enabled=true
hystrix.command.default.circuitBreaker.enabled=true
hystrix.command.default.circuitBreaker.requestVolumeThreshold=3
hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds=50000
hystrix.command.default.circuitBreaker.errorThresholdPercentage=50
The short-circuit pattern is working fine, but I have a doubt about hystrix.command.default.circuitBreaker.requestVolumeThreshold=3.
Is it stating: open the circuit after 3 failures,
or
open the circuit after 3 concurrent failures?
I have gone through the documentation link.
Can anybody answer?

How Hystrix Circuit-Breaker operates: Hystrix does not offer a circuit breaker which breaks after a given number of failures. The Hystrix circuit will break if:
within a timespan of duration metrics.rollingStats.timeInMilliseconds, the percentage of actions resulting in a handled exception exceeds errorThresholdPercentage, provided also that the number of actions through the circuit in the timespan is at least requestVolumeThreshold
What is requestVolumeThreshold?
requestVolumeThreshold is a minimum threshold for the volume (number) of calls through the circuit that must be met (within the rolling window), before the circuit calculates a percentage failure rate at all. Only when this minimum volume (in each time window) has been met, will the circuit compare the failure proportion of your calls against the errorThresholdPercentage you have configured.
Imagine there was no such minimum-volume-through-the-circuit threshold. Imagine the first call in a time window errors. You would have 1 of 1 calls being an error, a 100% failure rate, which is higher than the 50% threshold you have set. So the circuit would break immediately.
The requestVolumeThreshold exists so that this does not happen. It's effectively saying, the error rate through your circuit isn't statistically significant (and won't be compared against errorThresholdPercentage) until at least requestVolumeThreshold calls have been received in each time window.

I am rather new to hystrix but I guess I can help you.
In general, hystrix.command.default.circuitBreaker.requestVolumeThreshold is a property that sets the minimum number of requests in a rolling window that will trip the circuit. Its default value is 20, and its value can be changed in the properties file or in our @HystrixCommand annotated method.
For example, if that property value is 20, then if only 19 requests are received in the rolling window (say a window of 10 seconds) the circuit will not trip open even if all 19 failed. If the failed request value reaches 20, then the circuit will be opened and the corresponding calls will be sent to fallback even if the call succeeds, till the sleeping window time period completes.
The sleeping window time period sets the amount of time, after tripping the circuit, to reject requests before allowing attempts again to determine if the circuit should again be closed. Its value defaults to 5000 milliseconds. This can be changed by overriding the circuitBreaker.sleepWindowInMilliseconds property.
You can find all the properties and their descriptions here.
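To make the rolling-window rule concrete, here is an added simulation (not Hystrix's own code) of the evaluation the answers describe, driven with the asker's property values; the class name, the fake millisecond clock and the sample call sequence are constructed for this example.

```python
"""Simulation of the Hystrix circuit-opening rule explained in the answers above.
Added example, not Hystrix's own code.  Within one rolling window the circuit opens
only when the call volume is at least requestVolumeThreshold AND the error percentage
is at least errorThresholdPercentage; while open, calls are short-circuited for
sleepWindowInMilliseconds, then one trial call is allowed through.
"""
from __future__ import annotations
from collections import deque


class HystrixLikeBreaker:
    def __init__(self, request_volume_threshold=20, error_threshold_percentage=50,
                 sleep_window_ms=5000, rolling_window_ms=10000):   # Hystrix defaults
        self.request_volume_threshold = request_volume_threshold
        self.error_threshold_percentage = error_threshold_percentage
        self.sleep_window_ms = sleep_window_ms
        self.rolling_window_ms = rolling_window_ms
        self._events = deque()          # (timestamp_ms, failed) for executed calls
        self._opened_at_ms = None       # None means the circuit is closed
        self._trial_in_flight = False   # half-open: the single probe call is out

    @property
    def is_open(self) -> bool:
        return self._opened_at_ms is not None

    def window_stats(self, now_ms: int) -> tuple[int, int]:
        """(volume, error percentage) for the rolling window ending at now_ms."""
        while self._events and self._events[0][0] <= now_ms - self.rolling_window_ms:
            self._events.popleft()
        total = len(self._events)
        failed = sum(1 for _, f in self._events if f)
        return total, (100 * failed // total if total else 0)

    def allow_request(self, now_ms: int) -> bool:
        """Closed: always allow.  Open: reject until the sleep window has elapsed,
        then let exactly one trial request through."""
        if not self.is_open:
            return True
        if now_ms - self._opened_at_ms >= self.sleep_window_ms and not self._trial_in_flight:
            self._trial_in_flight = True
            return True
        return False

    def record(self, now_ms: int, failed: bool) -> None:
        """Record the outcome of a call that allow_request() let through."""
        if self._trial_in_flight:               # outcome of the single trial call
            self._trial_in_flight = False
            if failed:
                self._opened_at_ms = now_ms     # stay open, restart the sleep window
            else:
                self._opened_at_ms = None       # close and reset the window metrics
                self._events.clear()
            return
        self._events.append((now_ms, failed))
        total, pct = self.window_stats(now_ms)
        # Both conditions must hold: with fewer than requestVolumeThreshold calls in
        # the window the error percentage is not compared at all.
        if total >= self.request_volume_threshold and pct >= self.error_threshold_percentage:
            self._opened_at_ms = now_ms


def scenario_question_config() -> dict:
    """The asker's properties (requestVolumeThreshold=3, errorThresholdPercentage=50,
    sleepWindowInMilliseconds=50000) with plain sequential, non-concurrent calls."""
    cb = HystrixLikeBreaker(3, 50, sleep_window_ms=50000)
    out = {}
    cb.record(0, failed=True)
    cb.record(100, failed=False)
    out["open_after_1_fail_of_2"] = cb.is_open       # volume 2 < 3: not evaluated
    cb.record(200, failed=False)
    out["open_after_1_fail_of_3"] = cb.is_open       # 33% < 50%: stays closed
    cb.record(300, failed=True)
    out["open_after_2_fail_of_4"] = cb.is_open       # 50% >= 50%: opens
    out["rejected_in_sleep_window"] = not cb.allow_request(300 + 49_999)
    out["trial_allowed_after_sleep"] = cb.allow_request(300 + 50_000)
    cb.record(300 + 50_000, failed=False)
    out["closed_after_trial_success"] = not cb.is_open
    return out
```

The three calls only have to fall inside the same rolling window; nothing about them needs to be concurrent.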

Related

what is the difference between Circuit Breaker and Retry in spring boot microservice?

One of my colleagues asked me this question: what is the difference between Circuit Breaker and Retry? I was not able to answer him correctly. All I know is that a circuit breaker is useful if there is a heavy request payload, but this can be achieved using retry. Then when to use Circuit Breaker and when to Retry?
Also, is it possible to use both on the same API?
The Retry pattern enables an application to retry an operation in hopes of success.
The Circuit Breaker pattern prevents an application from performing an operation that is likely to fail.
Retry - Retry pattern is useful in scenarios of transient failures. What does this mean? Failures that are "temporary", lasting only for a short amount of time are transient. A momentary loss of network connectivity, a brief moment when the service goes down or is unresponsive and related timeouts are examples of transient failures.
As the failure is transient, retrying after some time could possibly give us the result needed.
Circuit Breaker - Circuit Breaker pattern is useful in scenarios of long lasting faults. Consider a loss of connectivity or the failure of a service that takes some time to repair itself. In such cases, it may not be of much use to keep retrying often if it is indeed going to take a while to hear back from the server. The Circuit Breaker pattern wants to prevent an application from performing an operation that is likely to fail.
The Circuit Breaker keeps a tab on the number of recent failures, and on the basis of a pre-determined threshold, determines whether the request should be sent to the server under stress or not.
Several years ago I wrote a resilience catalog to describe different mechanisms. Originally I've created this document for co-workers and then I shared it publicly. Please allow me to quote here the relevant parts.
Retry
Categories: reactive, after the fact
The relation between retries and attempts: n retries means at most n+1 attempts. The +1 is the initial request, if it fails (for whatever reason) then retry logic kicks in. In other words, the 0th step is executed with 0 delay penalty.
There are situations where your requested operation relies on a resource which might not be reachable at a certain point in time. In other words, there can be a temporal issue which will be gone sooner or later. This sort of issue can cause transient failures. With retries you can overcome these problems by attempting to redo the same operation at a specific moment in the future. To be able to use this mechanism, the following criteria should be met:
The potentially introduced observable impact is acceptable
The operation can be redone without any irreversible side effect
The introduced complexity is negligible compared to the promised reliability
Let’s review them one by one:
The word failure indicates that the effect is observable by the requester as well, for example via higher latency / reduced throughput / etc. If the "penalty" (delay or reduced performance) is unacceptable then retry is not an option for you.
This requirement is also known as an idempotent operation. If I call the action with the same input several times then it will produce the exact same result. In other words, the operation acts like it only depends on its parameter and nothing else influences the result (like other objects' state).
Even though this condition is one of the most crucial, it is the one that is almost always forgotten. As always there are trade-offs (if I introduce Z then it will increase X but it might decrease Y).
We should be fully aware of them otherwise it will give us some unwanted surprises in the least expected time.
Circuit Breaker
Categories: proactive, before the fact
It is hard to categorize the circuit breaker because it is pro- and reactive at the same time. It detects that a given downstream system is malfunctioning (reactive) and it protects the downstream systems from being flooded with new requests (proactive).
This is one of the most complex patterns mainly because it uses different states to define different behaviours. Before we jump into the details, let's see why this tool exists at all:
Circuit breaker detects failures and prevents the application from trying to perform the action that is doomed to fail (until it is safe to retry) - Wikipedia
So, this tool works as a mini data and control plane. The requests go through this proxy, which examines the responses (if any) and it counts subsequent failures. If a predefined threshold is reached then the transfer is suspended temporarily and it fails immediately.
Why is it useful?
It prevents cascading failures. In other words the transient failure of a downstream system should not be propagated to the upstream systems. By concealing the failure we are actually preventing a chain reaction (domino effect) as well.
How does it know when a transient failure is gone?
It must somehow determine when it would be safe to operate again as a proxy. For example it can use the same detection mechanism that was used during the original failure detection. So, it works like this: after a given period of time it allows a single request to go through and it examines the response. If it succeeds then the downstream is treated as healthy. Otherwise nothing changes (no request is transferred through this proxy), only the timer is reset.
What states does it use?
The circuit breaker can be in any of the following states: Closed, Open, HalfOpen.
Closed: It allows any request. It counts successive failed requests.
If the successive failed count is below the threshold and the next request succeeds then the counter is set back to 0.
If the predefined threshold is reached then it transitions into Open
Open: It rejects any request immediately. It waits a predefined amount of time.
If that time has elapsed then it transitions into HalfOpen
HalfOpen: It allows only one request. It examines the response of that request:
If the response indicates success then it transitions into Closed
If the response indicates failure then it transitions back to Open
Resiliency strategy
The above two mechanisms / policies are not mutually exclusive; on the contrary, they can be combined via the escalation mechanism. If the inner policy can't handle the problem it can propagate one level up to an outer policy.
When you try to perform a request while the Circuit Breaker is Open then it will throw an exception. Your retry policy could trigger for that and adjust its sleep duration (to avoid unnecessary attempts).
The downstream system can also inform upstream that it is receiving too many requests with a 429 status code. The Circuit Breaker could also trigger for this and use the Retry-After header's value for its sleep duration.
So, the whole point of this section is that you can define a protocol between client and server for how to overcome transient failures together.
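The state machine and the escalation between the two policies described in the catalog, as an added example (not the catalog author's code or any library's): a Closed/Open/HalfOpen breaker counting successive failures, wrapped by a retry policy that sleeps for the breaker's remaining wait when it is Open, and a 429-style signal that opens the breaker for exactly the Retry-After value. The exception classes, the injected clock and the failure sequence in demo() are assumptions made for the example.

```python
"""Circuit Breaker with Retry escalation, as in the resilience catalog quoted above.
Added example, not the catalog author's code or any library's.  States follow the
catalog (Closed counts successive failures, Open rejects for a wait period, HalfOpen
lets one trial through).  A Retry policy wraps the breaker: when the breaker is
Open, the retry sleeps for the breaker's remaining wait; a 429 with Retry-After
opens the breaker for exactly that long.  The clock is injected for determinism.
"""


class BreakerOpen(Exception):
    def __init__(self, retry_in):
        super().__init__(f"circuit open, retry in {retry_in}s")
        self.retry_in = retry_in


class TooManyRequests(Exception):   # stands in for HTTP 429 + Retry-After header
    def __init__(self, retry_after):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after


class CircuitBreaker:
    def __init__(self, failure_threshold, open_duration, clock):
        self.failure_threshold, self.open_duration, self.clock = failure_threshold, open_duration, clock
        self.state, self.failures, self.open_until = "Closed", 0, 0.0

    def call(self, operation):
        if self.state == "Open":
            if self.clock() < self.open_until:
                raise BreakerOpen(self.open_until - self.clock())   # reject immediately
            self.state = "HalfOpen"                                   # wait elapsed: one trial
        try:
            result = operation()
        except TooManyRequests as exc:      # downstream says how long to back off
            self._trip(exc.retry_after)
            raise
        except Exception:
            if self.state == "HalfOpen" or self.failures + 1 >= self.failure_threshold:
                self._trip(self.open_duration)  # HalfOpen failure goes back to Open
            else:
                self.failures += 1              # Closed: count successive failures
            raise
        self.state, self.failures = "Closed", 0  # success resets the counter
        return result

    def _trip(self, duration):
        self.state, self.failures = "Open", 0
        self.open_until = self.clock() + duration


def retry(operation, retries, delay, sleep):
    """n retries means at most n+1 attempts; the first attempt has no delay.
    A BreakerOpen from the inner policy escalates here and sets the sleep."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation(), attempts
        except BreakerOpen as exc:
            if attempts > retries:
                raise
            sleep(exc.retry_in)   # retrying sooner than the breaker allows is pointless
        except Exception:
            if attempts > retries:
                raise
            sleep(delay)


def demo():   # constructed scenario: the downstream fails three times, then recovers
    now = [0.0]
    sleep = lambda s: now.__setitem__(0, now[0] + s)   # sleeping advances the fake clock
    breaker = CircuitBreaker(failure_threshold=2, open_duration=10.0, clock=lambda: now[0])
    calls = [0]
    def downstream():
        calls[0] += 1
        if calls[0] <= 3:
            raise Exception("boom")
        return "ok"
    result, attempts = retry(lambda: breaker.call(downstream), retries=5, delay=1.0, sleep=sleep)
    return {"result": result, "attempts": attempts, "elapsed": now[0], "state": breaker.state}
```

In demo() the two failures trip the breaker, the third attempt is rejected by the Open breaker and the retry sleeps out the remaining 9 s instead of its own 1 s, the HalfOpen trial fails and re-opens it, and the sixth (last permitted) attempt succeeds at t=21.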

Resilience4j behavior for permittedNumberOfCalls InHalfOpenState

We are using resilience4j with circuit breaker and following configuration
slidingWindowSize: 60
slidingWindowType: TIME_BASED
minimumNumberOfCalls: 100
waitDurationInOpenState: 5s
failureRateThreshold: 50
permittedNumberOfCallsInHalfOpenState: 6000
By this GitHub issue answer, it says that the circuit breaker will allow permittedNumberOfCallsInHalfOpenState calls in the HALF_OPEN state and then calculate the failure threshold. But our sliding window size is 5s.
In order to change state, does the circuit breaker wait till all 6000 calls are completed irrespective of the sliding window size, or will it calculate within the next sliding window?
For example, if we are allowing only 300 calls (using a rate limiter) per slidingWindowSize, which is 60s, then if the circuit breaker waits for all 6000 calls to complete before deciding the state, it must wait for the next 20 minutes. But if the circuit breaker gives preference to the sliding window, then it must decide the state in the next 60s.
What is the behaviour in this case?

Why is random jitter applied to back-off strategies?

Here is some sample code I've seen.
java.util.Random random = new java.util.Random();          // shared RNG (assumed; not shown in the snippet)
int expBackoff = (int) Math.pow(2, retryCount);             // base delay doubles with each retry
int maxJitter = (int) Math.ceil(expBackoff * 0.2);          // jitter bounded to 20% of the base
int finalBackoff = expBackoff + random.nextInt(maxJitter);  // random extra in [0, maxJitter)
I was wondering what's the advantage of using a random jitter here?
Suppose you have multiple clients that send messages that collide. They all decide to back off. If they use the same deterministic algorithm to decide how long to wait, they will all retry at the same time -- resulting in another collision. Adding a random factor separates the retries.
It smooths traffic on the resource being requested.
If your request fails at a particular time, there's a good chance other requests are failing at almost exactly the same time. If all of these requests follow the same deterministic back-off strategy (say, retrying after 1, 2, 4, 8, 16... seconds), then everyone who failed the first time will retry at almost exactly the same time, and there's a good chance there will be more simultaneous requests than the service can handle, resulting in more failures. This same cluster of simultaneous requests can recur repeatedly, and likely fail repeatedly, even if the overall level of load on the service outside of those retry spikes is small.
By introducing jitter, the initial group of failing requests may be clustered in a very small window, say 100ms, but with each retry cycle, the cluster of requests spreads into a larger and larger time window, reducing the size of the spike at a given time. The service is likely to be able to handle the requests when spread over a sufficiently large window.
Randomization prevents the retries from several calls from happening at the same time.
More information on Exponential Backoff And Jitter can be found here: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
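An added simulation (not from the question or the linked article) that runs the snippet's formula against many clients that all failed at the same moment, alongside a deterministic back-off and the "full jitter" variant described in the linked AWS article; the client count, seed and one-second bucketing are constructed for the example.

```python
"""Why jitter spreads retries: a simulation around the sample code above.
Added example, not from the question or any linked post.  Many clients fail at the
same instant and back off with (a) the deterministic 2**retryCount, (b) the
question's formula with up to 20% random jitter, and (c) the "full jitter" variant
from the linked AWS article (uniform between 0 and the exponential base).  For each
retry round we count how many clients land in the same one-second bucket; the
busiest bucket is the size of the retry spike hitting the service.
"""
from __future__ import annotations
import math
import random
from collections import Counter


def no_jitter(retry_count: int, _rng: random.Random) -> int:
    return 2 ** retry_count


def partial_jitter(retry_count: int, rng: random.Random) -> int:
    """The questioner's snippet: exponential base plus a random offset of up to 20%."""
    exp_backoff = 2 ** retry_count
    max_jitter = math.ceil(exp_backoff * 0.2)
    return exp_backoff + rng.randrange(max_jitter)    # nextInt(maxJitter): [0, maxJitter)


def full_jitter(retry_count: int, rng: random.Random) -> int:
    """AWS "full jitter": sleep uniformly anywhere in [0, 2**retryCount]."""
    return rng.randint(0, 2 ** retry_count)


def simulate(strategy, clients: int = 1000, retries: int = 6, seed: int = 7) -> list[int]:
    """Per retry round, the largest number of clients whose next attempt falls in
    the same one-second bucket.  All clients fail together at t=0 and we assume
    every retry fails too, so each round adds the strategy's delay to each client."""
    rng = random.Random(seed)
    next_attempt = [0] * clients
    spikes = []
    for retry_count in range(retries):
        for c in range(clients):
            next_attempt[c] += strategy(retry_count, rng)
        spikes.append(max(Counter(next_attempt).values()))
    return spikes
```

Note that with the snippet's 20% bound, ceil(0.2 * base) is 1 for bases below 8, so the first three rounds get no spread at all; full jitter spreads from the first retry.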

JMeter JDBC database testing - Max Wait (ms)

What is the best practice for Max Wait (ms) value in JDBC Connection Configuration?
JDBC
I am executing 2 types of tests:
20 loops for each number of threads - to get max Throughput
30 min runtime for each number of Threads - to get Response time
With Max Wait = 10000ms I can execute JDBC request with 10,20,30,40,60 and 80 Threads without an error. With Max Wait = 20000ms I can go higher and execute with 100, 120, 140 Threads without an error. It seems to be logical behaviour.
Now question.
Can I increase the Max Wait value as desired? Is it a correct way to get more test results?
Should I stop testing and not increase the number of Threads if any error occurs in some Report? I got e.g. 0.06% errors from 10000 samples. Is this a stop for my testing?
Thanks.
Everything depends on what your requirements are and how you defined performance baseline.
Can I increase Max Wait value as desired? Is it correct way how to get more test results?
If you are OK with higher response times and the functionality should be working, then you can keep max time as much as you want. But, practically, there will be a threshold for response times (like 2 seconds to perform a login transaction), which you define as part of your performance SLA or performance baseline. So, though you are making your requests successful by increasing max time, eventually it is considered a failed request due to high response time (by crossing threshold values).
Note: Higher response times for DB operations eventually result in higher response times for web applications (or end users).
Should I stop testing and do not increase number of Threads if any error occur in some Report?
The same applies to error rates as well. If the SLA says some % error rate is agreed, then you can consider that the test is meeting the SLA or performance baseline if the actual error rate is less than that. E.g.: if the requirements say 0% error rate, then 0.1% is also considered failed.
Is this stop for my testing?
You can stop the test at whatever point you want. It is completely based on what metrics you want to capture. From my knowledge, it is suggested to continue the test till it reaches a point where there is no point in continuing, like the error rate reaching 99% etc. If you are getting an error rate of 0.6%, then I suggest continuing with the test, to know the breaking point of the system, like server crash, response times reaching unacceptable values, memory issues etc.
Following are some good references:
https://www.nngroup.com/articles/response-times-3-important-limits/
http://calendar.perfplanet.com/2011/how-response-times-impact-business/
difference between baseline and benchmark in performance of an application
https://msdn.microsoft.com/en-us/library/ms190943.aspx
https://msdn.microsoft.com/en-us/library/bb924375.aspx
http://searchitchannel.techtarget.com/definition/service-level-agreement
This setting maps to DBCP -> BasicDataSource -> maxWaitMillis parameter, according to the documentation:
The maximum number of milliseconds that the pool will wait (when there are no available connections) for a connection to be returned before throwing an exception, or -1 to wait indefinitely
It should match the relevant setting of your application database configuration. If your goal is to determine the maximum performance - just put -1 there and the timeout will be disabled.
In regards to "Is this stop for my testing?" - it depends on multiple factors like what the application is doing, what you are trying to achieve and what type of testing is being conducted. If you test a database which orchestrates nuclear plant operation, then a zero error threshold is the only acceptable one. And if this is a picture gallery of cats, this error level can be considered acceptable.
In the majority of cases performance testing is divided into several test executions like:
Load Testing - putting the system under anticipated load to see if it is capable of handling the forecasted amount of users
Soak Testing - basically the same as Load Testing but keeping the load for a prolonged duration. This allows detecting e.g. memory leaks
Stress Testing - determining boundaries of the application, saturation points, bottlenecks, etc. Starting from zero load and gradually increasing it until it breaks, noting the maximum amount of users, the correlation of other metrics like Response Time, Throughput, Error Rate, etc. with the increasing amount of users, checking whether the application recovers when load gets back to normal, etc.
See the Why 'Normal' Load Testing Isn't Enough article for the above testing types described in detail.

Howto take latency differences into consideration when verifying location differences with timestamps (anti-cheating)?

When you have a multiplayer game where the server is receiving movement (location) information from the client, you want to verify this information as an anti-cheating measure.
This can be done like this:
maxPlayerSpeed = 300; // = 300 pixels every 1 second
// multiply before dividing so integer division cannot truncate the speed to 0,
// and use the absolute distance so movement in either direction is checked
if ((abs(newPosX - oldPosX) * 1000) / (getTime() - oldTimestamp) > maxPlayerSpeed)
{
    disconnect(player); // this is illegal!
}
This is a simple example, only taking the X coordinate into consideration. The problem here is that oldTimestamp is stored as soon as the last location update was received by the server. This means that if there was a lag spike at that time, the old location update will have been received relatively much later than the new location update. This means that the time difference will not be accurate.
Example:
Client says: I am now at position 5x10
Lag spike: server receives this message at timestamp 500 (it should normally arrive at like 30)
....1 second movement...
Client says: I am now at position 20x15
No lag spike: server receives message at timestamp 1530
The server will now think that the time difference between these two locations is 1030. However, the real time difference is 1500. This could cause the anti-cheating detection to think that 1030 is not long enough, thus kicking the client.
Possible solution: let the client send a timestamp while sending, so that the server can use these timestamps instead.
Problem: the problem with that solution is that the player could manipulate the client to send a timestamp that is not legal, so the anti-cheating system won't kick in. This is not a good solution.
It is also possible to simply allow maxPlayerSpeed * 2 speed (for example), however this basically allows speed hacking up to twice as fast as normal. This is not a good solution either.
So: do you have any suggestions on how to fix this "server timestamp & latency" issue in order to make my anti-cheating measures worthwhile?
No no no.. with all due respect this is all wrong, and how NOT to do it.
The remedy is not trusting your clients. Don't make the clients send their positions, make them send their button states! View the button states as requests where the clients say "I'm moving forwards, unless you object". If the client sends a "moving forward" message and can't move forward, the server can ignore that or do whatever it likes to ensure consistency. In that case, the client only fools itself.
As for speed-hacks made possible by packet flooding, keep a packet counter. Eject clients who send more packets within a certain timeframe than the allowed settings. Clients should send one packet per tick/frame/world timestep. It's handy to name the packets based on time in whole timestep increments. Excessive packets of the same timestep can then be identified and ignored. Note that sending the same packet several times is a good idea when using UDP, to prevent packet loss.
Again, never trust the client. This can't be emphasized enough.
Smooth out lag spikes by filtering. Or to put this another way, instead of always comparing their new position to the previous position, compare it to the position of several updates ago. That way any short-term jitter is averaged out. In your example the server could look at the position before the lag spike and see that overall the player is moving at a reasonable speed.
For each player, you could simply hold the last X positions, or you might hold a lot of recent positions plus some older positions (eg 2, 3, 5, 10 seconds ago).
Generally you'd be performing interpolation/extrapolation on the server anyway within the normal movement speed bounds to hide the jitter from other players - all you're doing is extending this to your cheat checking mechanism as well. All legitimate speed-ups are going to come after an apparent slow-down, and interpolation helps cover that sort of error up.
Regardless of opinions on the approach, what you are looking for is the speed threshold that is considered "cheating".
Given a distance and a time increment, you can trivially see if they moved "too far" based on your cheat threshold.
time = thisTime - lastTime;
speed = distance / time;
if (speed > threshold) dudeIsCheating();
The times used for measurement are server received packet times. While it seems trivial, it is calculating distance for every character movement, which can end up very expensive. The best route is server calculate position based on velocity and that is the character's position. The client never communicates a position or absolute velocity, instead, the client sends a "percent of max" velocity.
To clarify:
This was just for the cheating check. Your code has the possibility of lag or long processing on the server affecting your outcome. The formula should be:
maxPlayerSpeed = 300; // = 300 pixels every 1 second
// receiveNewest()/receiveLast() are server receive times in seconds, so the ratio is pixels per second
if (maxPlayerSpeed <
    (distanceTraveled(oldPos, newPos) / (receiveNewest() - receiveLast())))
{
    disconnect(player); // this is illegal!
}
This compares the player's rate of travel against the maximum rate of travel. The timestamps are determined by when you receive the packet, not when you process the data. You can use whichever method you care to use to determine the updates to send to the clients, but for the threshold method you want for determining cheating, the above will not be impacted by lag.
Receive packet 1 at second 1: Character at position 1
Receive packet 2 at second 100: Character at position 3000
distance traveled = 2999
time = 99
rate = 30
No cheating occurred.
Receive packet 3 at second 101: Character at position 3301
distance traveled = 301
time = 1
rate = 301
Cheating detected.
What you are calling a "lag spike" is really high latency in packet delivery. But it doesn't matter since you aren't going by when the data is processed, you go by when each packet was received. If you keep the time calculations independent of your game tick processing (as they should be as stuff happened during that "tick") high and low latency only affect how sure the server is of the character position, which you use interpolation + extrapolation to resolve.
If the client is out of sync enough to where they haven't received any corrections to their position and are wildly out of sync with the server, there is significant packet loss and high latency which your cheating check will not be able to account for. You need to account for that at a lower layer with the handling of actual network communications.
For any game data, the ideal method is for all systems except the server to run behind by 100-200ms. Say you have an intended update every 50ms. The client receives the first and second. The client doesn't have any data to display until it receives the second update. Over the next 50 ms, it shows the progression of changes as it has already occurred (ie, it's on a very slight delayed playback). The client sends its button states to the server. The local client also predicts the movement, effects, etc. based on those button presses but only sends the server the "button state" (since there are a finite number of buttons, there are a finite number of bits necessary to represent each state, which allows for a more compact packet format).
The server is the authoritative simulation, determining the actual outcomes. The server sends updates every, say, 50ms to the clients. Rather than interpolating between two known frames, the server instead extrapolates positions, etc. for any missing data. The server knows what the last real position was. When it receives an update, the next packet sent to each of the clients includes the updated information. The client should then receive this information prior to reaching that point in time and the players react to it as it occurs, not seeing any odd jumping around because it never displayed an incorrect position.
It's possible to have the client be authoritative for some things, or to have a client act as the authoritative server. The key is determining how much trust in the client there is.
The client should be sending updates regularly, say, every 50 ms. That means that with a 500 ms "lag spike" (delay in packet reception), either all packets sent within the delay period will be delayed by a similar amount or the packets will be received out of order. The underlying networking should handle these delays gracefully (by discarding packets that have an overly large delay, enforcing in-order packet delivery, etc.). The end result is that with proper packet handling, the issues anticipated should not occur. Additionally, not receiving explicit character locations from the client and instead having the server explicitly correct the client and only receive control states from the client would prevent this issue.
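An added example (not code from the question or any of the answers) that implements the two server-side suggestions above in one validator: speeds are computed from server receive timestamps only, and the newest position is compared against a short history of earlier updates rather than just the previous one, skipping spans too short to trust. It reproduces the packet-1/2/3 numbers from the last answer and a constructed lag-spike sample; the validator name, min_span_ms and tolerance parameters, and the sample timings are assumptions.

```python
"""Server-side movement validation keyed on packet receive times.
Added example, not code from the question or the answers.  It combines the two
suggestions above: the speed is computed from server receive timestamps only
(never from client-supplied ones), and instead of comparing just the two most
recent updates it keeps a short history and compares the newest position with
every earlier one whose receive-time span is at least min_span_ms.  Short spans
are skipped because receive jitter (a delayed packet followed by an on-time one)
compresses them and inflates the apparent speed; long spans cannot be inflated much.
"""
from __future__ import annotations
import math
from collections import deque


class MovementValidator:
    def __init__(self, max_speed: float = 300.0, history: int = 10,
                 min_span_ms: int = 500, tolerance: float = 1.0):
        self.max_speed = max_speed              # px per second, the page's 300
        self.min_span_ms = min_span_ms          # spans shorter than this are jitter-dominated
        self.tolerance = tolerance              # 1.0 = exact threshold; 1.1 would allow 10% slack
        self.samples = deque(maxlen=history)    # (recv_ms, x, y), oldest first

    def on_update(self, recv_ms: int, x: float, y: float) -> bool:
        """Record an update stamped with the *server* receive time and return
        True if it is legal, False if the player is moving too fast.  A burst
        inside one min_span window is bounded by max_speed * min_span anyway,
        since the next long-span comparison still covers it."""
        legal = True
        for old_ms, ox, oy in self.samples:
            span_ms = recv_ms - old_ms
            if span_ms < self.min_span_ms:
                continue                        # too short to trust: receive jitter
            speed = math.hypot(x - ox, y - oy) * 1000.0 / span_ms
            if speed > self.max_speed * self.tolerance:
                legal = False
                break
        self.samples.append((recv_ms, x, y))
        return legal


def naive_check(prev, cur, max_speed: float = 300.0) -> bool:
    """The question's original check: only the two latest receive times."""
    (p_ms, px, py), (c_ms, cx, cy) = prev, cur
    dt = c_ms - p_ms
    return dt > 0 and math.hypot(cx - px, cy - py) * 1000.0 / dt <= max_speed


def lag_spike_scenario() -> dict:
    """Constructed sample: a legitimate player moves 25 px every 100 ms (250 px/s).
    Packets normally arrive 30 ms after sending; the packet sent at 300 ms is held
    up by a lag spike and arrives at 700 ms instead of 330 ms, just before the
    next one at 730 ms."""
    received = [(30, 0, 0), (130, 25, 0), (230, 50, 0), (700, 75, 0), (730, 100, 0)]
    naive = all(naive_check(a, b) for a, b in zip(received, received[1:]))
    v = MovementValidator()
    windowed = all(v.on_update(*s) for s in received)
    return {"naive_accepts": naive, "windowed_accepts": windowed}


def speed_hack_scenario() -> list[bool]:
    """The numbers from the last answer above, with seconds converted to ms:
    packet 1 at s=1 pos 1, packet 2 at s=100 pos 3000, packet 3 at s=101 pos 3301."""
    v = MovementValidator()
    return [v.on_update(1_000, 1, 0), v.on_update(100_000, 3000, 0), v.on_update(101_000, 3301, 0)]
```

The naive check kicks the honest player in the lag-spike sample (25 px in 30 ms reads as 833 px/s), while the windowed check sees 100 px over 700 ms; the 301 px/s packet is still caught because its 1 s span is long enough to be compared.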
