How to choose the right value for the expiryTime parameter for RedLockFactory.CreateLockAsync() method? - redlock.net

I am using the RedLock.net library for resource locking. To lock a resource I use RedLockFactory.CreateLockAsync:
public async Task<IRedLock> RedLockFactory.CreateLockAsync(
    string resource,
    TimeSpan expiryTime,
    TimeSpan waitTime,
    TimeSpan retryTime,
    CancellationToken? cancellationToken = null)
I understand that this method will attempt to acquire a lock for up to waitTime, retrying every retryTime. However, I do not understand what the right value for expiryTime would be.
Once a lock has been acquired, it is kept until it is disposed, irrespective of expiryTime. In other words, even if expiryTime is set to 5 seconds, if the lock is only disposed after 10 seconds then the lock is held for 10 seconds.
In many examples a value of 30 is used without explanation.
I have tested with a value of 0: a lock is not acquired at all.
I have tested with a value of 5 milliseconds: a lock is acquired and kept until disposed.
So how do I choose the right value for the expiryTime parameter? It seems to me that this parameter is unnecessary and that any positive value will do.

ExpiryTime determines the maximum time that a lock will be held in the case of a failure (say, the process holding the lock crashing). It also indirectly determines how often the lock is renewed while it is being held.
e.g.
If you set an expiry time of 10 minutes:
the automatic lock renewal timer will call out to redis every 5 minutes (expiry time / 2) to extend the lock
if your process crashes without releasing the lock, you will have to wait up to a maximum of 10 minutes until the key expires in redis and another process can take out a lock on the same resource
If you set an expiry time of 10 milliseconds:
the automatic lock renewal timer will call out to redis every 5 milliseconds (expiry time / 2) to extend the lock (which might be a little excessive)
if your process crashes without releasing the lock, you will have to wait up to a maximum of 10 milliseconds until the key expires in redis and another process can take out a lock on the same resource
It's a balance between how long you're willing to wait for a lock to expire in the failure case and how much load you put on your Redis servers.
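Putting that together, a minimal usage sketch (the resource name and the redLockFactory instance are assumptions; the parameters follow the signature above). An expiry of around 30 seconds is a common middle ground: renewal calls go out roughly every 15 seconds, and a crashed holder blocks other processes for at most 30 seconds:

```csharp
// Assumes an existing, connected RedLockFactory instance named redLockFactory.
var expiry = TimeSpan.FromSeconds(30); // renewed ~every 15s; max stale-lock time after a crash
var wait = TimeSpan.FromSeconds(10);   // keep trying to acquire for up to 10s
var retry = TimeSpan.FromSeconds(1);   // retry once per second while waiting

using (var redLock = await redLockFactory.CreateLockAsync("my-resource", expiry, wait, retry))
{
    if (redLock.IsAcquired)
    {
        // Do the protected work. The lock key is extended in the background
        // until Dispose() releases it, however long the work takes.
    }
}
```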

Related

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an Apache Beam pipeline to GCP Dataflow in a DEV environment and everything worked well. Then I deployed it to production in the Europe environment (to be specific - job region: europe-west1, worker location: europe-west1-d), where we get high data velocity, and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds to release events sooner than the end of session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey step. In this step there are thousands of events under the droppedDueToLateness counter, and dataFreshness keeps increasing (it has been increasing since I deployed). All steps before this one operate well; all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K and 200K keys per second (depending on the time of day), which seems like quite a lot to me. CPU utilization doesn't go over 70% and I am using Streaming Engine. The number of workers is 2 most of the time. Max worker memory capacity is 32GB while max worker memory usage currently stands at 23GB. I am using the e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. But on the other hand, a session usually relates to a single customer, so Google should be expected to handle this number of keys per second, no?
I would like suggestions on how to solve the dataFreshness and droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input
    .apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event)))
        .withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
    .apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event)))
    .setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
    .apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.standardDays(30)))
    .apply("GroupPairsByKey", GroupByKey.create())
    .apply("CreateCollectionOfValuesOnly", Values.create())
    .apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
Regarding the constantly increasing data freshness: as long as late data is allowed to arrive in a session window, that window persists in memory. This means that allowing data 30 days late keeps every session in memory for at least 30 days, which can obviously overload the system. Moreover, I found we had some everlasting sessions created by bots visiting and taking actions on the websites we monitor. These bots can hold sessions open forever, which can also overload the system. The solution was decreasing the allowed lateness to 2 days and using bounded sessions (search for "bounded sessions").
Regarding events dropped due to lateness: these are events that, at their time of arrival, belong to an expired window, i.e. a window whose end the watermark has already passed (see the documentation for droppedDueToLateness). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it enters the session part, and to stream into the session part only events that won't be dropped, i.e. events meeting this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event whose timestamp is before event_arrival_time - (gap_duration + allowed_lateness) even if there is a live session the event belongs to...)
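The pre-filter condition above can be sketched as a plain predicate (the class and method names are made up for illustration, and epoch milliseconds are assumed; in the real pipeline the values would come from the event payload and the processing-time clock):

```java
// Sketch of the pre-filter described above.
public class SessionFilter {

    // An event survives the session window's GroupByKey only if its timestamp
    // is no older than arrival time minus (gap duration + allowed lateness).
    public static boolean canJoinSession(long eventTimestampMillis,
                                         long arrivalTimeMillis,
                                         long gapDurationMillis,
                                         long allowedLatenessMillis) {
        return eventTimestampMillis >= arrivalTimeMillis - (gapDurationMillis + allowedLatenessMillis);
    }

    public static void main(String[] args) {
        long gap = 30L * 60 * 1000;               // 30-minute session gap
        long lateness = 2L * 24 * 60 * 60 * 1000; // 2 days allowed lateness
        long now = System.currentTimeMillis();
        // An event stamped well within the window goes to the session path.
        System.out.println(canJoinSession(now - gap, now, gap, lateness)); // true
    }
}
```

Events failing the predicate would be routed straight to the BigQuery sink, bypassing the windowing.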
P.S. - in the bounded-sessions part, where the author demonstrates how to implement a time-bounded session, I believe there is a bug that allows a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects the session and precedes it, moving the session's start time earlier and thereby expanding the session. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
To fix that, I swapped the order of the window-span update and the if-statement (the one checking the session max size) in the mergeWindows function, and edited the if-statement, so that a session can't exceed the max size and only data that doesn't extend it beyond the max size can be added. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
    // Sort the candidate windows by start time so they can be scanned in order.
    List<IntervalWindow> sortedWindows = new ArrayList<>();
    for (IntervalWindow window : c.windows()) {
        sortedWindows.add(window);
    }
    Collections.sort(sortedWindows);

    List<MergeCandidate> merges = new ArrayList<>();
    MergeCandidate current = new MergeCandidate();
    for (IntervalWindow window : sortedWindows) {
        MergeCandidate next = new MergeCandidate(window);
        if (current.intersects(window)) {
            // Only merge if the resulting span stays within maxSize + gapDuration.
            if (current.union == null
                    || new Duration(current.union.start(), window.end()).getMillis()
                        <= maxSize.plus(gapDuration).getMillis()) {
                current.add(window);
                continue;
            }
        }
        // Either no intersection or the merge would exceed the max size:
        // close out the current candidate and start a new one.
        merges.add(current);
        current = next;
    }
    merges.add(current);
    for (MergeCandidate merge : merges) {
        merge.apply(c);
    }
}

How can I extend the timeout of a context in Go?

For example, a context is created with a timeout of 10 seconds.
After a while (e.g. 2 seconds later), I want to refresh it so that it expires 10 seconds from that point.
What can I do?
context.Context is not designed that way. context.Context is delegated down to workers, and if a worker finds that more time should be allowed, it can't override the "master's call".
If you have a situation where an initial 10-second timeout is to be used but those 10 seconds are not set in stone (e.g. the timeout may change before it expires), then don't use a context with a 10-second timeout. Instead, use a context with a cancel function: context.WithCancel(), and manage the 10-second timeout yourself (e.g. with time.AfterFunc() or a time.Timer). If the timeout expires and you (or your workers) did not detect that it should be extended, call the cancel function.
If, before the deadline, you detect that the timeout should be extended, reset the timer and do not cancel the context.

Set infinite session timeout but limited per request timeout

I'm trying to quickly connect to a couple of thousand sites (some up, some down), but it seems that setting aiohttp.ClientTimeout(total=60) and passing it to ClientSession means only 60 seconds are allowed in total for all sites. This means that after about a minute they all quickly fail with concurrent.futures._base.TimeoutError. I tried raising that timeout, which fixes that failure, but then all of the threads end up hung on non-responding sites for the entire duration.
Is it possible to disable the total timeout but have a per-request timeout of 60 seconds? (Edited) - There does seem to be a timeout parameter on session.get(...), but it appears to override the session timeout and cause the entire session to time out upon expiration, not just that request. If I set my ClientSession timeout to 600 but the session.get timeout to 15, all requests fail after 15 seconds.
I want to get through my full list of a couple of thousand sites waiting at most 60 seconds for each connection, but with no total time limit. Is the only way to do this to create a new session for each request?
timeout = aiohttp.ClientTimeout(total=60)
connector = aiohttp.TCPConnector(limit=40)
dummy_jar = aiohttp.DummyCookieJar()
async with aiohttp.ClientSession(connector=connector, timeout=timeout, cookie_jar=dummy_jar) as session:
    for site in sites:
        task = asyncio.ensure_future(make_request(session, site, connection_pool))
        tasks.append(task)
    await asyncio.wait(tasks)
connection_pool.close()
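One approach, sketched under the assumption that bounding each connection attempt and read is acceptable: aiohttp.ClientTimeout takes separate total, connect, sock_connect and sock_read fields, so the session-wide total can be disabled while each individual connection stays bounded. The fetch_all/fetch_one names below are illustrative, not part of aiohttp.

```python
import asyncio

import aiohttp


async def fetch_one(session, site):
    # Each request is bounded by the session's sock_connect/sock_read limits.
    async with session.get(site) as resp:
        return resp.status


async def fetch_all(sites):
    # total=None disables the whole-session wall-clock limit; sock_connect and
    # sock_read bound each connection attempt and each read, so a dead site
    # can only cost ~60s, not the whole run.
    timeout = aiohttp.ClientTimeout(total=None, sock_connect=60, sock_read=60)
    connector = aiohttp.TCPConnector(limit=40)
    async with aiohttp.ClientSession(connector=connector, timeout=timeout,
                                     cookie_jar=aiohttp.DummyCookieJar()) as session:
        tasks = [asyncio.ensure_future(fetch_one(session, site)) for site in sites]
        # return_exceptions=True keeps one timed-out site from cancelling the rest.
        return await asyncio.gather(*tasks, return_exceptions=True)
```

This keeps a single shared session (and its connection pool) rather than creating one per request.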

Socket Exception while running load test with Self Provisioned test rig

I am getting a SocketException while running a load test on a self-provisioned test rig.
I trigger the load tests on the agent machine (the self-provisioned test rig) from my local machine.
Note: for the first 2 to 3 minutes, test iterations pass; after that we get the SocketException.
Below is the error message :
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
Below are the stack trace details :
at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult)
at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
Run time - 20 min
Sample rate - 10 sec
Warm-up duration - 10 sec
Number of agents used - 2
Load pattern:
Initial load - 10 users
Max user count - 300
Step duration - 10 sec
Step user count - 10
Changing the above values, I still get the exception in the same way.
I am using Visual studio 2015 enterprise.
The question states: start with 10 users, every 10 seconds add 10 users to a maximum of 300. Thus after 29 increments there will be 300 users and that will take 29*10 seconds which is 4m50s. The test will thus (attempt to) run with the maximum load of 300 users for the remaining 15m10s.
Given that all tests pass for the first 2 or 3 minutes, plus the error message, this suggests that you are overloading some part of the network. It might be the agents, it might be the servers, or it might be the connections between them. Some network components have a maximum number of active connections, and 300 users might be too many.
Increasing the load so rapidly means you do not clearly know what the limiting value is. The sampling rate (every 10 seconds) also seems high: at each sampling interval a lot of data is transferred (i.e. the sample data), and that can swamp parts of the network. You should look at the network counters for the agents and controller, and for the servers if available.
I recommend changing the load test steps to add 10 users every 30 seconds, so it takes about 15 minutes to reach 300 users. It may also be worth reducing the sample rate to every 20 seconds.

Redis subscriber is not notified by EXPIRE key 0

I've got a Redis client subscribed to __keyevent@0__:expired notifications. It works perfectly, either when the key expires by itself (TTL reached) or when I expire it manually with a number of seconds greater than 0, like so:
EXPIRE myKey 1
The subscriber sees the expired event and can therefore take some actions.
However, if I want to manually delete the key and have the subscriber notified, I use EXPIRE with 0 as the number of seconds:
EXPIRE myKey 0
The key gets deleted, but the subscriber doesn't receive anything.
I can't see anything related to this in the doc. Can anyone explain this behavior?
From reviewing the source code (expire.c, around line 252), setting an expiry of <= 0 (or using EXPIREAT with a time in the past) results in the key being deleted rather than expired, and accordingly a del notification rather than an expired event.
This behavior is indeed undocumented and it would be good if you could submit a PR that fixes that to the documentation repo (https://github.com/antirez/redis-doc).
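So a subscriber that also needs to react to EXPIRE myKey 0 (or a plain DEL) can listen on the del event channel as well. A sketch with redis-cli, assuming a live server on database 0 (in notify-keyspace-events, E enables keyevent notifications, g covers generic commands such as DEL, and x covers expired events):

```shell
# Enable notifications for expired events (x) and generic commands like DEL (g).
redis-cli config set notify-keyspace-events Egx

# Watch both kinds of key removal in one subscriber.
redis-cli psubscribe "__keyevent@0__:expired" "__keyevent@0__:del"
```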
