I have a finite set of tasks that need to be completed by clients. Clients are assigned a task on connection and keep getting new tasks after they finish the previous one. Each task needs to be completed by 3 unique clients, which guards against clients returning wrong results.
However, I don't want clients to take longer than 3000ms. As some tasks depend on each other, a slow client could stall progress.
The problem is that I'm having trouble checking task timeouts, which should be done when no free tasks are available.
At the moment each task has a property called assignedClients, which looks as follows:
assignedClients: [
    { client: Client, start: Date, completed: true },
    { client: Client, start: Date, completed: true },
    { client: Client, start: Date, completed: false }
]
All tasks (roughly 1000) are stored in a single array. Basically, when a client needs a new task, the pseudo-code is like this:
function onTaskRequest(client):
    for (task in tasks):
        if (task.assignedClients.length < 3):
            assignClientToTask(task, client)
            return
    // so no available tasks: look for a timed-out assignment to take over
    for (task in tasks):
        for (assigned in task.assignedClients):
            if (assigned.completed === false && now - assigned.start > 3000):
                assignClientToTask(task, client)
                return
But this seems very inefficient. Is there an algorithm that is more effective?

What you want to do is store tasks in a priority queue (which is often implemented as a heap) by when they are available with the oldest first. When a client needs a new task you just peek at the top of the queue. If it can be scheduled at all, it can be scheduled on that task.
When a task is first inserted, its priority is the current time, so it is available immediately. Once all of a task's slots are taken, you put it back in the queue with a priority equal to the expiry time of the oldest client that grabbed it and has not finished yet.
If you're using a heap, then all operations should be no worse than O(log(n)) as compared to your current O(n) implementation.
Your data structure looks like JSON, in which case a JavaScript priority queue implementation is easy to find with a web search. Your pseudocode looks like Python, in which case heapq is in the standard library. If you can't find an implementation in your language, a description of the binary heap data structure should be able to help you figure out how to implement it.
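To make the idea concrete, here is a minimal sketch using Python's heapq. It is only an illustration under some assumptions: the Scheduler class, its method names, and passing the time in as now_ms are all made up for the example, and the 3000ms limit and 3 clients per task are taken from the question. Full tasks are re-checked lazily when they reach the top of the heap, which keeps completion O(1).

```python
import heapq
import itertools

TIMEOUT_MS = 3000        # per-client time limit from the question
REQUIRED_CLIENTS = 3     # unique clients needed per task


class Scheduler:
    def __init__(self, task_names):
        self._tick = itertools.count()  # tie-breaker for equal times
        # task name -> {client: [start_ms, completed]}
        self._assigned = {name: {} for name in task_names}
        # Heap entries are (available_at_ms, tick, task name).
        self._heap = [(0, next(self._tick), name) for name in task_names]
        heapq.heapify(self._heap)

    def _reschedule(self, name, now_ms):
        clients = self._assigned[name]
        if len(clients) < REQUIRED_CLIENTS:
            available_at = now_ms  # a slot is still free
        else:
            pending = [start for start, done in clients.values() if not done]
            if not pending:
                return  # all results are in, drop the task for good
            # Full: next chance is when the oldest pending client expires.
            available_at = min(pending) + TIMEOUT_MS
        heapq.heappush(self._heap, (available_at, next(self._tick), name))

    def request_task(self, client, now_ms):
        """Assign `client` to a task and return its name, or None."""
        skipped, chosen = [], None
        while chosen is None and self._heap and self._heap[0][0] <= now_ms:
            entry = heapq.heappop(self._heap)
            name = entry[2]
            clients = self._assigned[name]
            # Free the slots of clients that ran out of time.
            for c, (start, done) in list(clients.items()):
                if not done and now_ms - start >= TIMEOUT_MS:
                    del clients[c]
            if len(clients) >= REQUIRED_CLIENTS:
                self._reschedule(name, now_ms)  # stale entry, still full
            elif client in clients:
                skipped.append(entry)  # this client already holds the task
            else:
                clients[client] = [now_ms, False]
                self._reschedule(name, now_ms)
                chosen = name
        for entry in skipped:
            heapq.heappush(self._heap, entry)
        return chosen

    def complete(self, client, name):
        """Record a result; ignored if the slot was already reclaimed."""
        entry = self._assigned[name].get(client)
        if entry is not None:
            entry[1] = True
```

Because each client must be unique per task, the sketch may have to look past the top entry when the requesting client already holds that task, so a single request is O(log(n)) only when that does not happen often.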


NATS KV history larger than specified when creating bucket

We are using NATS with the key-value store feature (NATS KV). We develop Go microservices and use the NATS Go client. We are trying to leverage the history feature of NATS KV, with no success so far.
Sometimes we retrieve a larger history than the one specified when creating the KV bucket.
We create the KV using :
kv, _ := js.CreateKeyValue(&nats.KeyValueConfig{
    Bucket:       "some-bucket",
    Description:  "store for some-service",
    MaxValueSize: 0,
    History:      10, // should we ever get more than 10 elements when reading history ?
    MaxBytes:     5000000,
    Storage:      nats.MemoryStorage,
    Replicas:     0,
    Placement:    nil,
})
and we retrieve values using
When we get results larger than the specified History, we get several KeyValueEntrys with the same delta value.
We are quite write-intensive, and we also reuse the same key id quite a lot:
we write values until a certain point,
call kv.Purge("someId")
and then we may reuse "someId" later on in the process.
Writes and reads are asynchronous and concurrent.
Here is our client go.mod regarding nats: v2.8.4 v1.16.0
and we run a nats server version 2.8.4.
Note: I did not dig far enough into the KV implementation details, but I am worried that this is linked to JetStream. It seems that a watcher is created each time and re-reads all previous values regardless of history size. This leads me to another question: is the KV history feature appropriate for read-intensive use cases?
Thanks for your help or pointers on this matter.

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an Apache Beam pipeline to GCP Dataflow in a DEV environment and everything worked well. Then I deployed it to production in a Europe environment (to be specific, job region: europe-west1, worker location: europe-west1-d), where we get high data velocity, and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds to release events sooner than the end of session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey step. In this step there are thousands of events under the droppedDueToLateness counter, and dataFreshness keeps increasing (it has been increasing since I deployed). All steps before this one operate well, and all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and saw that the EventToSession/GroupPairsByKey step is processing between 100K and 200K keys per second (depending on the time of day), which seems like quite a lot to me. CPU utilization doesn't go over 70% and I am using Streaming Engine. The number of workers is 2 most of the time. Max worker memory capacity is 32GB, while max worker memory usage currently stands at 23GB. I am using the e2-standard-8 machine type.
I don't have any hot keys, since each session contains at most a few dozen events.
My biggest suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. On the other hand, a session usually relates to a single customer, so Google should expect to handle this number of keys per second, no?
Would like to get suggestions how to solve the dataFreshness and events droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event)))
        .withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
    .apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
    .apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30))))
    .apply("GroupPairsByKey", GroupByKey.create())
    .apply("CreateCollectionOfValuesOnly", Values.create())
    .apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
Regarding the constantly increasing data freshness: as long as late data is allowed to arrive in a session window, that window persists in memory. This means that allowing 30 days of late data keeps every session in memory for at least 30 days, which can obviously overload the system. Moreover, I found we had some everlasting sessions caused by bots visiting and taking actions on the websites we are monitoring. These bots can hold sessions open forever, which can also overload the system. The solution was to decrease the allowed lateness to 2 days and to use bounded sessions (look for "bounded sessions").
Regarding events dropped due to lateness: these are events that, at their time of arrival, belong to an expired window, that is, a window whose end the watermark has already passed (see the documentation for droppedDueToLateness). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it reaches the sessions part and to stream to the sessions part only events that won't be dropped, i.e. events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event if its timestamp is before event_arrival_time - (gap_duration + allowed_lateness), even if there is a live session that the event belongs to.)
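The routing condition above can be written as a small predicate. This is just a sketch in Python rather than the pipeline's Java: the function name is hypothetical, the 30 minute gap comes from the question and the 2 day lateness from the fix described above, and like the condition itself it uses the arrival time as a stand-in for the watermark.

```python
from datetime import datetime, timedelta, timezone

GAP_DURATION = timedelta(minutes=30)   # session gap from the question
ALLOWED_LATENESS = timedelta(days=2)   # allowed lateness after the fix


def goes_to_sessions(event_timestamp, arrival_time,
                     gap=GAP_DURATION, lateness=ALLOWED_LATENESS):
    """True if the event is recent enough to be sent through session windowing.

    Events for which this returns False would be written out without
    session data instead of being dropped by the GroupByKey.
    """
    return event_timestamp >= arrival_time - (gap + lateness)


arrival = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
on_time = goes_to_sessions(arrival - timedelta(days=1), arrival)
too_late = goes_to_sessions(arrival - timedelta(days=3), arrival)
```

In the pipeline itself the same check would sit in a filter or a multi-output step placed before the session window.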
P.S. In the bounded sessions part, where he demonstrates how to implement a time-bounded session, I believe he has a bug that allows a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects the session and precedes it, making the session's start time earlier and thereby expanding the session. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
To fix that, I switched the order of the current window span and the if-statement, and edited the if-statement (the one checking the session max size) in the mergeWindows function in the window spanning part, so that a session can't exceed the max size and can only accept data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
List<IntervalWindow> sortedWindows = new ArrayList<>();
for (IntervalWindow window : {
List<MergeCandidate> merges = new ArrayList<>();
MergeCandidate current = new MergeCandidate();
for (IntervalWindow window : sortedWindows) {
MergeCandidate next = new MergeCandidate(window);
if (current.intersects(window)) {
if ((current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= {
current = next;
for (MergeCandidate merge : merges) {

SQS task going in DLQ despite being successful in Lambda + also when deleted manually

I have built my own application around AWS Lambda and Salesforce.
I have around 10 users on my internal app, so we are not talking about heavy usage.
On a normal day around 500-1000 SQS tasks are processed, and one task can take around 1-60 seconds depending on its complexity.
This is working perfectly.
Timeout for my lambda is 900.
BatchSize = 1
Using Python 3.8
I've created a decorator which allows me to process through SQS some of my functions which required to be processed ASYNC with FIFO logic.
Everything is working well.
My Lambda function doesn't return anything at the end, but it completes successfully (standard scenario). However, I have noticed that some tasks were going into my DLQ (I only allow processing once; if a message gets presented again it goes into the DLQ immediately).
What I don't get is why this is happening.
Lambda ends with success --> normally the task should be deleted from the initial SQS queue.
So I've added a manual deletion of the processed task at the very end of the function. I've logged the result returned by boto3.client.delete_message and I get a 200 status, so everything is OK... However, once in a while (1 out of 100, so 10 times per day in my case) I can see the task going into the DLQ.
Reprocessing the same task in my standard queue without changing anything, it gets processed successfully (again) and deleted (as initially expected).
What bothers me most is that even after deleting the message, it still sometimes ends up in the DLQ. What could be the problem?
Example of my async processor
import json
import sys

import boto3

# settings and utils are modules from my own project


def process_data(event, context):
    """
    By convention, we need to store in the table AsyncTaskQueueName a dict with the following parameters:
    - python_module: use to determine the location of the method to call asynchronously
    - python_function: use to determine the location of the method to call asynchronously
    - uuid: uuid to get the params stored in dynamodb
    """
    try:
        print('Start Processing Async')
        client = boto3.client('sqs')
        queue_url = client.get_queue_url(QueueName=settings.AsyncTaskQueueName)['QueueUrl']
        # batch size = 1 so only 1 record to process
        for record in event['Records']:
            kwargs = json.loads(record['body'])
            print(f'Start Processing Async Data Record:\n{kwargs}')
            python_module = kwargs['python_module']
            python_function = kwargs['python_function']
            getattr(sys.modules[python_module], python_function)(uuid=kwargs['uuid'], is_in_async_processing=True)
            print('End Processing Async Data Record')
            res = client.delete_message(QueueUrl=queue_url, ReceiptHandle=record['receiptHandle'])
            print(f'End Deleting Async Data Record with status: {res}')  # When the problem I'm monitoring occurs, it goes up to this line, with res status = 200 !! That's where I'm losing my mind. I can confirm the uuid in the DLQ being the same as in the queue so we are definitely talking of the same message which has been moved to the DLQ.
    except Exception:
        # set expire to 0 so that the task goes into DLQ
        utils.raise_exception(f'There was a problem during async processing. Event:\n'
                              f'{json.dumps(event, indent=4, default=utils.jsonize_datetime)}')
Example of today's bug with logs from CloudWatch:
Initial event:
{'Records': [{'messageId': '75587372-256a-47d4-905b-62e1b42e2dad', 'receiptHandle': 'YYYYYY", "python_module": "quote.processing", "python_function": "compute_price_data"}', 'attributes': {'ApproximateReceiveCount': '1', 'SentTimestamp': '1621432888344', 'SequenceNumber': '18861830893125615872', 'MessageGroupId': 'compute_price_data', 'SenderId': 'XXXXX:main-app-production-main', 'MessageDeduplicationId': 'b4de6096-b8aa-11eb-9d50-5330640b1ec1', 'ApproximateFirstReceiveTimestamp': '1621432888344'}, 'messageAttributes': {}, 'md5OfBody': '5a67d0ed88898b7b71643ebba975e708', 'eventSource': 'aws:sqs', 'eventSourceARN': 'arn:aws:sqs:eu-west-3:XXXXX:async_task-production.fifo', 'awsRegion': 'eu-west-3'}]}
Res (after calling delete_message):
End Deleting Async Data Record with status: {'ResponseMetadata': {'RequestId': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'date': 'Wed, 19 May 2021 14:02:47 GMT', 'content-type': 'text/xml', 'content-length': '215'}, 'RetryAttempts': 0}}
BUT... 75587372-256a-47d4-905b-62e1b42e2dad is in the DLQ after this delete_message. I'm going crazy.
OK, the problem was that the timeout was set to 900 in my serverless.yml but not in AWS. I may have changed it manually to 1 minute, so my long tasks were released after 1 minute and then went immediately to the DLQ.
Hence the deletion did nothing, since the task was already in the DLQ when the deletion was made.
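For anyone hitting the same thing, a quick way to spot this kind of drift is to read the deployed values back from AWS and compare them with what you expect. The sketch below is only an illustration: check_timeouts and its parameters are hypothetical names, and it assumes boto3 clients are passed in. It uses get_function_configuration and get_queue_attributes to read the function timeout and the queue's visibility timeout.

```python
def check_timeouts(lambda_client, sqs_client, function_name, queue_url,
                   expected_timeout_s):
    """Return a list of warnings; empty if the deployed settings look consistent."""
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    deployed_timeout_s = config["Timeout"]

    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["VisibilityTimeout"]
    )
    # SQS returns attribute values as strings.
    visibility_s = int(attrs["Attributes"]["VisibilityTimeout"])

    warnings = []
    if deployed_timeout_s != expected_timeout_s:
        warnings.append(
            f"Lambda timeout is {deployed_timeout_s}s in AWS, "
            f"expected {expected_timeout_s}s"
        )
    if visibility_s < deployed_timeout_s:
        warnings.append(
            f"Queue visibility timeout ({visibility_s}s) is shorter than "
            f"the Lambda timeout ({deployed_timeout_s}s)"
        )
    return warnings
```

It would be called with something like check_timeouts(boto3.client('lambda'), boto3.client('sqs'), ...), passing your own function name and queue URL along with the 900 from serverless.yml.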

Concurrency in Redis for flash sale in distributed system

I am going to build a flash sale system that will run on 15 servers at a time, all sharing the same Redis instance.
So the algorithm of the flash sale will be:
Set the max inventory for a product id in Redis
using redisTemplate.opsForValue().set(key, 400L);
For every request:
get the current inventory using Long val = redisTemplate.opsForValue().get(key);
check if it is non-zero
    if (val == null || val == 0) {
        System.out.println("not taking order....");
    }
put the order in Kafka
and decrement using redisTemplate.opsForValue().decrement(key)
But the problem here is concurrency :
If I set the inventory to 400 and test it with 500 request threads,
the inventory becomes negative.
If I make the function synchronized, I cannot manage it across distributed servers.
So what will be the best approach to it?
Note: I can not go for RDBMS and set isolation level because of high request count.
Redis is single-threaded, so running a Lua script on it is always atomic.
You can therefore define a Lua script on your Redis instance and run it from your Spring instances.
Your Lua script would just be a sequence of operations to execute against your Redis instance (the only one that holds the correct value of your stock) and would return, for instance, the new value, or an error if the value would become negative.
Your Lua script is basically a Redis transaction. There are other ways to achieve a Redis transaction, but IMHO Lua is the simplest of all (maybe the least performant, but I have found that in most cases it is fast enough).
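As a rough illustration, such a script could look like the one below. It is shown from Python with a redis-py style client purely to keep the example short. The script text, the -1 "sold out" convention and the reserve_one helper are assumptions made for this sketch, not something from the question. From Spring, the same script could be run through RedisTemplate's script execution support.

```python
# Lua script: check the stock and decrement it in one atomic step.
# Returns the remaining stock, or -1 if nothing is left (or the key is missing).
RESERVE_ONE_LUA = """
local raw = redis.call('GET', KEYS[1])
if not raw then return -1 end
local stock = tonumber(raw)
if stock <= 0 then return -1 end
return redis.call('DECR', KEYS[1])
"""


def reserve_one(client, key):
    """Reserve one unit of `key`.

    `client` is assumed to expose eval(script, numkeys, *keys_and_args),
    as redis-py clients do. Returns the remaining stock after the
    reservation, or None if the product is sold out.
    """
    remaining = client.eval(RESERVE_ONE_LUA, 1, key)
    return None if remaining < 0 else remaining
```

Only requests for which reserve_one returns a number would then put the order in Kafka, so 500 concurrent requests against a stock of 400 would yield exactly 400 orders and the counter would stop at 0.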

Job with multiple tasks on different servers

I need to have a Job with multiple tasks, being run on different machines, one after another (not simultaneously), and while the current job is running, another same job can arrive to the queue, but should not be started until the previous one has finished. So I came up with this 'solution' which might not be the best but it gets the job done :). I just have one problem.
I figured out I would need a JobQueue (either MongoDb or Redis) with the following structure:
hostname: 'host where to execute the task',
task: 'current task number',
[task_id:1, commands:'run these commands', hostname:'aaa'],
[task_id:2,commands:'another command', hostname:'bbb']
search for the jobs with same hostname, and running==FALSE
execute the task that is set in that job
upon finish, host sets running=FALSE, checks if there are any other tasks to perform and increases task number + sets the hostname to the next machine from the next task
Because jobs can accumulate, imagine situation when jobs are queued for one host like this: A,B,A
Since I have to run all the jobs for the specified machine how do I not start the 3rd A (first A is still running)?
_id : ObjectId("xxxx"), // unique, generated by MongoDB, indexed, sortable
hostname: 'host where to execute the task',
task: 'current task number',
[task_id:1, commands:'run these commands', hostname:'aaa'],
[task_id:2,commands:'another command', hostname:'bbb']
The question is how would the next available "worker" know whether it's safe for it to start the next job on a particular host.
You probably need to have some sort of a sortable (indexed) field to indicate the arrival order of the jobs. If you are using MongoDB, then you can let it generate _id which will already be unique, indexed and in time-order since its first four bytes are timestamp.
You can now query to see if there is a job to run for a particular host like so:
// pseudo code - shell syntax, not actual code
var jobToRun = db.queue.findOne({hostname:<myHostName>},{},{sort:{_id:1}});
if (jobToRun.running == FALSE) {
    myJob = db.queue.findAndModify({query:{_id:jobToRun._id, running:FALSE},update:{$set:{running:TRUE}}});
    if (myJob == null) print("Someone else already grabbed it");
    else {
        /* now we know that we updated this and we can run it */
    }
} else { /* sleep and try again */ }
What this does is check for the oldest/earliest job for a specific host. It then looks to see if that job is running. If it is, do nothing (sleep and try again?); otherwise try to "lock" it by doing findAndModify on _id and running FALSE, setting running to TRUE. If that document is returned, it means this process succeeded with the update and can now start the work. Since two threads can both be trying to do this at the same time, if you get back null it means that this document was already changed to running by another thread, and we wait and start again.
I would advise using a timestamp somewhere to indicate when a job started "running" so that if a worker dies without completing a task it can be "found" - otherwise it will be "blocking" all the jobs behind it for the same host.
What I described works for a queue where you would remove the job when it was finished rather than setting running back to FALSE - if you set running to FALSE so that other "tasks" can be done, then you will probably also be updating the tasks array to indicate what's been done.
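The same check-then-lock step, written against a driver instead of the shell, could look roughly like this. It is a sketch in Python assuming a pymongo-style collection object: claim_next_job and the started_at field are hypothetical names, the latter added to carry the timestamp suggested above.

```python
from datetime import datetime, timezone


def claim_next_job(queue, hostname, now=None):
    """Try to lock the oldest job for `hostname`.

    `queue` is assumed to behave like a pymongo Collection. Returns the
    job document as it was before the update, or None if there is nothing
    to run, the oldest job is already running, or another worker won the race.
    """
    now = now or datetime.now(timezone.utc)
    # Oldest job for this host, relying on _id being in arrival order.
    oldest = queue.find_one({"hostname": hostname}, sort=[("_id", 1)])
    if oldest is None or oldest.get("running"):
        return None
    # Atomic compare-and-set: matches only if nobody flipped `running` meanwhile.
    return queue.find_one_and_update(
        {"_id": oldest["_id"], "running": False},
        {"$set": {"running": True, "started_at": now}},
    )
```

With jobs queued for hosts A, B, A and the first A still running, a call for host A returns None because only the oldest A job is considered, so the third job is not started early.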
