How to process a job periodically for all users at large scale?

I have a large set of users in my project, around 50 million.
I need to create a playlist for each user every day. I currently do it like this:
I have a column in my users table, last_playlist_created_at, that holds the last time a playlist was created for that user.
I query the users table for the top 1000 users whose last_playlist_created_at is more than one day old, sorted by last_playlist_created_at in ascending order.
Then I loop over the result and publish one message per user to my message broker.
Behind the message broker, around 64 workers process the messages (create a playlist for the user) and update last_playlist_created_at in the users table.
When the broker's message list is empty, I repeat these steps (in a while / do-while loop).
I think the processing side is good enough and can scale,
but the way we create the message for each user does not scale!
How should I dispatch such a large set of messages, one for each of my users?

OK, so my answer is based entirely on your comment, where you mentioned that you use while(true) to check whether the playlist needs to be updated, which does not seem so trivial.
Although this is a design question and there are multiple solutions, here's how I would solve it.
First, think of updating the playlist for a user as a job.
In your case this is a scheduled job, i.e. it runs once a day.
So, use a scheduler to schedule the next run time of each job.
Write a scheduled job handler that pushes each due job to a message queue. This part exists only to handle multiple jobs at the same time, so that you can control the flow.
Generate the playlist for the user based on the job, then create a schedule event for the next day.
You could persist the scheduled job data to avoid race conditions.
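The steps above can be sketched as follows. This is only a minimal in-memory illustration, not a definitive implementation: `Scheduler`, `dispatch_due`, and `handle_message` are hypothetical names, the heap stands in for a persisted schedule table, and a plain list stands in for the message broker.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime, timedelta

DAY = timedelta(days=1)


@dataclass(order=True)
class ScheduledJob:
    run_at: datetime                      # jobs are ordered by run time only
    user_id: int = field(compare=False)


class Scheduler:
    """In-memory stand-in for a persisted schedule (assumption)."""

    def __init__(self):
        self._heap = []

    def schedule(self, user_id, run_at):
        heapq.heappush(self._heap, ScheduledJob(run_at, user_id))

    def pop_due(self, now, limit):
        """Return up to `limit` jobs whose run_at <= now, earliest first."""
        due = []
        while self._heap and self._heap[0].run_at <= now and len(due) < limit:
            due.append(heapq.heappop(self._heap))
        return due


def dispatch_due(scheduler, queue, now, limit=1000):
    """Scheduled job handler: move due jobs onto the message queue."""
    jobs = scheduler.pop_due(now, limit)
    for job in jobs:
        queue.append(job.user_id)
    return len(jobs)


def handle_message(scheduler, user_id, now):
    """Worker: generate the playlist, then schedule the next day's job."""
    # Playlist generation for user_id would happen here.
    scheduler.schedule(user_id, now + DAY)
```

The `limit` argument is what lets the handler control the flow: it never pushes more than a fixed number of messages per pass, and nothing scans the whole users table.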

Related

Laravel - Removing jobs on queue based on model association

I'm new to jobs and queues.
At the moment, I'm only really using the ->later() method on a Mail. This places each mail on the default queue.
There are instances where I need to cancel jobs on the queue related to a specific model ID. I don't really see any reference to deleting pending jobs in the queue, only to deleting / clearing failed ones.
In Telescope, there are tags showing the Model IDs associated with each pending job.
There are a few things I was hoping to do:
Delete all jobs associated with a specific model ID
Listen for the execution of a job based on a specific model ID, so I may update the database table with the date/timestamp of when the job actually executed. (users can queue emails to send hours in advance and I'd like to log when their customer actually receives the email)
Remove record associated with job since it should not exist if the email didn't actually get sent.
Hoping for some advice on how to solve this problem of needing to manage jobs in this fashion.
I'm using Redis if that makes any difference.

How to avoid concurrent requests to a lambda

I have a ReportGeneration lambda that takes a request from the client and adds an entry with the following attributes to a DDB table.
Customer ID <hash key>
ReportGenerationRequestID(UUID) <sort key>
ExecutionStartTime
ReportExecutionStatus <workflow status>
I have enabled a DDB stream trigger on this table, and creating an entry in this table triggers the report generation workflow. This is a multi-step workflow that takes a while to complete.
Where ReportExecutionStatus is the status of the report processing workflow.
I am supposed to maintain the history of all report generation requests that a customer has initiated.
What I am trying to do now is avoid concurrent processing requests by the same customer: if a report for a customer is already being generated, don't create another record in DDB.
Option considered:
Query DDB for the customerID (consistent read):
- From the list, see if any entry is either InProgress or Scheduled
- If not, create a new one (consistent write)
- Otherwise, return the already existing one
Issue: if the customer clicks twice within a split second to generate a report, two lambdas can be triggered, causing two entries in DDB, and two parallel workflows can be started, which is something I don't want.
Can someone recommend the best approach to ensure that there are no concurrent executions (two workflows) for the same report from the same customer?
In short, when one execution is in progress, another one should not start.
You can use a ConditionExpression to create the entry only if it doesn't already exist. If you need to check different items, then you can use DynamoDB Transactions to check whether another item already exists and, if not, create your item.
Those would be the ways to do it with DynamoDB, getting a higher consistency.
Another option would be to use SQS FIFO queues. You can group them by the customer ID, then you wouldn't have concurrent processing of messages for the same customer. Additionally with this SQS solution you get all the advantages of using SQS - like automated retry mechanisms or a dead letter queue.
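The ConditionExpression and transaction approach mentioned above could look roughly like the sketch below. It only builds the request for the low-level `transact_write_items` call and rests on several assumptions that are not in the question: the attribute name `CustomerID`, a reserved sort-key value `LOCK` for a per-customer lock item, the `ActiveRequestID` attribute, and the helper name `build_start_report_request` are all hypothetical.

```python
def build_start_report_request(table, customer_id, request_id, started_at):
    """Build a TransactWriteItems request that writes the history record
    only if no lock item exists yet for this customer."""
    lock_sort_key = "LOCK"  # hypothetical reserved sort-key value
    return {
        "TransactItems": [
            {
                # Lock item: the condition fails if an item with this
                # key (customer + LOCK) is already present.
                "Put": {
                    "TableName": table,
                    "Item": {
                        "CustomerID": {"S": customer_id},
                        "ReportGenerationRequestID": {"S": lock_sort_key},
                        "ActiveRequestID": {"S": request_id},
                    },
                    "ConditionExpression": "attribute_not_exists(CustomerID)",
                }
            },
            {
                # History record for this request, written atomically
                # together with the lock item.
                "Put": {
                    "TableName": table,
                    "Item": {
                        "CustomerID": {"S": customer_id},
                        "ReportGenerationRequestID": {"S": request_id},
                        "ExecutionStartTime": {"S": started_at},
                        "ReportExecutionStatus": {"S": "Scheduled"},
                    },
                }
            },
        ]
    }


# With boto3 this would be sent roughly as:
#   client.transact_write_items(**build_start_report_request(...))
# If the lock item already exists, the whole transaction is cancelled
# and neither item is written.
```

Under this design the workflow would have to delete the lock item when it finishes or fails, and the stream handler would have to ignore lock items so that they do not start a workflow themselves.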
Limiting the number of concurrent Lambda executions per customer is not possible as far as I know. Reserved concurrency caps a function as a whole, not per customer, and the whole point of AWS Lambda is to scale easily and run multiple Lambdas concurrently.
That said, there is probably a better solution for your problem using a DynamoDB feature called "Strongly Consistent Reads".
By default, reads from DynamoDB (if you use the AWS SDK) are eventually consistent, which causes the behaviour you observed: two writes are made to the same table, but your Lambda was only able to see one of them.
If you use Strongly consistent reads, the documentation states:
When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful.
So your Lambda needs to do a strongly consistent read on your table to check whether the customer already has a job running. If a job is already running, the Lambda does not create a new one.

Queueing batch translations with laravel best method

I'm looking for some guidance on the best architecture for what I am trying to do. I occasionally get spreadsheets that have a column of data that needs to be translated. There could be anywhere from 200 to 10,000 rows in that column. What I want to do is pull all the rows and add them to a Redis queue. I think Redis will be best because I can throttle the queue, which is necessary as the API I am calling for translation has throttle limits. Once the translation is done, I will put the translations into a new column and return a new spreadsheet with the additional column to the user.
If anyone has ideas for the best setup I am open to them, but I want to stick with Laravel as that is what the application already runs on. I am just not sure whether I should create one queue job that opens the file and does all the translations, or add a queue job for each row of text, or add all of the rows of text to a table in my database and have a task scheduler run every minute, check that table for untranslated rows, and process x of them each time it checks. I'm not sure about a cron job running so frequently when this happens maybe twice a month.
I can see a lot of ways of doing it, but I'm looking for an ideal setup, because what I don't want is to hit the throttle limits, have the process error out, and lose the translations I have already done.
Thanks for any advice

How can I check the status of a queue in Laravel?

I need to submit a number of jobs to a Laravel queue to process some uploaded CSV files. These jobs could be finished in one second if the files are small, or a few seconds if they're bigger, or possibly up to a minute if the CSV files are very big. And I can't tell in advance how big the files will be.
When the user goes to the "results" page, I need to display the results - but only if the queue has finished the jobs. If the queue is still processing, I need to display a "try again later" message.
So - is there a way to check, from a controller, whether the queue has finished?
I'm currently using Laravel 5.1 but would happily upgrade if that helps. And I'm currently using the database queue driver. Ideally I'd love to find a general technique that works for all queue drivers, but if the only way to do it is to check a database table then I guess that's what I have to do.
Thanks!
I know this is a year old, but why not create a new queue per upload, with a unique key based on that request?
$job = (new ProcessCSVJob($data))->onQueue($uniqueQueueName);
You can then simply do a count on the queue name field in the database if you want a DB-only solution.
To work across all queue types, you can use the queue's size method to return the queue size.
$queue = App::make('queue.connection');
$size = $queue->size($uniqueQueueName);
This is in Laravel 5.4. Not sure how backwards compatible this is.
I expect if I was trying to do this today, I'd use a Laravel Job Event to update a status field on the database record, to log when the job has started and when it's finished.
Then I could see whether a record has been fully processed or not by just looking at the status field on the record itself.

Scaling message queues with lots of API calls

I have an application where some of my users' actions must be retrieved via a third-party API.
For example, let's say I have a user who can receive tons of phone calls. The phone call record should be updated often because my user wants to see the call history, so I should do this "almost in real time". The way I managed to do this is to retrieve the list of all my logged-in users every 10 minutes and, for each user, enqueue a task that retrieves the call record list from the timestamp of the latest saved record to the current timestamp and saves all of that to my database.
This doesn't seem to scale well, because the more users I have, the more connected users I'll have and the more tasks I'll enqueue.
Is there any other approach to achieve this?
This seems straightforward with a background queue of jobs. It is unlikely that all users use the system at the same rate, so queue jobs based on their use, with a fallback to daily.
At some point you will likely need more workers taking jobs from the queue, and then multiple queues, so that if you had a thousand users, the ones with a later queue slot are not waiting all the time.
It also depends on how fast you need this updated and on the limit on API calls.
There will be some sort of limit. So I suggest you start by committing to updates with a 4h or 1h delay, which always leaves some slack, and then work on reducing the delay to a level you can sustain.
Make sure your users see your own data and cached API results, not live API call data, in case the API goes away.
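One way to read "queue jobs based on their use, with a fallback to daily" is to let the polling delay grow with how long a user has been idle. The sketch below is an assumption about what such a policy might look like, not the answerer's method: `next_poll_delay` is a hypothetical helper, the 10-minute floor comes from the question, and the one-day ceiling is the daily fallback.

```python
from datetime import datetime, timedelta


def next_poll_delay(last_active_at, now,
                    min_delay=timedelta(minutes=10),
                    max_delay=timedelta(days=1)):
    """Delay before the next API poll for a user: equal to the time the
    user has been idle, clamped between min_delay and max_delay."""
    idle = now - last_active_at
    return max(min_delay, min(idle, max_delay))
```

Recently active users are polled about every 10 minutes, while users who have not been seen for days fall back to one poll per day, so the number of enqueued tasks follows activity rather than the total user count.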
