We are running our website on AWS serverless infrastructure through Laravel Vapor now. It runs on a serverless Aurora database and uses Amazon SQS as the queue driver. Basically the defaults for a Laravel Vapor deploy.
We are currently running into a performance issue, and while trying to resolve it we keep hitting other walls along the way. So here we go.
What we want to do
Our customers can have various types of subscriptions that need to be renewed. These subscriptions come in all kinds of flavours, can have different billing intervals, different currencies and different prices.
On our customer dashboard we want to present them with a nice card informing them about their upcoming renewals, with a good estimate of what the cost will be, so they can make sure they have enough credits.
As our subscriptions come in different flavours, we have one table that holds the basic subscription details, such as net_renewal_date. It also references a subscriptionable_id and subscriptionable_type, which form a morph relationship to the different types of models that a user can have a subscription for.
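In Eloquent terms the subscription model looks roughly like this (a simplified sketch, not our exact class):

class Subscription extends Model
{
    protected $casts = [
        'net_renewal_date' => 'datetime',
    ];

    // Morph relationship to the different types of models a user can subscribe to
    public function subscriptionable()
    {
        return $this->morphTo();
    }
}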
Our first attempt
In our first attempt, we basically added a REST endpoint which fetched the upcoming renewal forecast.
It basically took all Subscriptions that were up for renewal in the coming 30 days. For each of those items, we would calculate the current price in its currency, and add tax calculations.
That was then returned as a collection that we further used to:
1/ calculate the total per currency
2/ filter that collection for items within the next 14 days, and calculate the same total per currency.
We would then basically just convert the different amounts to our base currency (EUR) and return the sum thereof.
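A simplified sketch of what that endpoint did (model and helper names here are placeholders, not our actual code):

public function upcomingRenewals(Request $request)
{
    // All subscriptions up for renewal in the next 30 days
    $items = Subscription::where('user_id', $request->user()->id)
        ->whereBetween('net_renewal_date', [now(), now()->addDays(30)])
        ->get()
        ->map(function (Subscription $subscription) {
            $net = $subscription->currentPrice();   // price in the subscription's own currency
            $tax = $subscription->taxFor($net);     // tax calculation

            return [
                'renewal_date' => $subscription->net_renewal_date,
                'currency'     => $subscription->currency,
                'amount'       => $net + $tax,
            ];
        });

    // 1/ total per currency over the next 30 days
    $totalPerCurrency = $items->groupBy('currency')
        ->map(fn ($group) => $group->sum('amount'));

    // 2/ the same total, limited to the next 14 days
    $totalNext14Days = $items
        ->filter(fn ($item) => $item['renewal_date']->lte(now()->addDays(14)))
        ->groupBy('currency')
        ->map(fn ($group) => $group->sum('amount'));

    // Convert each currency total to EUR and sum for the headline figure
    $totalInEur = $totalPerCurrency
        ->map(fn ($amount, $currency) => ExchangeRate::convertToEur($amount, $currency))
        ->sum();

    return response()->json(compact('totalPerCurrency', 'totalNext14Days', 'totalInEur'));
}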
This worked great for the vast majority of our clients. Even customers with 100 subscriptions were no issue at all.
But then we migrated one of our larger customers to the new platform. He basically had 1500+ subscriptions renewing in the upcoming 30 days, so that didn't go well...
Our second attempt
Because going through the above code simply doesn't finish in an acceptable amount of time, we decided we had to move the simulation calculation into a separate job.
We added an attribute to the subscriptions table and called it 'simulated_renewal_amount'.
This job would need to run every time:
- a price changes
- the customer's discount changes (based on their loyalty, we provide separate prices)
- the exchange rates change.
So the idea was to listen for any of these changes, and then dispatch a job to recalculate the simulated amount for each of the involved subscriptions. This however means that a change in an exchange rate, for instance, can easily trigger 10,000 jobs to be processed.
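As a rough sketch, the exchange-rate listener looks something like this (event, listener and job names are placeholders):

class DispatchRenewalSimulations
{
    public function handle(ExchangeRateUpdated $event): void
    {
        // One job per affected subscription - this is the fan-out that can
        // easily produce 10,000+ queued jobs from a single rate change.
        Subscription::where('currency', $event->currency)
            ->pluck('id')
            ->each(fn ($id) => RecalculateSimulatedRenewalAmount::dispatch($id));
    }
}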
And this is where it becomes tricky.
Even though a single job takes less than 1200 ms in most cases, dispatching a lot of jobs that all need to do the same calculations for a set of subscriptions causes jobs to run 60+ seconds, at which point they are aborted.
What is the best practice to set up such a queued job? Should I just create one job instead and process the subscriptions sequentially (roughly sketched below)?
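By 'one job' I mean something along these lines (class and method names are placeholders):

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;

class RecalculateSimulatedRenewalAmounts implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public function handle(): void
    {
        // Walk the affected subscriptions in chunks instead of dispatching
        // thousands of individual jobs.
        Subscription::query()
            ->chunkById(500, function ($subscriptions) {
                foreach ($subscriptions as $subscription) {
                    $subscription->update([
                        'simulated_renewal_amount' => $subscription->simulateRenewalAmount(),
                    ]);
                }
            });
    }
}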
Any insights on how we can best set this up would be very welcome. We've played around with it a lot, and it always seems to end up with the same kind of issues.
FYI - we host the site on Laravel Vapor, so serverless on AWS infrastructure with an Aurora database.
We have the same issue. Vapor supports multiple queues but it does not allow you to set job concurrency on a per-queue basis, so it's not very configurable for drip-feeding lots of jobs. We have solved this by making a seeder job that pulls serialized jobs out of an "instant jobs" table. We also added a sleep loop to allow granular processing throughout the whole minute (a new seeder job is scheduled each minute).
public function handle()
{
    $killAt = Carbon::now()->addSeconds(50);

    do {
        InstantJob::orderBy('id')->cursor()->each(function (InstantJob $job) {
            // Returning true simply continues with the next stored job,
            // leaving this one in the table until its throttle allows it.
            if ($job->isThrottled()) {
                return true;
            }

            $job->dispatch();
        });

        sleep(5);
    } while (Carbon::now()->lessThan($killAt));
}
The throttle, if you are interested, works off a throttle key (job group/name, etc.) and looks like:
public function isThrottled(): bool
{
    $throttled = false; // default to "not throttled" if the limiter grants a slot

    Redis::connection('cache')
        ->throttle($this->throttle_key)
        ->block(0)
        ->allow(10) // jobs
        ->every(5)  // seconds
        ->then(function () use (&$throttled) {
            $throttled = false;
        }, function () use (&$throttled) {
            $throttled = true;
        });

    return $throttled;
}
This actually solves our problem of drip-feeding jobs onto the queue without starting them all at once.
One question for you... We are currently using a small RDS instance and we get a lot of issues with too many concurrent connections. Do you see this issue with serverless DBs? Do they scale fast enough to ensure no dropouts?
Related
I have an application with 10M users. The application has access to the users' Google Health data. I want to periodically read/refresh users' data using Google APIs.
The challenge that I'm facing is that this is a memory-intensive task. Since Google does not provide any callback for new data, I'll be doing a background sync (every 30 mins). All users would be picked and added to a queue, which would then be processed sequentially (depending upon the number of worker nodes).
Now for 10M users being refreshed every 30 mins, I need a lot of worker nodes.
Each user request takes around 1 sec, including network calls.
In 30 mins, one node can process 30 × 60 = 1800 users.
To process 10M users, I need 10M / 1800 ≈ 5.5K nodes.
Quite expensive, both monetarily and operationally.
Then I thought of using Lambdas. However, Lambda requires a NAT with an internet gateway to access the public internet. Relatively, it's very cheap.
I want to understand if there's any other possible solution at this scale?
Without knowing more about your architecture and the Google APIs, it is difficult to make a recommendation.
Firstly, I would see if Google offers bulk export functionality, then batch up the user requests. So instead of making 1 request per user, you could make, say, 1 request for 100k users. This would reduce the overhead associated with connecting and processing/parsing of the message metadata.
Secondly, I'd look to see if I could reduce the processing time; for example, an interpreted language like Python is in a lot of cases much slower than a compiled language like C# or Go. Or maybe a library or algorithm can be replaced with something more optimal.
Without more details of your specific setup, it's hard to offer more specific advice.
Users are able to set up a marketing email send time within my app for arbitrary dates as they need them. It is crucial that the emails start to go out exactly when they are scheduled. So, the app needs to create something that fires one time, for a specific group of emails, at a specific date and time, down to the minute. There will be many more dates for other sends in the future, but they are all distinct (so I need something other than 'run every xxx date').
I could have a Scheduler task that runs every minute, looks at the dates of any pending sends in the database, and hands off to the command that sends those that are due. But I'm running a multi-tenanted app -- the sends will likely not overlap, but it seems like a huge hit to query multiple databases every minute, for every tenant, searching through potentially thousands of records per tenant.
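Roughly what I mean by that per-minute option (table, model and command names are just placeholders):

// In app/Console/Kernel.php
protected function schedule(Schedule $schedule)
{
    $schedule->call(function () {
        // Look at each tenant's pending sends and fire the sending command for any that are due
        PendingSend::where('send_at', '<=', now())
            ->where('sent', false)
            ->each(function (PendingSend $send) {
                Artisan::call('emails:send-group', ['group' => $send->group_id]);
                $send->update(['sent' => true]);
            });
    })->everyMinute();
}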
What I need (I think) is a way to programmatically set a schedule for each send date, and then remove it. I liked this answer, or perhaps using ->between($startMin, $endMin) without the 'every xxx' instruction, or even using the cron function within the Scheduler to set up one single time for each group that needs to be sent. I'm not sure I'm using this the way it was intended, though?
Would this even work? I programmatically created a schedule from within a test method, and it was added to the schedule (based on a dump of the $schedule I created, showing all schedules) - but it did not show up via this method, found in this answer:
$schedule = app()->make(\Illuminate\Console\Scheduling\Schedule::class);

$events = collect($schedule->events())->filter(function (\Illuminate\Console\Scheduling\Event $event) {
    // stripos() returns false when there is no match, so compare explicitly
    return stripos($event->command, 'YourCommandHere') !== false;
});
It also did not output anything, so I'm wondering if programmatically creating a schedule outside of Kernel.php is not the way to go.
Is there another way? I don't know the Scheduler well enough to know whether these one-off schedules permanently remain somewhere, whether they are deleted after their intended single use, whether they take up memory, etc.
Really appreciate any help or guidance.
I want to send emails to various users based on the schedules they have set.
I read about beanstalkd, queues and Delayed Message Queueing, and for now this looks like a good fit:
$when = Carbon::now()->addMinutes($minutes); // I can calculate the minutes at this moment
\Mail::to($user)->later($when, new \App\Mail\TestMail);
But I'm not quite sure about a few things:
A user can cancel a future schedule. In that case, how do I cancel an email that's supposed to be sent in the future? Can I set a condition somewhere that gets checked before sending the actual email? I tried returning false in the handle method of \App\Mail\TestMail and it started throwing an error.
Am I using the right approach? I also read about the Scheduler, but I don't get how I would cancel future emails (if they need to be cancelled).
There are many ways to approach this. Personally I would queue the emails on a schedule rather than adding them to the queue for later.
So you run a scheduled task once a day (or hour, or minute) which runs a query to select which users require an email, then using that result set, you add a job to the queue for each result.
This way, if a user unsubscribes, you don't have to worry about removing already queued jobs.
Laravel offers quite a nice interface for creating scheduled jobs (https://laravel.com/docs/5.4/scheduling) which can then be called via a cronjob.
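A rough sketch of that approach, assuming a hypothetical SendScheduledEmail job and whatever query selects the users who are due an email:

// In app/Console/Kernel.php
protected function schedule(Schedule $schedule)
{
    $schedule->call(function () {
        User::where('email_scheduled_at', '<=', now())
            ->where('unsubscribed', false) // users who cancelled are simply never queued
            ->each(function (User $user) {
                dispatch(new SendScheduledEmail($user));
            });
    })->everyMinute();
}

Because the job is only created at send time, cancelling a schedule is just a matter of updating the database row; there is nothing sitting on the queue to remove.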
We have a Node.js application running LoopBack, the main purpose of which is to process orders received from the client. Currently the entire order process is handled during the single HTTP request that places the order, including the payment, insertion into the database, sending confirmation emails, etc.
We are finding that this method, whilst working at the moment, lacks scalability - the application is going to need to process, potentially, thousands of orders per minute as it grows. In addition, our order process currently writes data to our own database; however, we are now looking at third-party integrations (till systems) over whose speed and availability we have no control.
In addition, we also currently have a potential race condition; we have to assign a 'short code' to each order for easy reference by the client - these need to rotate, so if the starting number is 1 and the maximum is 100, the 101st order must be assigned the number 1. At the moment we are looking at the previous order and either incrementing the previous reference by 1 or setting it back to the start - obviously this is fine at the moment due to the low traffic - however as we scale this could result in multiple orders being assigned the same reference number.
Therefore, we want to implement a queue to manage all of this. Our app is currently deployed on Heroku, where we already use a worker process for some of the monthly number crunching our app requires. Whilst having read some of the Heroku articles on implementing a queue (https://devcenter.heroku.com/articles/asynchronous-web-worker-model-using-rabbitmq-in-node, https://devcenter.heroku.com/articles/background-jobs-queueing) it is not clear how, over multiple worker dynos, we would ensure the order in which these queued items are processed and that the same job is not processed more than once by multiple dynos. The order of processing is not so important, however the lack of repetition is extremely important as if two orders are processed concurrently we run the risk of the above race condition.
So essentially my question is this; how do we avoid the same queue job being processed more than once when scaled across multiple dynos on Heroku?
What you need is already provided by RabbitMQ, the message broker used by the CloudAMQP add-on of Heroku.
You don't need to worry about the race condition of multiple workers. A job placed onto the queue is stored until a consumer retrieves it. When a worker consumes a job from the queue, no other workers will be able to consume it.
RabbitMQ manages all such aspects of the message queuing paradigm.
A couple of links useful for your project:
What is RabbitMQ?
Getting started with RabbitMQ and Node.js
What is the best-practice solution for programmatically changing the XML file where the number of instances is defined? I know that this is somehow possible with csmanage.exe for the Windows Azure API.
How can I measure which Worker Role VMs are actually working? I asked this question on the MSDN Community forums as well: http://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/02ae7321-11df-45a7-95d1-bfea402c5db1
To modify the configuration, you might want to look at the PowerShell Azure Cmdlets. This really simplifies the task. For instance, here's a PowerShell snippet to increase the instance count of 'WebRole1' in Production by 1:
$cert = Get-Item cert:\CurrentUser\My\<YourCertThumbprint>
$sub = "<YourAzureSubscriptionId>"
$servicename = '<YourAzureServiceName>'
Get-HostedService $servicename -Certificate $cert -SubscriptionId $sub |
Get-Deployment -Slot Production |
Set-DeploymentConfiguration {$_.RolesConfiguration["WebRole1"].InstanceCount += 1}
Now, as far as actually monitoring system load and throughput: You'll need a combination of Azure API calls and performance counter data. For instance: you can request the number of messages currently in an Azure Queue:
http://yourstorageaccount.queue.core.windows.net/myqueue?comp=metadata
You can also set up your role to capture specific performance counters. For example:
public override bool OnStart()
{
    var diagObj = DiagnosticMonitor.GetDefaultInitialConfiguration();

    // AddPerfCounter is a small helper that adds a performance counter
    // specifier with the given sample rate (in seconds) to the configuration.
    AddPerfCounter(diagObj, @"\Processor(*)\% Processor Time", 60.0);
    AddPerfCounter(diagObj, @"\ASP.NET Applications(*)\Request Execution Time", 60.0);
    AddPerfCounter(diagObj, @"\ASP.NET Applications(*)\Requests Executing", 60.0);
    AddPerfCounter(diagObj, @"\ASP.NET Applications(*)\Requests/Sec", 60.0);

    // Set the service to transfer logs every minute to the storage account
    diagObj.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1.0);

    // Start the Diagnostics Monitor with the new storage account configuration
    DiagnosticMonitor.Start("DiagnosticsConnectionString", diagObj);

    return base.OnStart();
}
So this code captures a few performance counters into local storage on each role instance, then every minute those values are transferred to table storage.
The trick, now, is to retrieve those values, parse them, evaluate them, and then tweak your role instances accordingly. The Azure API will let you easily pull the perf counters from table storage. However, parsing and evaluating will take some time to build out.
Which leads me to my suggestion that you look at the Azure Dynamic Scaling Example on the MSDN code site. This is a great sample that provides:
A demo line-of-business app hosting a WCF service
A load-generation tool that pushes messages to the service at a rate you specify
A load-monitoring web UI
A scaling engine that can either be run locally or in an Azure role.
It's that last item you want to take a careful look at. It compares your performance counter data, as well as queue-length data, against configured thresholds. Based on those comparisons, it then scales your instances up or down accordingly.
Even if you end up not using this engine, you can see how data is grabbed from table storage, massaged, and used for driving instance changes.
Quantifying the load is actually very application-specific - particularly when thinking through the Worker Roles. For example, if you are doing a large parallel-processing application, the expected/hoped-for behavior would be 100% CPU utilization across the board, and the 'scale decision' may be based on whether or not the work queue is growing or shrinking.
Further complicating the decision is the lag time for the various steps - increasing the Role Instance Count, joining the Load Balancer, and/or dropping from the load balancer. It is very easy to get into a situation where you are "chasing" the curve, constantly churning up and down.
As to your specific question about specific VMs, since all VMs in a Role definition are identical, measuring a single VM (unless the deployment starts with VM count 1) should not really tell you much - all VMs are sitting behind a load balancer and/or are pulling from the same queue. Any variance should be transitory.
My recommendation would be to avoid monitoring anything that is inherently highly variable (e.g. CPU). Generally, you want to find a trending point - for web apps it may be the response queue, for parallel apps it may be Azure queue depth, etc. - but in either case you should watch the trend and not the absolute number. I would also suggest measuring at fairly broad intervals - minutes, not seconds. If you have a load you need to respond to within seconds, then realistically you will need to increase your running instance count ahead of time.
With regard to your first question, you can also use the Autoscaling Application Block to dynamically change instance counts based on a set of predefined rules.