Scaling message queues with lots of API calls - performance

I have an application where some of my users' actions must be retrieved via a 3rd-party API.
For example, let's say I have a user who can receive tons of phone calls. The call records should be updated often because my users want to see their call history, so I should do this "almost in real time". The way I manage this is to retrieve, every 10 minutes, the list of all my logged-in users and, for each user, enqueue a task that retrieves the call records from the timestamp of the latest saved record to the current timestamp and saves them all to my database.
This doesn't seem to scale well: the more users I have, the more connected users there will be, and the more tasks I'll enqueue.
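For reference, the current flow looks roughly like this (just a sketch; get_logged_in_users, enqueue, db and api_client are placeholders for my real session store, queue, database and 3rd-party client):

```python
import time
from datetime import datetime, timezone

POLL_INTERVAL = 10 * 60  # every 10 minutes

def schedule_call_sync(get_logged_in_users, enqueue):
    """Producer: enqueue one sync task per logged-in user, every 10 minutes."""
    while True:
        for user in get_logged_in_users():
            enqueue("sync_calls", user_id=user.id)
        time.sleep(POLL_INTERVAL)

def sync_calls(user_id, db, api_client):
    """Worker task: fetch call records since the latest one we have saved."""
    since = db.latest_call_timestamp(user_id)
    until = datetime.now(timezone.utc)
    for record in api_client.list_calls(user_id, since, until):
        db.save_call(user_id, record)
```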
Is there any other approach to achieve this?

This seems straightforward with a background queue of jobs. It is unlikely that all users use the system at the same rate, so queue jobs based on their usage, with a fallback to daily.
You will likely need more workers taking jobs from the queue at some point, and then multiple queues, so that if you had a thousand users the ones with a later queue slot are not always waiting.
It also depends on how fast you need this updated and on the API's call limits.
There will be some sort of limit, so I suggest you start by committing to updates with a 4-hour or 1-hour delay to give yourself some headroom, then work on improving this to a sustainable level.
Make sure your users are seeing your cached API data, not live API calls, in case the API goes away.
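As a sketch of what "queue jobs based on usage, with a fallback to daily" could look like (assuming the users table gains next_sync_at / sync_interval columns; every helper name here is a placeholder):

```python
from datetime import datetime, timedelta, timezone

MIN_INTERVAL = timedelta(minutes=10)
MAX_INTERVAL = timedelta(days=1)   # the daily fallback

def next_interval(new_records_found, current_interval):
    """Sync busy users more often; back idle users off toward once a day."""
    if new_records_found > 0:
        return max(MIN_INTERVAL, current_interval / 2)
    return min(MAX_INTERVAL, current_interval * 2)

def enqueue_due_users(db, enqueue, now=None, batch_size=500):
    """Run from a scheduler; only users whose slot has come up get a job."""
    now = now or datetime.now(timezone.utc)
    for user in db.users_due_for_sync(now, limit=batch_size):
        enqueue("sync_calls", user_id=user.id)
        db.set_next_sync(
            user.id,
            now + next_interval(user.new_records_last_sync, user.sync_interval),
        )
```

This way the number of API calls tracks how active your users actually are, not just how many are logged in.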

Related

How to process a logic or job periodically for all users in a large scale?

I have a large set of users in my project, around 50 million.
I need to create a playlist for each user every day. To do this, I'm currently using this method:
I have a column in my users table that holds the last time a playlist was created for that user; I call it last_playlist_created_at.
I run a query on the users table that selects the top 1000 users whose last_playlist_created_at is more than one day old, sorted in ascending order by last_playlist_created_at.
After that, I loop over the result and publish a message for each user to my message broker.
Behind the message broker, I start around 64 workers to process the messages (create a playlist for the user) and update last_playlist_created_at in the users table.
When the message broker's queue is empty, I repeat these steps (in a while/do-while loop).
I think the processing side is good enough and can scale as well,
but the method we use to create the message for each user is not scalable!
How should I dispatch such a large set of messages, one for each of my users?
OK, so my answer is based entirely on your comment where you mentioned that you use while(true) to check whether playlists need to be updated, which does not seem so trivial.
Although this is a design question and there are multiple solutions, here's how I would solve it.
First up, think of updating the playlist for a user as a job.
Now, in your case this is a scheduled job, i.e. once a day.
So, use a scheduler to schedule the next job time.
Write a scheduled job handler to push this to a message queue. This part exists to handle multiple jobs at the same time, and it is where you can control the flow.
Generate the playlist for the user based on the job, then create a scheduled event for the next day, as in the sketch below.
You could persist the scheduled job data just to avoid race conditions.
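A minimal sketch of that flow, assuming a broker that supports delayed delivery (a scheduler table or a Redis sorted set would work just as well); queue, db and build_playlist are placeholders:

```python
from datetime import datetime, timedelta, timezone

ONE_DAY = timedelta(days=1)

def handle_create_playlist(message, db, queue, build_playlist):
    user_id = message["user_id"]
    playlist = build_playlist(user_id)
    db.save_playlist(user_id, playlist)
    db.set_last_playlist_created_at(user_id, datetime.now(timezone.utc))
    # Schedule this user's next run instead of re-scanning the users table.
    queue.publish_delayed("create_playlist", {"user_id": user_id}, delay=ONE_DAY)
```

The important part is that each finished job schedules that user's next run, so you never have to scan 50 million rows to find who is due.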

Queueing batch translations with Laravel: best method

Looking for some guidance on the best architecture to accomplish what I am trying to do. I occasionally get spreadsheets that have a column of data that needs to be translated. There could be anywhere from 200 to 10,000 rows in that column. What I want to do is pull all rows and add them to a Redis queue. I am thinking Redis will be best since I can throttle the queue, which is necessary because the API I am calling for translation has rate limits. Once the translation is done I will put the translations into a new column and return to the user a new spreadsheet with the additional column.
If anyone has ideas for the best setup I am open, but I want to stick with Laravel as that is what the application is already running. I am just not sure whether I should create one queue job whose process opens the file and does all the translations, or add a queued job for each row of text, or lastly add all of the rows of text to a table in my database and have a task scheduler run every minute to check that table for untranslated rows and process a fixed number each time it checks. I'm not sure about a cron job running that frequently when this only happens maybe twice a month.
I can see a lot of ways of doing it, but I'm looking for an ideal setup, because what I don't want is to hit rate limits and lose translations I have already done because something errors out.
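To make the chunked option concrete, here's roughly the shape I have in mind (a Python sketch only; in Laravel these would be queued job classes, and translate, db, enqueue and RateLimitError stand in for the real API client, models and exception):

```python
import time

CHUNK_SIZE = 50

class RateLimitError(Exception):
    """Placeholder for whatever the translation client raises when throttled."""

def enqueue_spreadsheet(rows, enqueue):
    """Split the column into small jobs so one failure can't lose everything."""
    for start in range(0, len(rows), CHUNK_SIZE):
        enqueue("translate_chunk", rows=rows[start:start + CHUNK_SIZE])

def translate_chunk(rows, db, translate, max_retries=5):
    for row in rows:
        if db.already_translated(row.id):      # persisted results make retries safe
            continue
        for attempt in range(max_retries):
            try:
                db.save_translation(row.id, translate(row.text))
                break
            except RateLimitError:
                time.sleep(2 ** attempt)       # back off instead of losing work
```

Saving each translation as soon as it completes is what would keep a throttle error from costing me work that's already done.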
Thanks for any advice

Scalable and efficient location updates in Laravel

For a delivery-service application based on Laravel, I want to keep the customer updated on the driver's current location. For this purpose, I have lat and long columns in my orders table. The driver has the website open and posts his HTML5 geolocation to the server every 30 seconds or so. The row gets updated with the new position, and here comes the question.
Will it be more efficient to
- have an Ajax request from the customer's client every 30 seconds that searches the current orders by customer ID and retrieves the current location to update the map,
or to
- create a private channel with Pusher, subscribe to it from the customer's client, and fire locationUpdated events whenever the driver submits his location?
My inclination is to use Pusher, so that I don't have to run two queries (update and retrieve) for each location update, periodically and for possibly hundreds of users at the same time.
The disadvantage I expect to cause trouble is the number of channels the server has to maintain to make sure every client has access to updated information.
Unfortunately, I have no clue which of the two puts more load on the server. Any argument for why either solution is better than the other, or any further improvement, is welcome.
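For reference, the Pusher option I'm picturing looks roughly like this (shown with the Pusher Python server SDK just to sketch the shape; in Laravel it would be a broadcast event on a private channel, and db stands in for the order model):

```python
import pusher

pusher_client = pusher.Pusher(
    app_id="APP_ID", key="KEY", secret="SECRET", cluster="CLUSTER"
)

def handle_driver_position(order_id, lat, lng, db):
    """Called roughly every 30 seconds when the driver posts a position."""
    db.update_order_position(order_id, lat, lng)   # keep the row for late joiners
    # Push to the order's private channel so subscribed customers update instantly.
    pusher_client.trigger(
        f"private-order-{order_id}",
        "location-updated",
        {"lat": lat, "lng": lng},
    )
```

Keeping the database write means a customer who opens the page late can still fetch the last known position with a single read before live events start arriving.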

RethinkDB changefeeds performance: architectural advice?

I am building an application with RethinkDB and I'm about to switch to using changefeeds. But I'm facing an architectural choice and I'd like to get some advice.
My application currently loads all user data from several tables on user login (sending all of it to the frontend), and then processes requests from the frontend, altering the database, and preparing and sending changed items to users. I'd like to switch that over to changefeeds. The way I see it, I have two choices:
Set up a single changefeed for each table. Filter by users logged in to a particular server, and distribute the changes to users manually. These changefeeds are never closed, i.e. they have the lifetime of my servers.
When a user logs in, set up an individual changefeed for that user, for that user's data only (using a getAll with a secondary index). Maintain as many changefeeds as there are currently logged in users. Close them when users log out.
Solution #1 has a big disadvantage: RethinkDB changefeeds do not have a concept of time (or version number), like for example Kafka does. This means that there is no way to (a) load initial data and then (b) get the changes that have happened since that initial load. There is a time window where changes can be lost: between the initial data load (a) and the moment the changefeed is set up (b). I find this worrying.
Solution #2 seems better, because includeInitial can be used to get initial data, and then get subsequent changes without interruption. I'd have to deal with initial load performance (it's faster to load a single dump of all data than process thousands of updates), but it seems more "correct". But what about scaling? I'm planning to handle up to 1k users per server — is RethinkDB prepared to handle thousands of changefeeds, each being essentially a getAll query? The actual activity in these changefeeds will be very low, it's just the number that I'm worried about.
The RethinkDB manual is a bit terse about changefeed scaling, saying that:
Changefeeds perform well as they scale, although they create extra intracluster messages in proportion to the number of servers with open feed connections on each write.
Solution #2 creates many more feeds, but the number of servers with open feed connections is actually the same for both solutions. And "changefeeds perform well as they scale" isn't quite enough to go on :-)
I'd also be interested to know what are recommended practices for handling server restarts/upgrades and disconnections. The way I see it, if anything happens to RethinkDB, clients have to perform a full data load (using includeInitial) after reconnecting, because there is no way to know what changes have been lost during downtime. Is that what people do?
RethinkDB should be able to handle thousands of changefeeds just fine if it's on reasonable hardware. One thing some people do to lower network load in that case is to put a proxy node on the same machine as their app server and connect to that, since the proxy node knows enough to deduplicate the changefeed messages coming in over the network, and it takes a lot of CPU/memory load off the main cluster.
Currently the only way to recover from a crash is to restart the changefeed using includeInitial. There are plans to add write timestamps in the future, but handling deletes is complicated in that case.
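For what it's worth, solution #2 with the RethinkDB Python driver is roughly this (table and index names are assumptions; your schema will differ):

```python
from rethinkdb import RethinkDB

r = RethinkDB()

def stream_user_data(user_id, send_to_client):
    conn = r.connect(host="localhost", port=28015)
    feed = (
        r.table("user_data")
         .get_all(user_id, index="user_id")
         .changes(include_initial=True)   # current docs first, then live changes
         .run(conn)
    )
    try:
        for change in feed:
            # Initial documents arrive as {"new_val": ...} without an "old_val".
            send_to_client(change)
    finally:
        conn.close()   # closing the connection closes the feed on logout
```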

How do you react to the absence of an event in a distributed system?

I have a system that collects session data. A session consists of a number of distinct events, for example "session started" and "action X performed". There is no way to determine when a session ends, so instead heartbeat events are sent at regular intervals.
This is the main complication: without a way to determine if a session has ended the only way is to try to react to the absence of an event, i.e. no more heartbeats. How can I do this efficiently and correctly in a distributed system?
Here is some more background to the problem:
The events must then be assembled into objects representing sessions. The session objects are later updated with additional data from other systems, and eventually they are used to calculate things like the number of sessions, average session length, etc.
The system must scale horizontally, so there are multiple servers that receive the events, and multiple servers that process them. Events belonging to the same session can be sent to and processed by different servers. This means that there's no guarantee that they will be processed in order, and there are additional complications that mean events can be duplicated (and there's always the risk that some are lost, either before they reach our servers or while being processed).
Most of this exists already, but I have no good solution for how to efficiently and correctly determine when a session has ended. The way I do it now is to periodically search through the collection of "incomplete" session objects looking for any that have not been updated in an amount of time equal to two heartbeats, and move these to another collection of "complete" sessions. This operation is time-consuming and inefficient, and it doesn't scale well horizontally. Basically it consists of sorting a table on a column representing the last timestamp and filtering out any rows that aren't old enough. It sounds simple, but it's hard to parallelize: if you do it too often you won't be doing anything else because the database will be busy filtering your data, and if you don't do it often enough each run will be slow because there's too much to process.
I'd like to react to when a session has not been updated for a while, not poll every session to see if it's been updated.
Update: Just to give you a sense of scale: there are hundreds of thousands of sessions active at any time, and eventually there will be millions.
One possibility that comes to mind:
In your database table that keeps track of sessions, add a timestamp field (if you don't have one already) that records the last time the session was "active". Update the timestamp whenever you get a heartbeat.
When you create a session, schedule a "timer event" to fire after some suitable delay to check whether the session should be expired. When the timer event fires, check the session's timestamp to see if there's been more activity during the interval that the timer was waiting. If so, the session is still active, so schedule another timer event to check again later. If not, the session has timed out, so remove it.
If you use this approach, each session will always have one server responsible for checking whether it's expired, but different servers can be responsible for different sessions, so the workload can be spread around evenly. When a heartbeat comes in, it doesn't matter which server handles it, because it just updates a timestamp in a database that's (presumably) shared between all the servers.
There's still some polling involved since you'll get periodic timer events that make you check whether a session is expired even when it hasn't expired. That could be avoided if you could just cancel the pending timer event each time a heartbeat arrives, but with multiple servers that's tricky: the server that handles the heartbeat may not be the same one that has the timer scheduled. At any rate, the database query involved is lightweight: just looking up one row (the session record) by its primary key, with no sorting or inequality comparisons.
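A minimal sketch of that idea, assuming a shared store with a last_seen timestamp per session and some delayed-task facility (schedule_in below is a placeholder for whatever delay queue or timer service you use):

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=30)
GRACE = 2 * HEARTBEAT_INTERVAL            # two missed heartbeats = expired

def on_heartbeat(session_id, store):
    # Any server can handle this; it only touches the shared timestamp.
    store.set_last_seen(session_id, datetime.now(timezone.utc))

def on_expiry_timer(session_id, store, schedule_in):
    last_seen = store.get_last_seen(session_id)
    if last_seen is None:
        return                              # session already expired/removed
    now = datetime.now(timezone.utc)
    if now - last_seen < GRACE:
        # Still active: check again once the grace period could next elapse.
        delay = (last_seen + GRACE) - now
        schedule_in(delay, lambda: on_expiry_timer(session_id, store, schedule_in))
    else:
        store.mark_session_complete(session_id)   # move to the "complete" set
```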
So you're collecting heartbeats; I'm wondering if you could have a batch process (or something) that ran across the collected heartbeats looking for patterns that implied the end of a session.
The level of accuracy is governed by how regular the heartbeats are and how often you scan across the collected heartbeats.
The advantage is that you're processing all heartbeats through a single mechanism (in one spot; you don't have to poll each heartbeat on its own), so that should be able to scale. If it were a database-centric solution, it should be able to cope with lots of data, right?
There might be a more elegant solution, but my brain's a bit full just now :)
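For what it's worth, the scan could be as simple as one grouped query over the heartbeat table (a sketch assuming (session_id, ts) columns with ts stored as ISO-8601 strings; sqlite3 is used only to keep it runnable):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=30)

def find_ended_sessions(conn: sqlite3.Connection):
    """Return sessions whose newest heartbeat is older than two intervals."""
    cutoff = datetime.now(timezone.utc) - 2 * HEARTBEAT_INTERVAL
    rows = conn.execute(
        """
        SELECT session_id, MAX(ts) AS last_heartbeat
        FROM heartbeats
        GROUP BY session_id
        HAVING MAX(ts) < ?
        """,
        (cutoff.isoformat(),),
    )
    return [row[0] for row in rows]
```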
