Efficient Job Scheduling Algorithm - algorithm

I host a video encoding service. My clients submit some videos to encode. Presently, I encode them in just a FIFO manner. I would like to change it so that every client gets a fare share of my service.
I would like to consider the following factors:
1. FIFO - First to submit the job has higher priority.
2. If the client submits a large number of videos, I would like to take his priority down.
I have control over the priority of the clients by setting an attribute from the database, but not over each video.
How can I have more control over all the videos and schedule them efficiently?
PS: I can redesign the database if needed.

Here's an approach that's fair to clients: Instead of queueing jobs, queue the clients who submit jobs into a FIFO. As each client reaches the front of the queue, encode their first job, and if that client has more jobs, put that client back in the queue at the end.

You could compute a cost for encoding (based on size and/or quality), and divide it by the age of the request (curent time minus time the request was issued plus some damping constant).
You would then use this score to queue the request in increasing order.
A client queing big requests would have to wait for smaller requests to be processed, but would eventually get a score small enough to be scheduled.

Related

Advice on pubsub topic division based on geohashes for ably websocket connection service

My question concerns the following use case:
Use case actors
User A: The user who sets a broadcast region and views stream with live posts.
User B: The first user who sends a broadcast message from within the broadcast region set by user A.
User C: The second user who sends a broadcast message from within the broadcast region set by user A.
Use case description
User A selects a broadcast region within which boundaries (radius) (s)he wants to receive live broadcast messages.
User A opens the livefeed and requests an initial set of livefeed items.
User B broadcasts a message from within the broadcast region of user A while user A’s livefeed is still open.
A label with 1 new livefeed item appears at the top of User A’s livefeed while it is open.
As user C publishes another livefeed post from within the selected broadcast region from user A, the label counter increments.
User A receives a notification similar to this example of Facebook:
The solution I thought to apply (and which I think Pubnub uses), is to create a topic per geohash.
In my case that would mean that for every user who broadcasted a message, it needs to be published to the geohash-topic, and clients (app / website users) would consume the geohash-topic through a websocket if it fell within the range of the defined area (radius). Ably seems to provide this kind of scalable service using web sockets.
I guess it would simplified be something like this:
So this means that a geohash needs to be extracted from the current location from where the broadcast message is sent. This geohash should have granular scale that is small enough so that the receiving user can set a broadcast region that is more or less accurate. (I.e. the geohash should have enough accuracy if we want to allow users to define a broadcast region within which to receive live messages, which means that one should expect a quite large amount of topics if we decided to scale).
Option 2 would be to create topics for a geohash that has a less specific granularity (covering a larger area), and let clients handle the accuracy based on latlng values that are sent along with the message.
The client would then decide whether or not to drop messages. However, this means more messages are sent (more overhead), and a higher cost.
I don't have experience with this kind of architecture, and question the viability / scalability of this approach.
Could you think of an alternate solution to this question to achieve the desired result or provide more insight on how to solve this kind of problem overall? (I also considered using regular req-res flow, but this means spamming the server, which also doesn't seem like a very good solution).
I actually checked.
Given a region of 161.4 km² (like region Brussels), the division of geohashes by length of the string is as follows:
1 ≤ 5,000km × 5,000km
2 ≤ 1,250km × 625km
3 ≤ 156km × 156km
4 ≤ 39.1km × 19.5km
5 ≤ 4.89km × 4.89km
6 ≤ 1.22km × 0.61km
7 ≤ 153m × 153m
8 ≤ 38.2m × 19.1m
9 ≤ 4.77m × 4.77m
10 ≤ 1.19m × 0.596m
11 ≤ 149mm × 149mm
12 ≤ 37.2mm × 18.6mm
Given that we would allow users to have a possible inaccuracy up to 153m (on the region to which users may want to subscribe to receive local broadcast messages), it would require an amount of topics that is definitely already too large to even only cover the entire region of Brussels.
So I'm still a bit stuck at this level currently.
1. PubNub
PubNub is currently the only service that offers an out of the box geohash pub-sub solution over websockets, but their pricing is extremely high (500 connected devices cost about 49$, 20k devices cost 799$) UPDATE: PubNub has updated price, now with unlimited devices. Website updates coming soon.
Pubnub is working on their pricing model because some of their customers were paying a lot for unexpected spikes in traffic.
However, it will not be a viable solution for a generic broadcasting messaging app that is meant to be open for everybody, and for which traffic is therefore very highly unpredictable.
This is a pity, since this service would have been the perfect solution for us otherwise.
2. Ably
Ably offers a pubsub system to stream data to clients over websockets for custom channels. Channels are created dynamically when a client attaches itself in order to either publish or subscribe to that channel.
The main problem here is that:
If we want high geohash accuracy, we need a high number of channels and hence we have to pay more;
If we go with low geohash accuracy, there will be a lot of redundant messaging:
Let's say that we take a channel that is represented by a geohash of 4 characters, spanning a geographical area of 39.1 x 19.5 km.
Any post that gets sent to that channel, would be multiplexed to everybody within that region who is currently listening.
However, let's say that our app allows for a maximum radius of 10km, and half of the connected users has its setting to a 1km radius.
This means that all posts outside of that 2km radius will be multiplexed to these users unnecessarily, and will just be dropped without having any further use.
We should also take into account the scalability of this approach. For every geohash that either producer or consumer needs, another channel will be created.
It is definitely more expensive to have an app that requires topics based on geohashes worldwide, than an app that requires only theme-based topics.
That is, on world-wide adoption, the number of topics increases dramatically, hence will the price.
Another consideration is that our app requires an additional number of channels:
By geohash and group: Our app allows the possibility to create geolocation based groups (which would be the equivalent of Twitter like #hashtags).
By place
By followed users (premium feature)
There are a few optimistic considerations to this approach despite:
Streaming is only required when the newsfeed is active:
when the user has a browser window open with our website +
when the user is on a mobile device, and actively has the related feed open
Further optimisation can be done, e.g. only start streaming as from 10
to 20 seconds after refresh of the feed
Streaming by place / followed users may have high traffic depending on current activity, but many place channels will be idle as well
A very important note in this regard is how Ably bills its consumers, which can be used to our full advantage:
A channel is opened when any of the following happens:
A message is published on the channel via REST
A realtime client attaches to the channel. The channel remains active for the entire time the client is attached to that channel, so
if you connect to Ably, attach to a channel, and publish a message but
never detach the channel, the channel will remain active for as long
as that connection remains open.
A channel that is open will automatically close when all of the
following conditions apply:
There are no more realtime clients attached to the channel At least
two minutes has passed since the last message was published. We keep
channels alive for two minutes to ensure that we can provide
continuity on the channel as part of our connection state recovery.
As an example, if you have 10,000 users, and at your busiest time of
the month there is a single spike where 500 customers establish a
realtime connection to Ably and each attach to one unique channel and
one global shared channel, the peak number of channels would be the
sum of the 500 unique channels per client and the one global shared
channel i.e. 501 peak channels. If throughout the month each of those
10,000 users connects and attaches to their own unique channel, but
not necessarily at the same time, then this does not affect your peak
channel count as peak channels is the concurrent number of channels
open at any point of time during that month.
Optimistic conclusion
The most important conclusion is that we should consider that this feature may not be as crucial as believe it is for a first version of the app.
Although Twitter, Facebook, etc offer this feature of receiving live updates (and users have grown to expect it), an initial beta of our app on a limited scale can work without, i.e. the user has to refresh in order to receive new updates.
During a first launch of the app, statistics can ba gathered to gain more insight into detailed user behaviour. This will enable us to build more solid infrastructural and financial reflections based on factual data.
Putting aside the question of Ably, Pubnub and a DIY solution, the core of the question is this:
Where is message filtering taking place?
There are three possible solution:
The Pub/Sub service.
The Server (WebSocket connection handler).
Client side (the client's device).
Since this is obviously a mobile oriented approach, client side message filtering is extremely rude, as it increases data consumption by the client while much of the data might be irrelevant.
Client side filtering will also increase battery consumption and will likely result in lower acceptance rates by clients.
This leaves pub/sub filtering (channel names / pattern matching) and server-side filtering.
Pub/Sub channel name filtering
A single pub/sub service serves a number of servers (if not all of them), making it a very expensive resource (relative to the resources we have at hand).
Using channel names to filter messages would be ideal - as long as the filtering is cheap (using exact matches with channel name hash mapping).
However, pattern matching (when subscribing to channels with inexact names, such as "users.*") is very expansive when compared to exact pattern matching.
This means that Pub/Sub channel name filtering can't be used to filter all the messages without overloading the pub/sub system.
Server side filtering
Since a server accepts WebSocket connections and bridges between the WebSocket and the pub/sub service, it's in an ideal position to filter the messages.
However, we don't want the server to process all the messages for all the clients for each connection, as this is an extreme duplication of effort.
Hybrid solution
A classic solution would divide the earth into manageable sections (1 sq. km per section will require 510.1 million unique channel names for full coverage... but I would suggest that the 70% ocean space should be neglected).
Busy sections might be subdivided (NYC might require a section per 250 sq meters rather than 1 sq kilometer).
This allows publishers to publish to exact channel names and subscribers to subscribe to exact channel names.
Publishers might need to publish to more than one channel and subscribers might need to subscribe to more than one channel, depending on their exact location and the grid's borders.
This filtering scheme will filter much, but not all.
The server node will need to look into the message, review it's exact geo-location and filter messages before deciding if they should be sent along the WebSocket connection to the client.
Why the Hybrid Solution?
This allows the system to scale with relative ease.
Since server nodes are (by design) cheaper than the pub/sub service, they could be used to handle the exact location filtering (the heavy work).
At the same time, the strength of the pub/sub system can be used to minimize the server's workload and filter the obvious mis-matches.
Pubnub vs. Ably?
I don't know. I didn't use either of them. I worked with Redis and implemented my own pub/sub solution.
I assume they are both great and it's really up to your needs.
Personally I prefer the DIY approach when it comes to customized or complex situations. IMHO, this seems like it would fall into the DIY category if I were to implement it.

How to minimize response time, maximize total system performance in an environment where the the clients send request at a vastly different rate?

In a time-sensitive environment, what is an optimal algorithm to ensure that requests from all clients are handled as quickly and fairly (if competition for computing resources exists, then the client with few pending requests will receive a high processing priority and vice versa) as possible but also ensure that no server computing resources is wasted? For example, a server handles requests from 2 clients, which client A sends out request at a rate of 1req/s and client B sends at a rate of 1000req/s, assuming the server can handle 1001req/s. The rate of requests are only averages, which means any number of requests can be sent at anytime.
EDIT: all requests are homogenous and only one thread can fulfill them
My initial though is to use a round-robin approach, where each client's requests are put into different FIFO queues and the server pull requests from each queue on a round-robin basis. That can ensure that the server is handling requests all the time and the request from client A is not overcrowded by client B.
Please comment on my approach and share any other ideas you have.
I assume your proposed round-robin processing is not related to any pre-emptive mechanism. In that sense, the definition of 'fair' plays an important role. Let's assume that fair relates to the more request you put in, the less number of processed requests are handled if other request exist. For example, if we have 100 request waiting from client A and 50 requests waiting from client B, you process 2 request from client A after which you process 1 request from client B, etc. In other words, include the number of requests per client related to the total number of request to your round-robin scheduler.

bigquery-api-go-client, bigquery streaming inserts latency

Is there an optimal way to improve the speed of requests to an insertAll request via the golang client?
My current structure is such that:
Rows get submit to a queue
Upon hitting job size, a new job is queued up (approx. 250 rows)
A worker picks up a job from the queue
Worker formats + sends insertAll request to BQ
The above works fairly well, however I seem to be having an average of 8-12 seconds per 250 row request.
I should mention that the number of workers being used is generally 200, however I have tried higher / lower values and it does not seem to make much a difference (higher usually seems to take longer per job).

Socket.io scaling - Reduce # of messages or size of messages

I'm sending a lot of messages over a socket.io node.js server, and I have the ability to send over multiple messages in an array, or send each one individually. My question is, what scales better: a larger number of messages sending a smaller amount of data, or fewer messages sending a larger amount of data?
Generally speaking, what #freakish said holds: fewer messages is better, mainly because of the overhead of each message.
There are certain cases, however, when this may not be true.
For instance, if your server has a single processing thread, and your client is sending a very large amount of data, say megabytes, then other clients trying to communicate with the server would block until the transfer completes. If you send the data in smaller chunks, then other clients would be processed in between request. If you happened to have this situation, then you may want to do some testing to find the optimal maximum message size.

Fair job processing algorithm

I've got a machine that accepts user uploads, performs some processing on them, and then returns the result. It usually takes a few minutes to process each upload received.
The problem is, a few users can upload a lot of jobs that basically deny processing to other users for a long time. I thought of just setting a hard cap and using priority queues, e.g. after 5 uploads in an hour, all new uploads are given a lower processing priority. I basically want to process ALL jobs, but I don't want the user who uploaded 1000 jobs to make everyone wait.
My question is, is there a better way to do this?
My goal is to minimize the time between the upload and the result being returned. It would be ideal if the algorithm could work in a distributed manner as well.
Thanks
Implementation will vary widely depending on what these jobs are and how long they take and how varied the processing times are, as well as how likely there is to be a fatal error during the process.
That being said, an easy way to maintain an even distribution of jobs across users is to maintain a list of all the users who have submitted jobs. When you are ready to get a new job, rather than just taking the next job out of a random queue, cycle through the users taking the top job from each user each time.
Again, this can be accomplished a number of ways, I would recommend a map from users to their respective list of jobs submitted. Cycle through the keys of the map each time you are ready for a new job. then get the list of jobs for whatever key you are on, and do the first job.
This is assuming that each job is "atomic" in that one job is not dependent on being executed next to the jobs it was submitted with.
Hope that helps, of course I could have completely misunderstood what you are asking for.
You don't have to roll-your-own. There is Sun Grid Engine. An open-source tool that is built to do that sort of thing, and if you are willing to pay, there is Platform LSF, which I use at work.
What is the maximum # of jobs a user can submit? Can users submit 1 job a a time OR is it a batch of jobs?
So your algorithm would go something like this
If the User has submitted jobs Then
Check how many jobs per hour
If the jobs per hour > than the average Then
Modify the users profile to a lower priority
Else
Check Users priority level and restore
End If
If the priority = HIGH
process right away
Else If priority = MEDIUM
Check Queue for High Priority
If High Priority Found (rerun this loop)
Else Process
Else If priority = LOW
Check Queue for High Priority
If High Priority Found (rerun this loop)
Else Process
Check Queue for Medium Priority
If Medium Priority Found (rerun this loop)
Else Process
Process Queue
End If
You can use a graph algorithm like Edmond's Blossom V to assign all users and jobs to a process. If a user can upload more then another user it would be more simplier for him to find a process. With the Blossom V algorithm you can define a threshold to not exceed the maximum process the server can handle.

Resources