Here is some sample code I've seen.
int expBackoff = (int) Math.pow(2, retryCount);
int maxJitter = (int) Math.ceil(expBackoff*0.2);
int finalBackoff = expBackoff + random.nextInt(maxJitter);
I was wondering what's the advantage of using a random jitter here?
Suppose you have multiple clients that send messages that collide. They all decide to back off. If they use the same deterministic algorithm to decide how long to wait, they will all retry at the same time -- resulting in another collision. Adding a random factor separates the retries.
It smooths traffic on the resource being requested.
If your request fails at a particular time, there's a good chance other requests are failing at almost exactly the same time. If all of these requests follow the same deterministic back-off strategy (say, retrying after 1, 2, 4, 8, 16... seconds), then everyone who failed the first time will retry at almost exactly the same time, and there's a good chance there will be more simultaneous requests than the service can handle, resulting in more failures. This same cluster of simultaneous requests can recur repeatedly, and likely fail repeatedly, even if the overall level of load on the service outside of those retry spikes is small.
By introducing jitter, the initial group of failing requests may be clustered in a very small window, say 100ms, but with each retry cycle, the cluster of requests spreads into a larger and larger time window, reducing the size of the spike at a given time. The service is likely to be able to handle the requests when spread over a sufficiently large window.
Randomization avoids the retries from several calls to happen at the same time.
More information on Exponential Backoff And Jitter can be found here: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
Related
If i do a benchmark, and for example i found the following:
With 1 concurrent user, The api give 150 req/s. (9000 req/minute)
With more than 300 concurrent user, The api start throwing exception.
An app is doing request 1 every 30 minute.
Is it correct if I say:
the best cases is that the api could handle (30 * 9000 = 270.000 user). That is under 30 minute, there would be 270.000 sequential request and each are coming from different user
The worst cases would be when there is 300 user posting request at the same time.
And if it's true, would there any way to calculate the average case ?
Is is the same as calculating worst case, average case complexity of an algorithm ?
One theoretical tool to answer these questions is http://en.wikipedia.org/wiki/Queueing_theory. It says that you are very unlikely to get the level of performance that you are assuming, because the load applied to the system fluctuates, so that there are busy periods and quiet periods. If the system has nothing to do in quiet periods it is forced into idleness that you haven't accounted for. In busy periods, on the other hand, it will typically build up long queues of pending work, until the queues get so long that customers walk away, or the queues become longer than the system can support and it collapses, or both.
The graph at figure 1 page 3 of http://pages.cs.wisc.edu/~dsmyers/cs547/lecture_12_mm1_queue.pdf shows a graph of response time vs applied load for what is probably the most optimistic even vaguely realistic situation. You can see that response time gets very large as you approach maximum load.
By far the most sensible thing to do is to run tests which apply a realistic load to your application - this is important enough for people to build things like http://jmeter.apache.org/. If you want a rule of thumb I'd say don't plan to stress the system at more than 50% of theoretical capacity as you originally calculated.
I would like to ask if someone would have an idea on the best(fastest) algorithm for the following scenario:
X processes generate a list of very large files. Each process generates one file at a time
Y processes are being notified that a file is ready. Each Y process has its own queue to collect the notifications
At a given time 1 X process will notify 1 Y process through a Load Balancer that has the Round Rubin algorithm
Each file has a size and naturally, bigger files will keep both X and Y more busy
Limitations
Once a file gets on a Y process it would be impractical to remove it and move it to another Y process.
I can't think of other limitations at the moment.
Disadvantages to this approach
sometimes X falls behind(files are no longer pushed). It's not really impacted by the queueing system and no matter if I change it it will still have slow/good times.
sometimes Y falls behind(a lot of files gather in the queues). Again, the same thing like before.
1 Y process is busy with a very large file. It also has several small files in its queue that could be taken on by other Y processes.
The notification itself is through HTTP and seems somehow unreliable sometimes. Notifications fail and debugging has not revealed anything.
There are some more details that would help to see the picture more clearly.
Y processes are DB threads/jobs
X processes are web apps
Once files reach the X processes, these would also burn resources from the DB side by querying it. It has an impact on the producing part
Now I considered the following approach:
X will produce files like it has before but will not notify Y. It will hold a buffer (table) to populate the file list
Y will constantly search for files in the buffer and retrieve them itself and store them in its own queue.
Now would this change be practical? Like I said, each Y process has its own queue, it doesn't seem to be efficient to keep it anymore. If so, then I'm still undecided on the next bit:
How to decide which files to fetch
I've read through the knapsack problem and I think that has application if I would have the entire list of files from the beginning which I don't. Actually, I do have the list and the size of each file but I wouldn't know when each file would be ready to be taken.
I've gone through the producer-consumer problem but that centers around a fixed buffer and optimising that but in this scenario the buffer is unlimited and I don't really care if it is large or small.
The next best option would be a greedy approach where each Y process locks on the smallest file and takes it. At first it does appear to be the fastest approach and I'm currently building a simulation to verify that but a second opinion would be fantastic.
Update
Just to be sure that everyone gets the big picture, I'm linking here a fast-done diagram.
Jobs are independent from Processes. They will run at a speed and process how many files are possible.
When a Job finishes with a file it will send a HTTP request to the LB
Each process queues requests (files) coming from the LB
The LB works on a round robin rule
Diagram
The current LB idea is not good
The load balancer as you've described it is a bad idea because it's needlessly required to predict the future, which you are saying is impossible. Moreover, round-robin is a terrible scheduling strategy when jobs have varying lengths.
Just have consumers notify the LB when they're idle. When a new job arrives from a producer, it selects the first idle consumer and sends the job there. When there are no idle consumers, the producer's job is queued in the LB waiting for a free consumer to appear.
This way consumers will always be optimally busy.
You say "Having one queue to serve 100 apps (for example) would be inefficient." This is a huge leap of intuition that's probably wrong. A work queue that's only handling file names can be fast. You need it only to be 100 times faster (because you infer there are 100 consumers) than the average "very large file" handling operation. File handling is normally 10th of seconds or seconds. A queue handler based, say, on an Apache mod or Redis for two random choices, could pretty easily serve 10,000 requests per second. This is a factor of 10 away from being a bottleneck.
If you select from idle consumers on a FIFO basis, the behavior will be round-robin when all jobs are equal length.
If the LB absolutelly cannot queue work
Then let Ty(t) be the total future time needed to complete the work in the queue of consumer y at the current epoch t. The LB's goal is to make Ty(t) values equal for all y and t. This is the ideal.
To get as close as possible to the ideal, it needs an internal model to compute these Ty(t) values. When a new job arrives from a producer at epoch t, it finds consumer y with the the minimum Ty(t) value, assigns the job to this y, and adjusts the model accordingly. This is a variation of the "least time remaining" scheduling strategy, which is optimal for this situation.
The model must inevitably be an approximation. The quality of the approximation will determine its usefulness.
A standard approach (e.g. from OS scheduling), will be to maintain a pair [t, T]_y for each consumer y. T is the estimate of Ty(t) that was computed at the past epoch t. Thus at a later epoch t+d, we can estimate Ty(t+d) as max(T-t,0). The max is because for d>t, the estimated job time has expired, so the consumer should be complete.
The LB uses whatever information it can get to update the model. Examples are estimates of time a job will require (from your description probably based on file size and other characteristics), notification that the consumer has actually finished a job (LB decreases T by the esimated duration of the completed job and updates t), assignment of a new job (LB increases T by the estimated duration of the new job and updates t), and intermediate progress updates of estimated time remaining from consumers during long jobs.
If the information available to the LB is detailed, you will want to replace the total time T in the [t, T]_y pair with a more complete model of the work queued at y: for example a list of estimated job durations, where the head of the list is the one currently being executed.
The more accurate the LB model, the less likely a consumer will starve when work is available, which is what you are trying to avoid.
Background
Echo Nest have a rate limited API. A given application (identified in requests using an API key) can make up to 120 REST calls a minute. The service response includes an estimate of the total number of calls made in the last minute; repeated abuse of the API (exceeding the limit) may cause the API key to be revoked.
When used from a single machine (a web server providing a service to clients) it is easy to control access - the server has full knowledge of the history of requests and can regulate itself correctly.
But I am working on a program where distributed, independent clients make requests in parallel.
In such a case it is much less clear what an optimal solution would be. And in general the problem appears to be undecidable - if over 120 clients, all with no previous history, make an initial request at the same time, then the rate will be exceeded.
But since this is a personal project, and client use is expected to be sporadic (bursty), and my projects have never been hugely successful, that is not expected to be a huge problem. A more likely problem is that there are times when a smaller number of clients want to make many requests as quickly as possible (for example, a client may need, exceptionally, to make several thousand requests when starting for the first time - it is possible two clients would start at around the same time, so they must cooperate to share the available bandwidth).
Given all the above, what are suitable algorithms for the clients so that they rate-limit appropriately? Note that limited cooperation is possible because the API returns the total number of requests in the last minute for all clients.
Current Solution
My current solution (when the question was written - a better approach is given as an answer) is quite simple. Each client has a record of the time the last call was made and the number of calls made in the last minute, as reported by the API, on that call.
If the number of calls is less than 60 (half the limit) the client does not throttle. This allows for fast bursts of small numbers of requests.
Otherwise (ie when there are more previous requests) the client calculates the limiting rate it would need to work at (ie period = 60 / (120 - number of previous requests)) and then waits until the gap between the previous call and the current time exceeds that period (in seconds; 60 seconds in a minute; 120 max requests per minute). This effectively throttles the rate so that, if it were acting alone, it would not exceed the limit.
But the above has problems. If you think it through carefully you'll see that for large numbers of requests a single client oscillates and does not reach maximum throughput (this is partly because of the "initial burst" which will suddenly "fall outside the window" and partly because the algorithm does not make full use of its history). And multiple clients will cooperate to an extent, but I doubt that it is optimal.
Better Solutions
I can imagine a better solution that uses the full local history of the client and models other clients with, say, a Hidden Markov Model. So each client would use the API report to model the other (unknown) clients and adjust its rate accordingly.
I can also imagine an algorithm for a single client that progressively transitions from unlimited behaviour for small bursts to optimal, limited behaviour for many requests without introducing oscillations.
Do such approaches exist? Can anyone provide an implementation or reference? Can anyone think of better heuristics?
I imagine this is a known problem somewhere. In what field? Queuing theory?
I also guess (see comments earlier) that there is no optimal solution and that there may be some lore / tradition / accepted heuristic that works well in practice. I would love to know what... At the moment I am struggling to identify a similar problem in known network protocols (I imagine Perlman would have some beautiful solution if so).
I am also interested (to a lesser degree, for future reference if the program becomes popular) in a solution that requires a central server to aid collaboration.
Disclaimer
This question is not intended to be criticism of Echo Nest at all; their service and conditions of use are great. But the more I think about how best to use this, the more complex/interesting it becomes...
Also, each client has a local cache used to avoid repeating calls.
Updates
Possibly relevant paper.
The above worked, but was very noisy, and the code was a mess. I am now using a simpler approach:
Make a call
From the response, note the limit and count
Calculate
barrier = now() + 60 / max(1, (limit - count))**greedy
On the next call, wait until barrier
The idea is quite simple: that you should wait some length of time proportional to how few requests are left in that minute. For example, if count is 39 and limit is 40 then you wait an entire minute. But if count is zero then you can make a request soon. The greedy parameter is a trade-off - when greater than 1 the "first" calls are made more quickly, but you are more likely hit the limit and end up waiting for 60s.
The performance of this is similar to the approach above, and it's much more robust. It is particularly good when clients are "bursty" as the approach above gets confused trying to estimate linear rates, while this will happily let a client "steal" a few rapid requests when demand is low.
Code here.
After some experimenting, it seems that the most important thing is getting as good an estimate as possible for the upper limit of the current connection rates.
Each client can track their own (local) connection rate using a queue of timestamps. A timestamp is added to the queue on each connection and timestamps older than a minute are discarded. The "long term" (over a minute) average rate is then found from the first and last timestamps and the number of entries (minus one). The "short term" (instantaneous) rate can be found from the times of the last two requests. The upper limit is the maximum of these two values.
Each client can also estimate the external connection rate (from the other clients). The "long term" rate can be found from the number of "used" connections in the last minute, as reported by the server, corrected by the number of local connections (from the queue mentioned above). The "short term" rate can be estimated from the "used" number since the previous request (minus one, for the local connection), scaled by the time difference. Again, the upper limit (maximum of these two values) is used.
Each client computes these two rates (local and external) and then adds them to estimate the upper limit to the total rate of connections to the server. This value is compared with the target rate band, which is currently set to between 80% and 90% of the maximum (0.8 to 0.9 * 120 per minute).
From the difference between the estimated and target rates, each client modifies their own connection rate. This is done by taking the previous delta (time between the last connection and the one before) and scaling it by 1.1 (if the rate exceeds the target) or 0.9 (if the rate is lower than the target). The client then refuses to make a new connection until that scaled delta has passed (by sleeping if a new connected is requested).
Finally, nothing above forces all clients to equally share the bandwidth. So I add an additional 10% to the local rate estimate. This has the effect of preferentially over-estimating the rate for clients that have high rates, which makes them more likely to reduce their rate. In this way the "greedy" clients have a slightly stronger pressure to reduce consumption which, over the long term, appears to be sufficient to keep the distribution of resources balanced.
The important insights are:
By taking the maximum of "long term" and "short term" estimates the system is conservative (and more stable) when additional clients start up.
No client knows the total number of clients (unless it is zero or one), but all clients run the same code so can "trust" each other.
Given the above, you can't make "exact" calculations about what rate to use, but you can make a "constant" correction (in this case, +/- 10% factor) depending on the global rate.
The adjustment to the client connection frequency is made to the delta between the last two connection (adjusting based on the average over the whole minute is too slow and leads to oscillations).
Balanced consumption can be achieved by penalising the greedy clients slightly.
In (limited) experiments this works fairly well (even in the worst case of multiple clients starting at once). The main drawbacks are: (1) it doesn't allow for an initial "burst" (which would improve throughput if the server has few clients and a client has only a few requests); (2) the system does still oscillate over ~ a minute (see below); (3) handling a larger number of clients (in the worst case, eg if they all start at once) requires a larger gain (eg 20% correction instead of 10%) which tends to make the system less stable.
The "used" amount reported by the (test) server, plotted against time (Unix epoch). This is for four clients (coloured), all trying to consume as much data as possible.
The oscillations come from the usual source - corrections lag signal. They are damped by (1) using the upper limit of the rates (predicting long term rate from instantaneous value) and (2) using a target band. This is why an answer informed by someone who understand control theory would be appreciated...
It's not clear to me that estimating local and external rates separately is important (they may help if the short term rate for one is high while the long-term rate for the other is high), but I doubt removing it will improve things.
In conclusion: this is all pretty much as I expected, for this kind of approach. It kind-of works, but because it's a simple feedback-based approach it's only stable within a limited range of parameters. I don't know what alternatives might be possible.
Since you're using the Echonest API, why don't you take advantage of the rate limit headers that are returned with every API call?
In general you get 120 requests per minute. There are three headers that can help you self-regulate your API consumption:
X-Ratelimit-Used
X-Ratelimit-Remaining
X-Ratelimit-Limit
**(Notice the lower-case 'ell' in 'Ratelimit'--the documentation makes you think it should be capitalized, but in practice it is lower case.)
These counts account for calls made by other processes using your API key.
Pretty neat, huh? Well, I'm afraid there is a rub...
That 120-request-per-minute is really an upper bound. You can't count on it. The documentation states that value can fluctuate according to system load. I've seen it as low as 40ish in some calls I've made, and have in some cases seen it go below zero (I really hope that was a bug in the echonest API!)
One approach you can take is to slow things down once utilization (used divided by limit) reaches a certain threshold. Keep in mind though, that on the next call your limit may have been adjusted download significantly enough that 'used' is greater than 'limit'.
This works well up until a point. Since the Echonest doesn't adjust the limit in a predictable mannar, it is hard to avoid 400s in practice.
Here are some links that I've found helpful:
http://blog.echonest.com/post/15242456852/managing-your-api-rate-limit
http://developer.echonest.com/docs/v4/#rate-limits
I'm working on a UDP server/client configuration. The client sends the server a single packet, which varies in size but is usually <500 bytes. The server responds essentially instantly with a single outgoing packet, usually smaller than the incoming request packet. Complete transactions always consist of a single packet exchange.
If the client doesn't see the response within T amount of time, it retries R times, increasing T by X before each retry, before finally giving up and returning an error. Currently, R is never changed.
Is there any special logic to choosing optimum initial T (wait time), R (retries), and X (wait increase)? How persistent should retries be (ie, what minimum R to use) to reach some approximation of a "reliable" protocol?
This is similar to question 5227520. Googling "tcp retries" and "tcp retransmission" leads to lots of suggestions that have been tried over the years. Unfortunately, no single solution appears optimum.
I'd choose T to start at 2 or 3 seconds. My increase X would be half of T (doubling T seems popular, but you quickly get long timeouts). I'd adjust R on the fly to be at least 5 and more if necessary so my total timeout is at least a minute or two.
I'd be careful not to leave R and T too high if subsequent transactions are usually quicker; you might want to lower R and T as your stats allow so you can retry and get a quick response instead of leaving R and T at their max (especially if your clients are human and you want to be responsive).
Keep in mind: you're never going to be as reliable as an algorithm that retries more than you, if those retries succeed. On the other hand, if your server is always available and always "responds essentially instantly" then if the client fails to see a response it's a failure out of your server's control and the only thing that can be done is for the client to retry (although a retry can be more than just resending, such as closing/reopening the connection, trying a backup server at a different IP, etc).
The minimum timeout should be the path latency, or half the Round-Trip-Time (RTT).
See RFC 908 — Reliable Data Protocol.
The big question is deciding what happens after one timeout, do you reset to the same timeout or do you double up? This is a complicated decision based on the size on the frequency of the communication and how fair you wish to play with others.
If you are finding packets are frequently lost and latency is a concern then you want to look at either keeping the same timeout or having a slow ramp up to exponential timeouts, e.g. 1x, 1x, 1x, 1x, 2x, 4x, 8x, 16x, 32x.
If bandwidth isn't much of a concern but latency really is, then follow UDP-based Data Transfer Protocol (UDT) and force the data through with low timeouts and redundant delivery. This is useful for WAN environments, especially intercontinental distances and why UDT is frequently found within WAN accelerators.
More likely latency isn't that much of a concern and fairness to other protocols is preferred, then use a standard back-off pattern, 1x, 2x, 4x, 8x, 16x, 32x.
Ideally the implementation of the protocol handling should be advanced to automatically derive the optimum timeout and retry periods. When there is no data loss you do not need redundant delivery, when there is data loss you need to increase delivery. For timeouts you may wish to consider reducing the timeout in optimum conditions then slowing down when congestion occurs to prevent synonymous broadcast storms.
I am trying to spread out data that is received in bursts. This means I have data that is received by some other application in large bursts. For each data entry I need to do some additional requests on some server, at which I should limit the traffic. Hence I try to spread up the requests in the time that I have until the next data burst arrives.
Currently I am using a token-bucket to spread out the data. However because the data I receive is already badly shaped I am still either filling up the queue of pending request, or I get spikes whenever a bursts comes in. So this algorithm does not seem to do the kind of shaping I need.
What other algorithms are there available to limit the requests? I know I have times of high load and times of low load, so both should be handled well by the application.
I am not sure if I was really able to explain the problem I am currently having. If you need any clarifications, just let me know.
EDIT:
I'll try to clarify the problem some more and explain, why a simple rate limiter does not work.
The problem lies in the bursty nature of the traffic and the fact, that burst have a different size at different times. What is mostly constant is the delay between each burst. Thus we get a bunch of data records for processing and we need to spread them out as evenly as possible before the next bunch comes in. However we are not 100% sure when the next bunch will come in, just aproximately, so a simple divide time by number of records does not work as it should.
A rate limiting does not work, because the spread of the data is not sufficient this way. If we are close to saturation of the rate, everything is fine, and we spread out evenly (although this should not happen to frequently). If we are below the threshold, the spreading gets much worse though.
I'll make an example to make this problem more clear:
Let's say we limit our traffic to 10 requests per seconds and new data comes in about every 10 seconds.
When we get 100 records at the beginning of a time frame, we will query 10 records each second and we have a perfect even spread. However if we get only 15 records we'll have one second where we query 10 records, one second where we query 5 records and 8 seconds where we query 0 records, so we have very unequal levels of traffic over time. Instead it would be better if we just queried 1.5 records each second. However setting this rate would also make problems, since new data might arrive earlier, so we do not have the full 10 seconds and 1.5 queries would not be enough. If we use a token bucket, the problem actually gets even worse, because token-buckets allow bursts to get through at the beginning of the time-frame.
However this example over simplifies, because actually we cannot fully tell the number of pending requests at any given moment, but just an upper limit. So we would have to throttle each time based on this number.
This sounds like a problem within the domain of control theory. Specifically, I'm thinking a PID controller might work.
A first crack at the problem might be dividing the number of records by the estimated time until next batch. This would be like a P controller - proportional only. But then you run the risk of overestimating the time, and building up some unsent records. So try adding in an I term - integral - to account for built up error.
I'm not sure you even need a derivative term, if the variation in batch size is random. So try using a PI loop - you might build up some backlog between bursts, but it will be handled by the I term.
If it's unacceptable to have a backlog, then the solution might be more complicated...
If there are no other constraints, what you should do is figure out the maximum data rate that you are comfortable with sending additional requests, and limit your processing speed according to that. Then monitor what is happening. If that gets through all of your requests quickly, then there is no harm . If its sustained level of processing is not fast enough, then you need more capacity.