I'm working on a UDP server/client configuration. The client sends the server a single packet, which varies in size but is usually <500 bytes. The server responds essentially instantly with a single outgoing packet, usually smaller than the incoming request packet. Complete transactions always consist of a single packet exchange.
If the client doesn't see the response within T amount of time, it retries R times, increasing T by X before each retry, before finally giving up and returning an error. Currently, R is never changed.
Is there any special logic to choosing optimum initial T (wait time), R (retries), and X (wait increase)? How persistent should retries be (ie, what minimum R to use) to reach some approximation of a "reliable" protocol?
This is similar to question 5227520. Googling "tcp retries" and "tcp retransmission" leads to lots of suggestions that have been tried over the years. Unfortunately, no single solution appears optimum.
I'd choose T to start at 2 or 3 seconds. My increase X would be half of T (doubling T seems popular, but you quickly get long timeouts). I'd adjust R on the fly to be at least 5 and more if necessary so my total timeout is at least a minute or two.
I'd be careful not to leave R and T too high if subsequent transactions are usually quicker; you might want to lower R and T as your stats allow so you can retry and get a quick response instead of leaving R and T at their max (especially if your clients are human and you want to be responsive).
Keep in mind: you're never going to be as reliable as an algorithm that retries more than you, if those retries succeed. On the other hand, if your server is always available and always "responds essentially instantly" then if the client fails to see a response it's a failure out of your server's control and the only thing that can be done is for the client to retry (although a retry can be more than just resending, such as closing/reopening the connection, trying a backup server at a different IP, etc).
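For concreteness, here is a minimal sketch of that retry loop in Python; the initial timeout, increment, and retry count are illustrative placeholders, not recommendations.

    import socket

    def udp_request(payload, addr, t_initial=2.0, x_increase=1.0, retries=5):
        """Send one UDP request and wait for the single-packet reply.

        t_initial: first wait time T, in seconds.
        x_increase: amount X added to T before each retry.
        retries: number of retries R after the first attempt.
        """
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            timeout = t_initial
            for attempt in range(retries + 1):
                sock.settimeout(timeout)
                sock.sendto(payload, addr)
                try:
                    reply, _ = sock.recvfrom(2048)  # replies are well under 2 KB
                    return reply
                except socket.timeout:
                    timeout += x_increase  # back off before the next attempt
            raise TimeoutError("no response after %d attempts" % (retries + 1))
        finally:
            sock.close()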
The minimum timeout should be the path latency, or half the Round-Trip-Time (RTT).
See RFC 908 — Reliable Data Protocol.
The big question is deciding what happens after a timeout: do you reset to the same timeout or do you double it? This is a complicated decision based on the size and frequency of the communication and how fairly you wish to play with others.
If you are finding packets are frequently lost and latency is a concern then you want to look at either keeping the same timeout or having a slow ramp up to exponential timeouts, e.g. 1x, 1x, 1x, 1x, 2x, 4x, 8x, 16x, 32x.
If bandwidth isn't much of a concern but latency really is, then follow UDP-based Data Transfer Protocol (UDT) and force the data through with low timeouts and redundant delivery. This is useful for WAN environments, especially over intercontinental distances, which is why UDT is frequently found within WAN accelerators.
More likely, latency isn't that much of a concern and fairness to other protocols is preferred; in that case use a standard back-off pattern, 1x, 2x, 4x, 8x, 16x, 32x.
Ideally the implementation of the protocol handling should be advanced enough to automatically derive the optimum timeout and retry periods. When there is no data loss you do not need redundant delivery; when there is data loss you need to increase delivery. For timeouts you may wish to consider reducing the timeout under optimum conditions, then slowing down when congestion occurs, to prevent synchronized broadcast storms.
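As a concrete illustration of the schedules above, here is a small sketch; the multipliers are exactly the 1x/2x/4x patterns from the text, while the base timeout is an assumed placeholder.

    def slow_ramp_schedule(base, flat_steps=4, total_steps=9):
        """Hold the timeout steady, then grow it exponentially:
        1x, 1x, 1x, 1x, 2x, 4x, 8x, 16x, 32x (for flat_steps=4)."""
        return [base * (1 if i < flat_steps else 2 ** (i - flat_steps + 1))
                for i in range(total_steps)]

    def standard_backoff_schedule(base, total_steps=6):
        """Standard exponential back-off: 1x, 2x, 4x, 8x, 16x, 32x."""
        return [base * 2 ** i for i in range(total_steps)]

    # e.g. with a 1-second base timeout:
    # slow_ramp_schedule(1)        -> [1, 1, 1, 1, 2, 4, 8, 16, 32]
    # standard_backoff_schedule(1) -> [1, 2, 4, 8, 16, 32]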
I'm running a 4-core Amazon EC2 instance (m3.xlarge) with 200,000 concurrent connections and no resource problems (each core at 10-20%, memory at 2/14 GB). However, if I emit a message to all connected users, the first user on a CPU core gets it within milliseconds, but the last connected user gets it with a delay of 1-3 seconds, and each CPU core goes up to 100% for 1-2 seconds. I noticed this problem even at "only" 50k concurrent users (12.5k per core).
How to reduce the delay?
I tried changing redis-adapter to mongo-adapter with no difference.
I'm using this code to get sticky sessions across multiple CPU cores:
https://github.com/elad/node-cluster-socket.io
The test was very simple: The clients do just connect and do nothing more. The server only listens for a message and emits to all.
EDIT: I tested single-core without any cluster/adapter logic with 50k clients and the same result.
I published the server, single-core-server, benchmark and html-client in one package: https://github.com/MickL/socket-io-benchmark-kit
OK, let's break this down a bit. 200,000 users on four cores. If perfectly distributed, that's 50,000 users per core. So, if sending a message to a given user takes 0.1 ms of CPU time each, sending to all of them would take 50,000 * 0.1 ms = 5 seconds.
If you see CPU utilization go to 100% during this, then the bottleneck is probably CPU, and maybe you need more cores on the problem. But there may be other bottlenecks too, such as network bandwidth, network adapters, or the redis process. So, one thing to determine immediately is whether your end-to-end time scales inversely with the number of clusters/CPUs you have. If you drop to 2 cores, does the end-to-end time double? If you go to 8, does it drop in half? If yes for both, that's good news because it means you are probably only running into a CPU bottleneck at the moment, not other bottlenecks. If that's the case, then you need to figure out how to make 200,000 emits across multiple clusters more efficient by examining the node-cluster-socket.io code and finding ways to optimize your specific situation.
The most optimal arrangement would be for each CPU to do all its housekeeping to gather exactly what it needs to send to its 50,000 users, then run a tight loop sending those 50,000 network packets one right after the other. I can't really tell from the redis adapter code whether this is what happens or not.
A much worse case would be where some process gets all 200,000 socket IDs and then loops over each socket ID, looking up in redis which server holds that connection and then sending a message to that server telling it to send to that socket. That would be far less efficient than instructing each server to just send a message to all its own connected users.
It would be worth trying to figure out (by studying the code) where in this spectrum the socket.io + redis combination falls.
Oh, and if you're using an SSL connection for each socket, you are also devoting some CPU to crypto on every send operation. There are ways to offload the SSL processing from your regular CPU (using additional hardware).
When performing AJAX requests, I have always tried to do as few as possible since there is an overhead to each request having to open the http connection to send the data. Since a websocket connection is constantly open, is there any cost outside of the obvious packet bandwidth to sending a request?
For example. Over the space of 1 minute, a client will send 100kb of data to the server. Assuming the client does not need a response to any of these requests, is there any advantage to queuing packets and sending them in one big burst vs sending them as they are ready?
In other words, is there an overhead to the stopping and starting data transfer for a connection that is constantly open?
I want to make a multiplayer browser game as real time as possible, but I don't want to find that hundreds of tiny requests per minute, compared to a larger consolidated request, are causing the server additional stress. I understand that if the client needs a response it will be slower, as there is a lot of waiting in the back and forth. I will consider this and only consolidate when it is appropriate. The more small requests per minute, the better the user experience, but I don't know what toll that will take on the server.
You are correct that a webSocket message will have lower overhead for a given message transmission than sending the same message via an Ajax call because the webSocket connection is already established and because a webSocket message has lower overhead than an HTTP request.
First off, there's always less overhead in sending one larger transmission vs. sending lots of smaller transmissions. That's just the nature of TCP. Every TCP packet gets separately processed and acknowledged so sending more of them costs a bit more overhead. Whether that difference is relevant or significant and worth writing extra code for or worth sacrificing some element of your user experience (because of the delay for batching) depends entirely upon the specifics of a given situation.
Since you've described a situation where your client gets the best experience if there is no delay and no batching of packets, then it seems that what you should do is not implement the batching and test out how your server handles the load with lots of smaller packets when it gets pretty busy. If that works just fine, then stay with the better user experience. If you have issues keeping up with the load, then seriously profile your server and find out where the main bottleneck to performance is (you will probably be surprised about where the bottleneck actually is as it is often not where you think it will be - that's why you have to profile and measure to know where to concentrate your energy for improving the scalability).
FYI, due to Nagle's algorithm in most implementations of TCP, the TCP stack itself does small amounts of batching for you if you are sending multiple requests fairly closely spaced in time or if sending over a slower link.
It's also possible to implement a dynamic system where as long as your server is able to keep up, you keep with the smaller and more responsive packets, but if your server starts to get busy, you start batching in order to reduce the number of separate transmissions.
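A minimal sketch of that dynamic idea in Python; the queue-depth threshold and flush interval are assumptions, not measured values, and a real implementation would also need a timer to flush a lingering batch.

    import json, time

    class AdaptiveSender:
        """Send messages immediately while the outgoing queue is short;
        switch to batching when the backlog suggests the link is falling behind."""

        def __init__(self, send_fn, batch_threshold=20, flush_interval=0.05):
            self.send_fn = send_fn            # e.g. the websocket's send function
            self.batch_threshold = batch_threshold
            self.flush_interval = flush_interval
            self.queue = []
            self.last_flush = time.monotonic()

        def send(self, message):
            self.queue.append(message)
            now = time.monotonic()
            # Small backlog: stay responsive and flush right away.
            # Large backlog: hold messages and send them as one batch per interval.
            if (len(self.queue) < self.batch_threshold
                    or now - self.last_flush >= self.flush_interval):
                self.flush()

        def flush(self):
            if not self.queue:
                return
            payload = self.queue[0] if len(self.queue) == 1 else json.dumps(self.queue)
            self.send_fn(payload)
            self.queue.clear()
            self.last_flush = time.monotonic()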
I'm wondering about the trade-offs between two approaches to handling HTTP timeouts between two services. Service A is trying to implement retry functionality when calling service B.
Approach 1: This is the typical approach (e.g. the Ethernet protocol). Perform a request with fixed timeout T. If a timeout occurs, sleep for X and retry the request. Increase X exponentially.
Approach 2: Instead of sleeping between retries, increase the actual HTTP timeout value (say, exponentially). In both cases, consider a max-bound.
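To make the two approaches concrete, here is a sketch using Python's requests library; the retry counts, timeouts, and cap are placeholders.

    import time
    import requests

    def call_with_sleep_backoff(url, timeout=2.0, retries=4, base_sleep=1.0):
        """Approach 1: fixed request timeout; sleep between retries,
        with the sleep growing exponentially."""
        for attempt in range(retries + 1):
            try:
                return requests.get(url, timeout=timeout)
            except requests.Timeout:
                if attempt == retries:
                    raise
                time.sleep(base_sleep * 2 ** attempt)

    def call_with_growing_timeout(url, base_timeout=2.0, retries=4, max_timeout=60.0):
        """Approach 2: no sleeping; each retry uses a larger timeout,
        capped at max_timeout."""
        timeout = base_timeout
        for attempt in range(retries + 1):
            try:
                return requests.get(url, timeout=timeout)
            except requests.Timeout:
                if attempt == retries:
                    raise
                timeout = min(timeout * 2, max_timeout)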
For Ethernet, this makes sense because of its low-level location in the network stack. However, for an application-level retry mechanism, would approach 2 be more appropriate? In a situation where there are high levels of network congestion, I would think #2 is better for a couple of reasons:
Sending additional TCP connection requests will only flood the network more
You're basically guaranteed to not receive a response when you're sleeping (because you already timed out and/or tore down the socket), whereas if you instead just allowed the TCP request to remain outstanding (or kept the socket open if the connection has at least been established), you at least have the possibility of success occurring.
Any thoughts on this?
On a high-packet-loss network (e.g. cellular, or wi-fi near the limits of its range), there's a distinct possibility that your requests will continue to time out forever if the timeout is too short. So increasing the timeout is often a good idea.
And retrying the request immediately often works, and if it doesn't, waiting a while might make no difference (e.g. if you no longer have a network connection). For example, on iOS, your best bet is to use reachability, and if reachability determines that the network is down, there's no reason to retry until it isn't.
My general thoughts are that for short requests (i.e. not uploading/downloading large files) if you haven't received any response from the server at all after 3-5 seconds, start a second request in parallel. Whichever request returns a header first wins. Cancel the other one. Keep the timeout at 90 seconds. If that fails, see if you can reach generate_204.
If generate_204 works, the problem could be a server issue. Retry immediately, but flag the server as suspect. If that retry fails a second time (after a successful generate_204 response), start your exponential backoff waiting for the server (with a cap on the maximum interval).
If the generate_204 request doesn't respond, your network is dead. Wait for a network change, trying only very occasionally (e.g. every couple of minutes minimum).
If the network connectivity changes (i.e. if you suddenly have Wi-Fi), restart any waiting connections after a few seconds. There's no reason to wait the full time at that point, because everything has changed.
But obviously there's no correct answer. This approach is fairly aggressive. Others might take the opposite approach. It all depends on what your goals are.
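A rough sketch of the "second request in parallel, first response wins" part of this strategy, in Python; the 3-second trigger and 90-second cap come from the description above, while the probe URL is just one commonly used generate_204 endpoint and everything else is an assumption.

    import concurrent.futures
    import requests

    def racing_get(url, second_request_after=3.0, overall_timeout=90.0):
        """Start one request; if nothing has come back after a few seconds,
        start a duplicate and take whichever finishes first."""
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
        try:
            futures = {pool.submit(requests.get, url, timeout=overall_timeout)}
            done, _ = concurrent.futures.wait(futures, timeout=second_request_after)
            if not done:  # nothing back yet: race a second, identical request
                futures.add(pool.submit(requests.get, url, timeout=overall_timeout))
            done, pending = concurrent.futures.wait(
                futures, timeout=overall_timeout,
                return_when=concurrent.futures.FIRST_COMPLETED)
            for f in pending:
                f.cancel()  # best effort; an in-flight request cannot truly be cancelled
            if done:
                return next(iter(done)).result()
            raise TimeoutError("no response within the overall timeout")
        finally:
            pool.shutdown(wait=False)

    def network_is_up(probe="http://connectivitycheck.gstatic.com/generate_204"):
        """Probe a generate_204 endpoint to distinguish a server problem
        from a dead network."""
        try:
            return requests.get(probe, timeout=5).status_code == 204
        except requests.RequestException:
            return False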
There's not much point in sleeping when you could be doing useful work, or in using a shorter timeout than you can really tolerate. I would use (2).
The idea that Ethernet or indeed anything uses (1) seems fanciful. Do you have a citation?
Background
Echo Nest have a rate limited API. A given application (identified in requests using an API key) can make up to 120 REST calls a minute. The service response includes an estimate of the total number of calls made in the last minute; repeated abuse of the API (exceeding the limit) may cause the API key to be revoked.
When used from a single machine (a web server providing a service to clients) it is easy to control access - the server has full knowledge of the history of requests and can regulate itself correctly.
But I am working on a program where distributed, independent clients make requests in parallel.
In such a case it is much less clear what an optimal solution would be. And in general the problem appears to be undecidable - if over 120 clients, all with no previous history, make an initial request at the same time, then the rate will be exceeded.
But since this is a personal project, and client use is expected to be sporadic (bursty), and my projects have never been hugely successful, that is not expected to be a huge problem. A more likely problem is that there are times when a smaller number of clients want to make many requests as quickly as possible (for example, a client may need, exceptionally, to make several thousand requests when starting for the first time - it is possible two clients would start at around the same time, so they must cooperate to share the available bandwidth).
Given all the above, what are suitable algorithms for the clients so that they rate-limit appropriately? Note that limited cooperation is possible because the API returns the total number of requests in the last minute for all clients.
Current Solution
My current solution (when the question was written - a better approach is given as an answer) is quite simple. Each client has a record of the time the last call was made and the number of calls made in the last minute, as reported by the API, on that call.
If the number of calls is less than 60 (half the limit) the client does not throttle. This allows for fast bursts of small numbers of requests.
Otherwise (ie when there are more previous requests) the client calculates the limiting rate it would need to work at (ie period = 60 / (120 - number of previous requests)) and then waits until the gap between the previous call and the current time exceeds that period (in seconds; 60 seconds in a minute; 120 max requests per minute). This effectively throttles the rate so that, if it were acting alone, it would not exceed the limit.
But the above has problems. If you think it through carefully you'll see that for large numbers of requests a single client oscillates and does not reach maximum throughput (this is partly because of the "initial burst" which will suddenly "fall outside the window" and partly because the algorithm does not make full use of its history). And multiple clients will cooperate to an extent, but I doubt that it is optimal.
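For reference, the throttling rule just described amounts to something like this sketch (names are illustrative, not the actual project code):

    import time

    LIMIT = 120  # maximum requests per minute allowed by the API

    def wait_before_call(last_call_time, calls_last_minute):
        """No delay below half the limit; otherwise space calls so that,
        acting alone, the client would not exceed the limit."""
        if calls_last_minute < LIMIT // 2:
            return  # allow fast bursts of small numbers of requests
        period = 60.0 / max(1, LIMIT - calls_last_minute)  # seconds between calls
        elapsed = time.time() - last_call_time
        if elapsed < period:
            time.sleep(period - elapsed)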
Better Solutions
I can imagine a better solution that uses the full local history of the client and models other clients with, say, a Hidden Markov Model. So each client would use the API report to model the other (unknown) clients and adjust its rate accordingly.
I can also imagine an algorithm for a single client that progressively transitions from unlimited behaviour for small bursts to optimal, limited behaviour for many requests without introducing oscillations.
Do such approaches exist? Can anyone provide an implementation or reference? Can anyone think of better heuristics?
I imagine this is a known problem somewhere. In what field? Queuing theory?
I also guess (see comments earlier) that there is no optimal solution and that there may be some lore / tradition / accepted heuristic that works well in practice. I would love to know what... At the moment I am struggling to identify a similar problem in known network protocols (I imagine Perlman would have some beautiful solution if so).
I am also interested (to a lesser degree, for future reference if the program becomes popular) in a solution that requires a central server to aid collaboration.
Disclaimer
This question is not intended to be criticism of Echo Nest at all; their service and conditions of use are great. But the more I think about how best to use this, the more complex/interesting it becomes...
Also, each client has a local cache used to avoid repeating calls.
Updates
Possibly relevant paper.
The above worked, but was very noisy, and the code was a mess. I am now using a simpler approach:
Make a call
From the response, note the limit and count
Calculate
barrier = now() + 60 / max(1, (limit - count))**greedy
On the next call, wait until barrier
The idea is quite simple: you should wait some length of time proportional to how few requests are left in that minute. For example, if count is 39 and limit is 40 then you wait an entire minute. But if count is zero then you can make a request soon. The greedy parameter is a trade-off: when greater than 1 the "first" calls are made more quickly, but you are more likely to hit the limit and end up waiting for 60s.
The performance of this is similar to the approach above, and it's much more robust. It is particularly good when clients are "bursty" as the approach above gets confused trying to estimate linear rates, while this will happily let a client "steal" a few rapid requests when demand is low.
Code here.
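The linked code aside, the barrier calculation can be sketched in a few lines (the class and method names are mine):

    import time

    class BarrierThrottle:
        """Wait until a 'barrier' time computed from the server-reported
        limit and count, as described above."""

        def __init__(self, greedy=1.0):
            self.greedy = greedy
            self.barrier = 0.0

        def before_call(self):
            delay = self.barrier - time.time()
            if delay > 0:
                time.sleep(delay)

        def after_call(self, limit, count):
            # Few requests left in the window -> wait close to a full minute;
            # many left -> the next call can go out almost immediately.
            self.barrier = time.time() + 60.0 / max(1, limit - count) ** self.greedy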
After some experimenting, it seems that the most important thing is getting as good an estimate as possible for the upper limit of the current connection rates.
Each client can track their own (local) connection rate using a queue of timestamps. A timestamp is added to the queue on each connection and timestamps older than a minute are discarded. The "long term" (over a minute) average rate is then found from the first and last timestamps and the number of entries (minus one). The "short term" (instantaneous) rate can be found from the times of the last two requests. The upper limit is the maximum of these two values.
Each client can also estimate the external connection rate (from the other clients). The "long term" rate can be found from the number of "used" connections in the last minute, as reported by the server, corrected by the number of local connections (from the queue mentioned above). The "short term" rate can be estimated from the "used" number since the previous request (minus one, for the local connection), scaled by the time difference. Again, the upper limit (maximum of these two values) is used.
Each client computes these two rates (local and external) and then adds them to estimate the upper limit to the total rate of connections to the server. This value is compared with the target rate band, which is currently set to between 80% and 90% of the maximum (0.8 to 0.9 * 120 per minute).
From the difference between the estimated and target rates, each client modifies its own connection rate. This is done by taking the previous delta (the time between the last connection and the one before) and scaling it by 1.1 (if the rate exceeds the target) or 0.9 (if the rate is lower than the target). The client then refuses to make a new connection until that scaled delta has passed (by sleeping if a new connection is requested).
Finally, nothing above forces all clients to equally share the bandwidth. So I add an additional 10% to the local rate estimate. This has the effect of preferentially over-estimating the rate for clients that have high rates, which makes them more likely to reduce their rate. In this way the "greedy" clients have a slightly stronger pressure to reduce consumption which, over the long term, appears to be sufficient to keep the distribution of resources balanced.
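A condensed sketch of this feedback loop; the constants follow the percentages in the text, but the structure, names, and the simplified external-rate estimate are mine.

    import time

    LIMIT = 120            # server's nominal requests-per-minute cap
    TARGET = (0.8, 0.9)    # aim for 80-90% of the cap

    class FeedbackThrottle:
        def __init__(self):
            self.local_times = []   # timestamps of our own calls in the last minute
            self.delta = 1.0        # current gap between our calls, in seconds

        def _local_rate(self):
            """Upper bound of the long-term and instantaneous local rates (per minute)."""
            now = time.time()
            self.local_times = [t for t in self.local_times if now - t < 60]
            if len(self.local_times) < 2:
                return 0.0
            span = self.local_times[-1] - self.local_times[0]
            long_term = 60.0 * (len(self.local_times) - 1) / span if span > 0 else 0.0
            gap = self.local_times[-1] - self.local_times[-2]
            short_term = 60.0 / gap if gap > 0 else long_term
            return max(long_term, short_term)

        def record_call(self, used_reported):
            """Call after each request; used_reported is the server's 'used in last minute'."""
            self.local_times.append(time.time())
            local = self._local_rate() * 1.1   # over-estimate our own rate slightly (fairness)
            external = max(0.0, used_reported - len(self.local_times))
            total = local + external
            if total > TARGET[1] * LIMIT:
                self.delta *= 1.1              # too fast: stretch the gap between calls
            elif total < TARGET[0] * LIMIT:
                self.delta *= 0.9              # too slow: shrink the gap

        def wait(self):
            """Call before each request."""
            time.sleep(self.delta)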
The important insights are:
By taking the maximum of "long term" and "short term" estimates the system is conservative (and more stable) when additional clients start up.
No client knows the total number of clients (unless it is zero or one), but all clients run the same code so can "trust" each other.
Given the above, you can't make "exact" calculations about what rate to use, but you can make a "constant" correction (in this case, +/- 10% factor) depending on the global rate.
The adjustment to the client connection frequency is made to the delta between the last two connections (adjusting based on the average over the whole minute is too slow and leads to oscillations).
Balanced consumption can be achieved by penalising the greedy clients slightly.
In (limited) experiments this works fairly well (even in the worst case of multiple clients starting at once). The main drawbacks are: (1) it doesn't allow for an initial "burst" (which would improve throughput if the server has few clients and a client has only a few requests); (2) the system does still oscillate over ~ a minute (see below); (3) handling a larger number of clients (in the worst case, eg if they all start at once) requires a larger gain (eg 20% correction instead of 10%) which tends to make the system less stable.
The "used" amount reported by the (test) server, plotted against time (Unix epoch). This is for four clients (coloured), all trying to consume as much data as possible.
The oscillations come from the usual source: corrections lag the signal. They are damped by (1) using the upper limit of the rates (predicting the long-term rate from the instantaneous value) and (2) using a target band. This is why an answer informed by someone who understands control theory would be appreciated...
It's not clear to me that estimating local and external rates separately is important (they may help if the short term rate for one is high while the long-term rate for the other is high), but I doubt removing it will improve things.
In conclusion: this is all pretty much as I expected, for this kind of approach. It kind-of works, but because it's a simple feedback-based approach it's only stable within a limited range of parameters. I don't know what alternatives might be possible.
Since you're using the Echonest API, why don't you take advantage of the rate limit headers that are returned with every API call?
In general you get 120 requests per minute. There are three headers that can help you self-regulate your API consumption:
X-Ratelimit-Used
X-Ratelimit-Remaining
X-Ratelimit-Limit
(Notice the lower-case 'l' in 'Ratelimit'; the documentation makes you think it should be capitalized, but in practice it is lower case.)
These counts account for calls made by other processes using your API key.
Pretty neat, huh? Well, I'm afraid there is a rub...
That 120-requests-per-minute figure is really an upper bound. You can't count on it. The documentation states the value can fluctuate according to system load. I've seen it as low as 40-ish in some calls I've made, and in some cases I've seen it go below zero (I really hope that was a bug in the Echo Nest API!)
One approach you can take is to slow things down once utilization (used divided by limit) reaches a certain threshold. Keep in mind, though, that on the next call your limit may have been adjusted downward significantly enough that 'used' is greater than 'limit'.
This works well up until a point. Since Echo Nest doesn't adjust the limit in a predictable manner, it is hard to avoid 400s in practice.
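One way to implement the threshold idea; the 0.8 threshold and the sleep scaling are placeholders, while the header names are the ones listed above.

    import time
    import requests

    def rate_limited_get(url, params=None, utilization_threshold=0.8):
        """Call the API and slow down once used/limit crosses a threshold.

        Note that the limit itself can shrink between calls, so 'used'
        may end up greater than 'limit'."""
        response = requests.get(url, params=params)
        used = int(response.headers.get("X-Ratelimit-Used", 0))
        limit = max(1, int(response.headers.get("X-Ratelimit-Limit", 120)))
        utilization = used / limit
        if utilization >= utilization_threshold:
            # Sleep longer the closer we are to (or the further past) the cap.
            time.sleep(min(60.0, 60.0 * (utilization - utilization_threshold + 0.1)))
        return response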
Here are some links that I've found helpful:
http://blog.echonest.com/post/15242456852/managing-your-api-rate-limit
http://developer.echonest.com/docs/v4/#rate-limits
The Wikipedia entry doesn't give details and the RFC is way too dense. Does anyone around here know, in a very general way, how NTP works?
I'm looking for an overview that explains how Marzullo's algorithm (or a modification of it) is employed to translate a timestamp on a server into a timestamp on a client. Specifically what mechanism is used to produce accuracy which is, on average, within 10ms when that communication takes place over a network with highly variable latency which is frequently several times that.
(This isn't Marzullo's algorithm. That's only used by the high-stratum servers to get really accurate time using several sources. This is how an ordinary client gets the time, using only one server)
First of all, NTP timestamps are stored as seconds since January 1, 1900: 32 bits for the whole seconds and 32 bits for the fraction of a second.
The synchronization is tricky. The client stores the timestamp (say A) (all these values are in seconds) when it sends the request. The server sends a reply consisting of the "true" time when it received the packet (call that X) and the "true" time it will transmit the packet (Y). The client will receive that packet and log the time when it received it (B).
NTP assumes that the time spent on the network is the same for sending and receiving. Over enough intervals over sane networks, it should average out to be so. We know that the total transit time from sending the request to receiving the response was B-A seconds. We want to remove the time that the server spent processing the request (Y-X), leaving only the network traversal time, so that's B-A-(Y-X). Since we're assuming the network traversal time is symmetric, the amount of time it took the response to get from the server to the client is [B-A-(Y-X)]/2. So we know that the server sent its response at time Y, and it took us [B-A-(Y-X)]/2 seconds for that response to get to us.
So the true time when we received the response is Y+[B-A-(Y-X)]/2 seconds. And that's how NTP works.
Example (in whole seconds to make the math easy):
Client sends request at "wrong" time 100. A=100.
Server receives request at "true" time 150. X=150.
The server is slow, so it doesn't send out the response until "true" time 160. Y=160.
The client receives the request at "wrong" time 120. B=120.
Client determines the time spent on the network is B-A-(Y-X) = 120-100-(160-150) = 10 seconds.
Client assumes the amount of time it took for the response to get from the server to the client is 10/2=5 seconds.
Client adds that time to the "true" time when the server sent the response to estimate that it received the response at "true" time 165 seconds.
Client now knows that it needs to add 45 seconds to its clock.
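The arithmetic in the example can be written out directly; the variable names A, X, Y, B match the description above.

    def ntp_offset(A, X, Y, B):
        """A: client clock when the request was sent.
        X: server ('true') clock when the request arrived.
        Y: server ('true') clock when the reply was sent.
        B: client clock when the reply arrived."""
        network_time = (B - A) - (Y - X)   # total time on the wire, both directions
        one_way = network_time / 2.0       # assume the path is symmetric
        true_receive_time = Y + one_way    # 'true' time the reply reached the client
        return true_receive_time - B       # how much to adjust the client clock

    # With the numbers from the example:
    # ntp_offset(100, 150, 160, 120) == 45.0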
In a proper implementation, the client runs as a daemon, all the time. Over a long period of time with many samples, NTP can actually determine if the computer's clock is slow or fast, and automatically adjust it accordingly, allowing it to keep reasonably good time even if it is later disconnected from the network. Together with averaging the responses from the server, and application of more complicated thinking, you can get incredibly accurate times.
There's more, of course, to a proper implementation than that, but that's the gist of it.
The NTP client asks all of its NTP servers what time it is. The different servers will give different answers, with different confidence levels, because the requests will take different amounts of time to travel from the client to the server and back. Marzullo's algorithm will find the smallest range of time values consistent with all of the answers provided.
You can be more confident of the accuracy of the answer from this algorithm than of that from any single time server, because the intersection of several sets will likely contain fewer elements than any individual set.
The more servers you query, the more constraints you'll have on the possible answer, and the more accurate your clock will be.
If you are using timestamps to decide ordering, specific times may not be necessary. You could use Lamport clocks instead, which are less of a pain than network synchronization. A Lamport clock can tell you what came "first", but not the exact difference in times, and it doesn't care what the computer's clock actually says.
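A minimal Lamport clock, for comparison (the class and method names are illustrative):

    class LamportClock:
        """Logical clock: orders events without synchronized wall clocks."""

        def __init__(self):
            self.counter = 0

        def tick(self):
            """Call on every local event; returns that event's timestamp."""
            self.counter += 1
            return self.counter

        def send(self):
            """Timestamp to attach to an outgoing message."""
            return self.tick()

        def receive(self, message_timestamp):
            """Merge the sender's timestamp when a message arrives."""
            self.counter = max(self.counter, message_timestamp) + 1
            return self.counter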
The trick is that some packets are fast, and the fast packets give you tight constraints on the time.