How do UDP sockets actually work internally?

I am trying to reduce packet manipulation to a minimum in order to improve the efficiency of a specific program I am working on, but I am struggling with the time it takes to send through a UDP socket using sendto/recvfrom. I am using two very basic processes (applications): one sends, the other receives.
I want to understand how Linux works internally when these function calls are used.
Here are my observations. When sending packets at:
10 Kbps, the time it takes for a message to go from one application to the other is about 28 us
400 Kbps, the time it takes for a message to go from one application to the other is about 25 us
4 Mbps, the time it takes for a message to go from one application to the other is about 20 us
40 Mbps, the time it takes for a message to go from one application to the other is about 18 us
When using different CPUs, the times are obviously different but consistent with these observations. There must be some sort of setting that makes queue reads happen faster depending on the traffic flow on a socket. How can that be controlled?
When a node is used as a forwarding node only, going in and out takes about 8 us with a 400 Kbps flow, and I want to get as close to this value as I can. 25 us is not acceptable and is deemed too slow (this is obviously far less than the delay between packets anyway, but the point is to eventually be able to process a greater number of packets, so this time needs to be shortened). Is there anything faster than sendto/recvfrom? I must use two different applications (processes); I know I cannot use a monolithic block, so the information has to be sent over a socket.
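For reference, a measurement of this kind can be sketched as follows. This is a minimal single-process illustration over loopback, not the two-process setup described above; the function name measure_udp_latency, the payload size, and the one-second timeout are assumptions made for the example. It timestamps each datagram just before sendto and again right after recvfrom returns.

```python
import socket
import time


def measure_udp_latency(count=100, payload_size=64):
    """Return per-datagram send-to-receive times in microseconds (loopback)."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    receiver.settimeout(1.0)         # do not hang forever on a lost datagram
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    address = receiver.getsockname()
    payload = b"x" * payload_size
    samples = []
    try:
        for _ in range(count):
            start = time.perf_counter_ns()
            sender.sendto(payload, address)
            data, _peer = receiver.recvfrom(2048)
            elapsed_ns = time.perf_counter_ns() - start
            if len(data) != payload_size:
                raise RuntimeError("unexpected datagram size")
            samples.append(elapsed_ns / 1000.0)
    finally:
        sender.close()
        receiver.close()
    return samples
```

Calling measure_udp_latency() and taking the median of the returned list gives a rough per-message figure. Absolute numbers from Python will include interpreter overhead, so they are only useful for comparing configurations on the same machine.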

Related

Multiplayer game server: How much is too much communication from the client to the server

I am making a multiplayer game (server/client) with Unity and a Colyseus backend. Currently the backend sends 20 updates per second to each client. I want each client to also send approximately 20 messages to the server each second. Is this too much communication? (The messages are very small: a JSON object with 5 string fields.)
I don't want to build the game and then find out it is not scalable :(. So, in short: is each client sending a small message to the server 20 times a second too much?

As mentioned by Slugart, it is best to benchmark and go from there.
That being said, there are a few things you can do if you find the performance to be a bottleneck:
Lower the number of messages - generally, 20 messages per second per client might be a bit too much - games often use roughly half of that or less (6-12 msg/s).
Use a binary format instead of JSON - if the server needs to act as a relay, you could encode your messages using a binary protocol. Look into protobuf or MessagePack.
There are some other options, but they are not available for JavaScript (as far as I know).
If you are expecting a large number of players and want to optimize as much as possible, I would suggest switching to a backend that supports multithreading, object pooling (to reduce garbage collection time), etc., to gain the most performance.
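As a rough illustration of the binary-format suggestion, the sketch below compares the size of a JSON message with a hand-rolled length-prefixed encoding. This is not protobuf or MessagePack; it is a simplified stand-in written with Python's struct module, and the five field names are hypothetical. The saving comes mainly from not sending the keys and punctuation with every message.

```python
import json
import struct

# Hypothetical field names; sender and receiver must agree on this order.
FIELDS = ("id", "name", "action", "target", "state")


def encode_json(message):
    """Compact JSON encoding (keys are sent with every message)."""
    return json.dumps(message, separators=(",", ":")).encode("utf-8")


def encode_binary(message):
    """Length-prefixed UTF-8 strings in a fixed field order (no keys on the wire)."""
    parts = []
    for field in FIELDS:
        raw = message[field].encode("utf-8")
        parts.append(struct.pack("!H", len(raw)) + raw)  # 2-byte big-endian length
    return b"".join(parts)


def decode_binary(data):
    """Inverse of encode_binary."""
    message = {}
    offset = 0
    for field in FIELDS:
        (length,) = struct.unpack_from("!H", data, offset)
        offset += 2
        message[field] = data[offset:offset + length].decode("utf-8")
        offset += length
    return message
```

Comparing len(encode_json(msg)) with len(encode_binary(msg)) for a representative message shows how much of the JSON payload is structure rather than data.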
Disclaimer: I am a co-founder of ServerBytes - we help you make games faster.
You can also try ServerBytes for free - a platform which supports high concurrency, high throughput, custom C# backend code and more.

This depends on many things that you haven't specified; first among those is how many simultaneous players and how many server instances you expect to have.
I would recommend you quickly benchmark how long the (de)serialisation of your message takes and then multiply it by the actual message volume you expect to see.
You could also create a proof of concept that does nothing except send messages at different message rates, to see for yourself how it would scale.
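The benchmark-then-multiply step can be sketched like this. It is a minimal illustration in Python rather than the JavaScript the server would actually run, so the absolute numbers will differ; the function names and the sample volumes are assumptions.

```python
import json
import timeit


def per_message_cost_us(message, repeats=10_000):
    """Average time in microseconds for one JSON encode + decode round trip."""
    def round_trip():
        json.loads(json.dumps(message))

    total_seconds = timeit.timeit(round_trip, number=repeats)
    return total_seconds / repeats * 1e6


def projected_cpu_fraction(cost_us, clients, messages_per_second):
    """Fraction of one CPU core spent on (de)serialisation at the given volume."""
    return cost_us * 1e-6 * clients * messages_per_second
```

For example, a measured cost of 10 us per message with 1,000 clients each sending 20 messages per second works out to 10 us * 20,000 = 0.2 s of CPU time per second, i.e. about 20% of one core for (de)serialisation alone.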

Socket.io: How to reduce emit delay with many concurrent connections?

I'm running a 4-core Amazon EC2 instance (m3.xlarge) with 200,000 concurrent connections and no resource problems (each core at 10-20%, memory at 2/14 GB). However, if I emit a message to all users, the user who connected first on a CPU core gets it within milliseconds, but the last connected user gets it with a delay of 1-3 seconds, and each CPU core goes up to 100% for 1-2 seconds. I noticed this problem even at "only" 50k concurrent users (12.5k per core).
How can I reduce the delay?
I tried changing redis-adapter to mongo-adapter, with no difference.
I'm using this code to get sticky sessions on multiple CPU cores:
https://github.com/elad/node-cluster-socket.io
The test was very simple: the clients just connect and do nothing more. The server only listens for a message and emits to all.
EDIT: I tested single-core without any cluster/adapter logic with 50k clients and got the same result.
I published the server, single-core-server, benchmark and html-client in one package: https://github.com/MickL/socket-io-benchmark-kit

OK, let's break this down a bit. 200,000 users on four cores. If perfectly distributed, that's 50,000 users per core. So, if sending a message to a given user takes 0.1 ms of CPU time, it would take 50,000 * 0.1 ms = 5 seconds to send them all.
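The same estimate as a small helper, so other core counts and per-send costs can be plugged in. This is only the back-of-the-envelope model from the paragraph above (even distribution, CPU-bound sends); the function name is made up for the example.

```python
def broadcast_time_seconds(users, cores, per_send_ms):
    """Time for each core to send to its share of users, assuming an even split."""
    users_per_core = users / cores
    return users_per_core * per_send_ms / 1000.0
```

With 200,000 users, 4 cores and 0.1 ms per send this gives about 5 seconds. Under the same model, 2 cores give about 10 seconds and 8 cores about 2.5 seconds.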
If you see CPU utilization go to 100% during this, then the bottleneck is probably CPU and you may need more cores on the problem. But there may be other bottlenecks too, such as network bandwidth, network adapters or the redis process. So, one thing to determine immediately is whether your end-to-end time is inversely proportional to the number of clusters/CPUs you have. If you drop to 2 cores, does the end-to-end time double? If you go to 8, does it drop in half? If yes for both, that's good news, because it means you are probably only running into a CPU bottleneck at the moment, not other bottlenecks. If that's the case, then you need to figure out how to make 200,000 emits across multiple clusters more efficient by examining the node-cluster-socket.io code and finding ways to optimize your specific situation.
The most efficient approach would be for every CPU to do all its housekeeping to gather exactly what it needs to send to all 50,000 users, and then for each CPU to run a tight loop sending 50,000 network packets one right after the other. I can't really tell from the redis adapter code whether this is what happens or not.
A much worse case would be where some process gets all 200,000 socket IDs and then goes into a loop to send to each socket ID, where in that loop it has to look up on redis which server contains that connection and then send a message to that server telling it to send to that socket. That would be a ton less efficient than instructing each server to just send a message to all its own connected users.
It would be worth trying to figure out (by studying code) where in this spectrum, the socket.io + redis combination is.
Oh, and if you're using an SSL connection for each socket, you are also devoting some CPU to crypto on every send operation. There are ways to offload the SSL processing from your regular CPU (using additional hardware).

What additional overheads are there to sending a packet over a websocket connection?

When performing AJAX requests, I have always tried to do as few as possible since there is an overhead to each request having to open the http connection to send the data. Since a websocket connection is constantly open, is there any cost outside of the obvious packet bandwidth to sending a request?
For example. Over the space of 1 minute, a client will send 100kb of data to the server. Assuming the client does not need a response to any of these requests, is there any advantage to queuing packets and sending them in one big burst vs sending them as they are ready?
In other words, is there an overhead to the stopping and starting data transfer for a connection that is constantly open?
I want to make a multiplayer browser game as real time as possible, but I don't want to find that 100s of tiny requests per minute compared to a larger consolidated request is causing the server additional stress. I understand that if the client needs a response it will be slower as there is a lot of waiting from the back and forth. I will consider this and only consolidate when it is appropriate. The more smaller requests per minute, the better user experience, but I don't know what toll it will have on the server.

You are correct that a webSocket message will have lower overhead for a given message transmission than sending the same message via an Ajax call because the webSocket connection is already established and because a webSocket message has lower overhead than an HTTP request.
First off, there's always less overhead in sending one larger transmission vs. sending lots of smaller transmissions. That's just the nature of TCP. Every TCP packet gets separately processed and acknowledged so sending more of them costs a bit more overhead. Whether that difference is relevant or significant and worth writing extra code for or worth sacrificing some element of your user experience (because of the delay for batching) depends entirely upon the specifics of a given situation.
Since you've described a situation where your client gets the best experience if there is no delay and no batching of packets, then it seems that what you should do is not implement the batching and test out how your server handles the load with lots of smaller packets when it gets pretty busy. If that works just fine, then stay with the better user experience. If you have issues keeping up with the load, then seriously profile your server and find out where the main bottleneck to performance is (you will probably be surprised about where the bottleneck actually is as it is often not where you think it will be - that's why you have to profile and measure to know where to concentrate your energy for improving the scalability).
FYI, due to the use of Nagle's algorithm in most implementations of TCP, the TCP stack itself does a small amount of batching for you if you are sending multiple requests closely spaced in time or over a slower link.
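Where that coalescing is unwanted, Nagle's algorithm can be switched off per socket with the standard TCP_NODELAY option. The sketch below shows this for a plain TCP socket in Python; it only applies to code that owns the socket (for example a server process), and the helper name is an assumption for the example.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY set)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

With the option set, small writes are handed to the network without waiting to be combined, which trades some per-packet overhead for lower latency.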
It's also possible to implement a dynamic system where as long as your server is able to keep up, you keep with the smaller and more responsive packets, but if your server starts to get busy, you start batching in order to reduce the number of separate transmissions.
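One way such a dynamic scheme could look is sketched below. It is a minimal illustration, not a tested design: the class name, the 0.0-1.0 server_load figure (which the server would have to report somehow), and the thresholds are all hypothetical.

```python
class AdaptiveBatcher:
    """Send messages one by one while load is low; batch them while load is high."""

    def __init__(self, send, busy_threshold=0.8, max_batch=10):
        self.send = send                      # callable taking a list of messages
        self.busy_threshold = busy_threshold  # load above this triggers batching
        self.max_batch = max_batch            # flush once this many are pending
        self.pending = []

    def submit(self, message, server_load):
        """Queue a message; server_load is a hypothetical 0.0-1.0 load figure."""
        self.pending.append(message)
        if server_load < self.busy_threshold or len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        """Send everything queued so far as one transmission."""
        if self.pending:
            self.send(self.pending)
            self.pending = []
```

A real implementation would also flush on a timer, so that a message queued during a busy period is not held indefinitely when no further messages follow it.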

ZeroMQ pattern for load balancing work across workers based on idleness

I have a single producer and n workers that I only want to give work to when they're not already processing a unit of work and I'm struggling to find a good zeroMQ pattern.
1) REQ/REP
The producer is the requestor and creates a connection to each worker. It tracks which worker is busy and round-robins to idle workers
Problem:
How to be notified of responses and still able to send new work to idle workers without dedicating a thread in the producer to each worker?
2) PUSH/PULL
Producer pushes into one socket that all workers feed off, and workers push into another socket that the producer listens to.
Problem:
Has no concept of worker idleness, i.e. work gets stuck behind long units of work
3) PUB/SUB
Non-starter, since there is no way to make sure work doesn't get lost
4) Reverse REQ/REP
Each worker is the REQ end and requests work from the producer and then sends another request when it completes the work
Problem:
Producer has to block on a request for work until there is work (since each recv has to be paired with a send). This prevents workers from responding with work completion
Could be fixed with a separate completion channel, but the producer still needs some polling mechanism to detect new work and stay on the same thread.
5) PAIR per worker
Each worker has its own PAIR connection allowing independent sending of work and receipt of results
Problem:
Same problem as REQ/REP with requiring a thread per worker
As much as ZeroMQ is non-blocking/async under the hood, I cannot find a pattern that allows my code to be asynchronous as well, rather than blocking in many, many dedicated threads or polling in spin-loops in fewer. Is this just not a good use case for ZeroMQ?

Your problem is solved with the Load Balancing Pattern in the ZMQ Guide. It's all about flow control whilst also being able to send and receive messages. The producer will only send work requests to idle workers, whilst the workers are able to send and receive other messages at all times, e.g. abort, shutdown, etc.
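The core idea, that workers announce when they are idle and the producer only dispatches to a worker that has announced itself, can be sketched without ZeroMQ. The example below uses Python threads and queue.Queue as stand-ins for sockets; it is not the Guide's code, and the function name and the squaring "work" are placeholders.

```python
import queue
import threading


def run_load_balanced(tasks, worker_count=3):
    """Dispatch each task only to a worker that has announced it is idle."""
    ready = queue.Queue()    # ids of workers that are currently idle
    results = queue.Queue()  # (worker_id, task, result) tuples
    inboxes = [queue.Queue() for _ in range(worker_count)]

    def worker(worker_id):
        while True:
            ready.put(worker_id)              # announce: "I am idle, send me work"
            task = inboxes[worker_id].get()
            if task is None:                  # shutdown signal
                return
            results.put((worker_id, task, task * task))  # placeholder for real work

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(worker_count)]
    for thread in threads:
        thread.start()

    for task in tasks:
        idle_worker = ready.get()             # blocks until some worker is idle
        inboxes[idle_worker].put(task)

    # Each worker announces itself once more after its last task; answer each
    # of those announcements with a shutdown signal.
    for _ in range(worker_count):
        inboxes[ready.get()].put(None)
    for thread in threads:
        thread.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected
```

A slow worker simply announces itself less often, so it is handed fewer tasks, and no task is ever queued behind a long-running one. In this sketch the producer blocks while waiting for an idle worker; a real producer would poll its sockets instead so that it can also react to other messages.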

Push/Pull is your answer.
When you send a message in ZeroMQ, all that happens initially is that it sits in a queue waiting to be delivered to the destination(s). When it has been successfully transferred, it is removed from the queue. The queue is limited in length, but the limit can be changed by setting the socket's high water mark.
There are one or more background threads that manage all this on your behalf, and your calls to the ZeroMQ API simply issue instructions to those threads. The threads at either end of a socket connection collaborate to marshal the transfer of messages, i.e. a sender won't send a message unless the recipient can receive it.
Consider what this means in a push/pull setup. Suppose one of your pull workers is falling behind. It then won't be accepting messages. That means that messages being sent to it start piling up until the high water mark is reached. ZeroMQ will then no longer send messages to that pull worker. In fact, AFAIK, in ZeroMQ a pull worker whose queue is fuller than those of its peers will receive fewer messages, so the workload is evened out across all workers.
So What Does That Mean?
Just send the messages. Let 0MQ sort it out for you.
Whilst there's no explicit flag saying 'already busy', if messages can be sent at all then that means that some pull worker somewhere is able to receive it solely because it has kept up with the workload. It will therefore be best placed to process new messages.
There are limitations. If all the workers are full up then no messages are sent and you get blocked in the push when it tries to send another message. You can discover this only (it seems) by timing how long the zmq_send() took.
Don't Forget the Network
There's also the matter of network bandwidth to consider. Messages queued in the push will transfer at the rate at which they're consumed by the recipients, or at the speed of the network (whichever is slower). If your network is fundamentally too slow, then it's the Wrong Network for the job.
Latency
Of course, messages piling up in buffers represents latency. This can be restricted by setting the high water mark to be quite low.
This won't cure a high latency problem, but it will allow you to find out that you have one. If you have an inadequate number of pull workers, a low high water mark will result in message sending failing/blocking sooner.
Actually I think in ZeroMQ it blocks for push/pull; you'd have to measure elapsed time in the call to zmq_send() to discover whether things had got bottled up.
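The effect of a low high water mark, and the idea of timing the send call to detect a backlog, can be illustrated with a bounded queue. The sketch below uses Python's queue.Queue as a stand-in for the socket's outgoing buffer; it does not use the ZeroMQ API, and the function name and timeout are assumptions.

```python
import queue
import time


def timed_send(outbox, message, timeout):
    """Try to enqueue a message; return (accepted, seconds spent in the call)."""
    start = time.monotonic()
    try:
        outbox.put(message, timeout=timeout)  # blocks while the queue is full
        accepted = True
    except queue.Full:
        accepted = False
    return accepted, time.monotonic() - start
```

With outbox = queue.Queue(maxsize=2), where maxsize plays the role of a low high water mark, the first two sends return immediately. A third send with nobody consuming blocks for the full timeout and is rejected, which is the signal that the consumers are not keeping up.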
Thought about Nanomsg?
Nanomsg is a reboot of ZeroMQ; one of the same guys is involved. There are many things I prefer about it, and ultimately I think it will replace ZeroMQ. It has some fancier patterns which are more universally usable (PAIR works on all transports, unlike in ZeroMQ). Also, the patterns are essentially a pluggable component in the source code, so it is far simpler for patterns to be developed and integrated than in ZeroMQ. There is a discussion on the differences here
Philosophical Discussion
Actor Model
ZeroMQ is definitely in the realms of Actor Model programming. Messages get stuffed into queues / channels / sockets, and at some undetermined point in time later they emerge at the recipient end to be processed.
The danger of this type of architecture is that it is possible to have the potential for deadlock without knowing it.
Suppose you have a system where messages pass both ways down a chain of processes, say instructions in one way and results in the other. It is possible that one of the processes will be trying to send a message whilst the recipient is actually also trying to send a message back to it.
That only works so long as the queues aren't full and can (temporarily) absorb the messages, allowing everyone to move on.
But suppose the network briefly became a little busy for some reason, and that delayed message transfer. The message send might then fail because the high water mark had been reached. Whoops! No one is then sending anything to anyone anymore!
CSP
A development of the Actor Model, called Communicating Sequential Processes, was invented to solve this problem. It has a restriction; there is no buffering of messages at all. No process can complete sending a message until the recipient has received all the data.
The theoretical consequence of this was that it was then possible to mathematically analyse a system design and pronounce it to be free of deadlock. The practical consequence is that if you've built a system that can deadlock, it will do so every time. That's actually not so bad; it'll show up in testing, not post-deployment.
Curiously, this is hinted at in the documentation of Microsoft's Task Parallel Library, where they advocate setting buffer lengths to zero in the interests of achieving a more robust application.
It'd be like setting the ZeroMQ high water mark to zero, but in zmq_setsockopt() a high water mark of 0 means "no limit", not nought, so that option is not available.
CSP is much more suited to real time applications. Any shortage of available workers immediately results in an inability to send messages (so your system knows it's failed to keep up with the real time demand) instead of resulting in an increased latency as data is absorbed by sockets, etc. (which is far harder to discover).
Unfortunately almost every communications technology we have (Ethernet, TCP/IP, ZeroMQ, nanomsg, etc) leans towards Actor Model. Everything has some sort of buffer somewhere, be it a packet buffer on a NIC or a socket buffer in an operating system.
Thus, to implement CSP in the real world one has to implement flow control on top of the existing transports. This takes work, and it's slightly inefficient. But if a system needs it, it's definitely the way to go.
Personally, I'd love to see 0MQ and Nanomsg adopt it as a behavioural option.

Data stream, which way to go?

I am about to start an HTML5 game with heavy logic in JavaScript. I want to keep some logic on the server side, so that I can guarantee my game will only run against my server.
I decided to choose node.js, as it's very fast. I thought about two ways:
Use AJAX: the client side calls a server-side method which returns calculated numbers to refresh the game scene. This call is made every 2 seconds.
Open a socket using node.js, so that the client doesn't have to call the server each time; instead, it keeps listening to data streamed from the open socket, which refreshes the data every x seconds.
The calculated data is not big, about 0.5 KB per second. The client also needs to tell the server its status, so the data sent from the client is about 0.1 KB every x seconds, depending on game play.
It seems that the second approach is better, but I will need hundreds of ports to handle concurrent players.
So, in terms of performance and minimizing bandwidth, which way should I choose? Or is there an even better way? Can anyone help?

As you mentioned, you are creating a web-based JavaScript application that regularly sends information to, or retrieves updates from, a server. In my opinion you should use WebSocket (especially since you are developing in HTML5), which reduces the amount of bandwidth your application uses.
In terms of performance, I would choose WebSocket as well. In measurement experiments, e.g. averaging the round-trip time for 100 requests at a time, WebSocket has a lower round-trip time. Here is a link to a performance test that shows the result: http://www.peterbe.com/plog/are-websockets-faster-than-ajax
