Low-latency, large-scale message queuing - performance

I'm going through a bit of a re-think of large-scale multiplayer games in the age of Facebook applications and cloud computing.
Suppose I were to build something on top of existing open protocols, and I want to serve 1,000,000 simultaneous players, just to scope the problem.
Suppose each player has an incoming message queue (for chat and whatnot), and on average one more incoming message queue (guilds, zones, instances, auction, ...) so we have 2,000,000 queues. A player will listen to 1-10 queues at a time. Each queue will have on average maybe 1 message per second, but certain queues will have a much higher rate and a higher number of listeners (say, an "entity location" queue for a level instance). Let's assume no more than 100 milliseconds of system queuing latency, which is OK for mildly action-oriented games (but not games like Quake or Unreal Tournament).
From other systems, I know that serving 10,000 users on a single 1U or blade box is a reasonable expectation (assuming there's nothing else expensive going on, like physics simulation or whatnot).
So, with a crossbar cluster system, where clients connect to connection gateways, which in turn connect to message queue servers, we'd get 10,000 users per gateway with 100 gateway machines, and 20,000 message queues per queue server with 100 queue machines. Again, just for general scoping. The number of connections on each MQ machine would be tiny: about 100, to talk to each of the gateways. The number of connections on the gateways would be a lot higher: 10,100 for the clients + connections to all the queue servers. (On top of this, add some connections for game world simulation servers or whatnot, but I'm trying to keep that separate for now.)
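For anyone who wants to vary the scoping numbers, here is the same back-of-envelope arithmetic as a small sketch. The constants are the figures from the paragraphs above; the variable names are only illustrative.

```python
# Back-of-envelope sizing using the figures quoted above.
PLAYERS = 1_000_000
QUEUES = 2_000_000
USERS_PER_GATEWAY = 10_000
QUEUE_SERVERS = 100

gateways = PLAYERS // USERS_PER_GATEWAY                 # gateway machines needed
queues_per_server = QUEUES // QUEUE_SERVERS             # queues hosted per MQ machine
conns_per_queue_server = gateways                       # one connection per gateway
conns_per_gateway = USERS_PER_GATEWAY + QUEUE_SERVERS   # clients + all queue servers

print(gateways, queues_per_server, conns_per_queue_server, conns_per_gateway)
```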
If I didn't want to build this from scratch, I'd have to use some messaging and/or queuing infrastructure that exists. The two open protocols I can find are AMQP and XMPP. The intended use of XMPP is a little more like what this game system would need, but the overhead is quite noticeable (XML, plus the verbose presence data, plus various other channels that have to be built on top). The actual data model of AMQP is closer to what I describe above, but all the users seem to be large, enterprise-type corporations, and the workloads seem to be workflow related, not real-time game update related.
Does anyone have any daytime experience with these technologies, or implementations thereof, that you can share?

Re 'message queue':
RabbitMQ's default operation is exactly what you describe: transient pubsub. But with TCP instead of UDP.
If you want guaranteed eventual delivery and other persistence and recovery features, then you CAN have that too - it's an option. That's the whole point of RabbitMQ and AMQP -- you can have lots of behaviours with just one message delivery system.
The model you describe is the DEFAULT behaviour, which is transient, "fire and forget", and routing messages to wherever the recipients are. People use RabbitMQ to do multicast discovery on EC2 for just that reason. You can get UDP type behaviours over unicast TCP pubsub. Neat, huh?
I am not sure if UDP would be useful here. If you turn off Nagling then RabbitMQ single message roundtrip latency (client-broker-client) has been measured at 250-300 microseconds. See here for a comparison with Windows latency (which was a bit higher) http://old.nabble.com/High%28er%29-latency-with-1.5.1--p21663105.html
I cannot think of many multiplayer games that need roundtrip latency lower than 300 microseconds. You could get below 300us with TCP. TCP windowing is more expensive than raw UDP, but if you use UDP to go faster, and add a custom loss-recovery or seqno/ack/resend manager then that may slow you down again. It all depends on your use case. If you really really really need to use UDP and lazy acks and so on, then you could strip out RabbitMQ's TCP and probably pull that off.
I hope this helps clarify why I recommended RabbitMQ for Jon's use case.

I am building such a system now, actually.
I have done a fair amount of evaluation of several MQs, including RabbitMQ, Qpid, and ZeroMQ. The latency and throughput of any of those are more than adequate for this type of application. What is not good, however, is queue creation time in the midst of half a million queues or more. Qpid in particular degrades quite severely after a few thousand queues. To circumvent that problem, you will typically have to create your own routing mechanisms (smaller number of total queues, and consumers on those queues are getting messages that they don't have an interest in).
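A minimal sketch of that kind of custom routing, assuming logical channels are hashed onto a small, fixed set of physical queues and consumers discard what they did not subscribe to. The channel names, the queue count and the helper names are hypothetical, not part of any of the brokers mentioned.

```python
import zlib
from collections import defaultdict

NUM_PHYSICAL_QUEUES = 4  # hypothetical; far smaller than the number of channels

def physical_queue_for(channel, num_queues=NUM_PHYSICAL_QUEUES):
    """Map a logical channel name onto one of a fixed set of physical queues."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(channel.encode("utf-8")) % num_queues

def deliver(physical_queue, subscriptions):
    """Consumer-side filter: keep only messages for subscribed channels."""
    return [m for m in physical_queue if m["channel"] in subscriptions]

queues = defaultdict(list)
for channel, body in [("guild:42", "hi"), ("zone:7", "spawn"), ("guild:42", "bye")]:
    queues[physical_queue_for(channel)].append({"channel": channel, "body": body})

mine = deliver(queues[physical_queue_for("guild:42")], {"guild:42"})
print([m["body"] for m in mine])
```

The trade-off is the one described above: queue creation cost disappears, but consumers may receive (and have to drop) traffic for channels that happen to share their physical queue.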
My current system will probably use ZeroMQ, but in a fairly limited way, inside the cluster. Connections from clients are handled with a custom sim. daemon that I built using libev and is entirely single-threaded (and is showing very good scaling -- it should be able to handle 50,000 connections on one box without any problems -- our sim. tick rate is quite low though, and there are no physics).
XML (and therefore XMPP) is very much not suited to this, as you'll peg the CPU processing XML long before you become bound on I/O, which isn't what you want. We're using Google Protocol Buffers, at the moment, and those seem well suited to our particular needs. We're also using TCP for the client connections. I have had experience using both UDP and TCP for this in the past, and as pointed out by others, UDP does have some advantage, but it's slightly more difficult to work with.
Hopefully when we're a little closer to launch, I'll be able to share more details.

Jon, this sounds like an ideal use case for AMQP and RabbitMQ.
I am not sure why you say that AMQP users are all large enterprise-type corporations. More than half of our customers are in the 'web' space ranging from huge to tiny companies. Lots of games, betting systems, chat systems, twittery type systems, and cloud computing infras have been built out of RabbitMQ. There are even mobile phone applications. Workflows are just one of many use cases.
We try to keep track of what is going on here:
http://www.rabbitmq.com/how.html (make sure you click through to the lists of use cases on del.icio.us too!)
My experience was with a non-open alternative, BizTalk. The most painful lesson we learnt is that these complex systems are NOT fast. And as you figured from the hardware requirements, that translates directly into significant costs.
For that reason, don't even go near XML for the core interfaces. Your server cluster will be parsing 2 million messages per second. That could easily be 2-20 GB/sec of XML! However, most messages will be for a few queues, while most queues are in fact low-traffic.
Therefore, design your architecture so that it's easy to start with COTS queue servers and then move each queue (type) to a custom queue server when a bottleneck is identified.
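The 2-20 GB/sec estimate can be reproduced with a quick calculation. The per-message sizes of 1 KB and 10 KB are an assumption, chosen because they are what the quoted range implies at 2 million messages per second.

```python
MESSAGES_PER_SECOND = 2_000_000  # 2,000,000 queues at about 1 message/s each

def xml_gigabytes_per_second(bytes_per_message):
    """Aggregate XML volume the cluster would have to parse, in GB/s."""
    return MESSAGES_PER_SECOND * bytes_per_message / 1e9

low = xml_gigabytes_per_second(1_000)     # assumed 1 KB per XML message
high = xml_gigabytes_per_second(10_000)   # assumed 10 KB per XML message
print(low, high)
```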
Also, for similar reasons, don't assume that a message queue architecture is the best for all communication needs your application has. Take your "entity location in an instance" example. This is a classic case where you don't want guaranteed message delivery. The reason that you need to share this information is because it changes all the time. So, if a message is lost, you don't want to spend time recovering it. You'd only send the old location of the affected entity. Instead, you'd want to send the current location of that entity. Technology-wise this means you want UDP rather than TCP, plus a custom loss-recovery mechanism.

FWIW, for cases where intermediate results are not important (like positioning info) Qpid has a "last-value queue" that can deliver only the most recent value to a subscriber.
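To illustrate the idea only (this is not Qpid's API or implementation), a last-value queue can be sketched in-process as a map from key to the newest unread value. The class and method names are hypothetical.

```python
class LastValueQueue:
    """Keeps only the newest message per key; stale unread values are overwritten."""

    def __init__(self):
        self._latest = {}  # key -> most recent value not yet consumed

    def publish(self, key, value):
        # Re-insert so the key moves to the end: delivery order follows last update.
        self._latest.pop(key, None)
        self._latest[key] = value

    def drain(self):
        """Return all pending (key, value) pairs and clear the queue."""
        pending = list(self._latest.items())
        self._latest.clear()
        return pending

q = LastValueQueue()
q.publish("entity-1", (0, 0))
q.publish("entity-2", (5, 5))
q.publish("entity-1", (1, 2))  # replaces the stale (0, 0)
print(q.drain())
```

A slow subscriber then sees one position per entity, however many updates were published in between.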


Multiplayer game server: How much is too much communication from the client to the server

I am making a multiplayer game (server/client) with unity and a Colyseus backend. Currently the backend sends 20 updates per second to each client. I want each client to also send approximately 20 messages to the server each second. Is this too much communication? (the messages are very small, a JSON object with 5 string fields).
I don't want to build the game and find out it is not scalable :(. So, in short: is each client sending a small message to the server 20 times a second too much?
As mentioned by Slugart, it is best to benchmark and go from there.
That being said, there are a few things you can do if you find the performance to be a bottleneck:
Lower the number of messages - generally, 20 messages per second per client might be a bit too much - games usually go with less than half of that (6-12 msg/s).
Use binary format instead of json - if the server needs to act as a relay, you could encode your messages using binary protocol. Look into protobuf or messagepack.
There are some other options available, but they are not available for javascript (as far as I know).
In case you are expecting a large number of players and want to optimize as much as possible, I would suggest switching to a backend that supports multithreading, object pooling (to reduce garbage collection time), etc., to gain the most performance.
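To give a feel for what the binary-format suggestion above buys, here is a rough size comparison. It uses Python's struct module as a stand-in for protobuf or messagepack, and the field names and types are assumptions, since the question only says "5 string fields".

```python
import json
import struct

# Hypothetical update; the field names and types are assumptions for illustration.
update = {"id": "1234", "x": "10.5", "y": "0.0", "z": "-3.25", "action": "2"}

as_json = json.dumps(update).encode("utf-8")

# Same information packed as: uint32 id, three float32 coordinates, uint8 action.
as_binary = struct.pack(
    "<IfffB",
    int(update["id"]),
    float(update["x"]),
    float(update["y"]),
    float(update["z"]),
    int(update["action"]),
)

print(len(as_json), len(as_binary))
```

How much this helps depends on whether the string fields really carry numeric data; free-form text will not shrink much.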
This depends on many things that you haven't specified, first among them how many simultaneous players and how many server instances you expect to have.
I would recommend you quickly benchmark how long the (de)serialisation of your message takes and then multiply it by the actual message volume you expect to see.
You could also create a proof of concept that does nothing except send messages at different message rates, to see for yourself how it would scale.
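A minimal sketch of that kind of benchmark, assuming a payload of five short string fields and a load of 1,000 clients at 20 msg/s (both made up for illustration). It measures Python's json module, so the absolute numbers will differ from a Node.js backend; the method is what carries over.

```python
import json
import time

def seconds_per_round_trip(message, iterations=20_000):
    """Average wall-clock time for one json.dumps + json.loads of `message`."""
    start = time.perf_counter()
    for _ in range(iterations):
        json.loads(json.dumps(message))
    return (time.perf_counter() - start) / iterations

# Hypothetical payload with five short string fields, as in the question.
sample = {"id": "1234", "x": "10.5", "y": "0.0", "z": "-3.25", "action": "2"}

per_message = seconds_per_round_trip(sample)
clients, rate = 1_000, 20  # assumed load: 1,000 clients at 20 msg/s each
cpu_fraction = per_message * clients * rate
print(f"{per_message * 1e6:.2f} us per message, {cpu_fraction:.1%} of one core")
```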

Redis vs Kafka vs RabbitMQ for 1MB messages

I am currently researching a queueing solution to handle medium sized messages of 1MB.
Besides the feature differences between Redis, Kafka and RabbitMQ, I cannot find any good answer about their performance on messages of around 1MB.
Does anyone know how many 1MB messages each of these can handle?
Do you know any other queueing solutions which can perform better?
When you are evaluating Kafka vs Redis in your case, there are other factors which you have to take into account, besides message size. Here are some of them I can think of:
How many producers/consumers? Redis performance can suffer with a larger number of producers/consumers due to the nature of Redis (push-based queue). This is because Redis delivers the message to all the consumers at once, at the moment the message is put in the queue.
Do you need speed or reliability first? If speed is of utmost importance, use Redis, since it does not persist messages and will deliver them faster. If you need reliability, use Kafka, since it persists messages even after they are delivered.
Do you want your consumers to get messages once they are ready, or do you want messages to be sent to the consumers immediately? In the first case use Kafka, because it is a pull-based mechanism (consumers have to ask for the message). In the second case use Redis, since it is a push-based mechanism (the message is pushed to the consumer once it is on the queue). RabbitMQ is also push based (although there is a pull API with bad performance).
What is the number of messages expected? If it's not huge, use Redis, since you are limited by memory. Otherwise use Kafka. Best practice for RabbitMQ is to keep queues short. This means that you should be able to consume messages at close to the rate at which they appear on the queue. So if you have some long-lasting operation on the consumer side, RabbitMQ is probably not the best choice.
Scaling? Kafka scales horizontally really well (it's built with scalability in mind). RabbitMQ is usually scaled vertically. Redis also scales well horizontally if needed.
It's obvious that there is more than one criterion when you evaluate a queueing solution. There are best practices and recommendations for each of the queueing engines that you are looking at. Think more about your specific use case; it's definitely worth the time, since it will save you time later compared with choosing an inappropriate queueing engine.
I am answering for Kafka.
Kafka itself has very good performance even for big messages.
In our tests with 2 Kafka nodes we reached p2p throughput of 170 MB/s with smaller messages and 150 MB/s with bigger messages.
The only thing you need to remember is to configure the broker to accept bigger messages.
Here is a nice article: Configuring Kafka for Performance and Resource Management - Handling Large Messages
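As a rough sketch of what "configure the broker to accept bigger messages" usually involves, the size limits have to be raised consistently on the broker, the producer and the consumer. The property names below are Kafka settings as far as I know; the 2 MB limit is an assumed value giving headroom over a 1 MB payload, and the helper only renders them in properties-file form.

```python
ONE_MB = 1024 * 1024
LIMIT = 2 * ONE_MB  # assumed headroom above the 1 MB payload

broker_settings = {
    "message.max.bytes": LIMIT,          # largest record batch the broker accepts
    "replica.fetch.max.bytes": LIMIT,    # lets followers replicate those batches
}
producer_settings = {
    "max.request.size": LIMIT,           # largest request the producer will send
}
consumer_settings = {
    "max.partition.fetch.bytes": LIMIT,  # per-partition fetch size on the consumer
}

def as_properties(settings):
    """Render a settings dict as lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())

print(as_properties(broker_settings))
```

Check the exact names and defaults against the documentation for the Kafka version you run before relying on them.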
I know of another p2p solution that might be interesting once you have concrete requirements: look at YAMI4.
I was using Redis but only for very small messages, so I cannot say anything about 1MB.

ZeroMQ pattern for load balancing work across workers based on idleness

I have a single producer and n workers that I only want to give work to when they're not already processing a unit of work and I'm struggling to find a good zeroMQ pattern.
1) REQ/REP
The producer is the requestor and creates a connection to each worker. It tracks which worker is busy and round-robins to idle workers.
Problem: how can the producer be notified of responses and still be able to send new work to idle workers without dedicating a thread in the producer to each worker?
2) PUSH/PULL
Producer pushes into one socket that all workers feed off, and workers push into another socket that the producer listens to.
Has no concept of worker idleness, i.e. work gets stuck behind long units of work.
3) PUB/SUB
Non-starter, since there is no way to make sure work doesn't get lost.
4) Reverse REQ/REP
Each worker is the REQ end and requests work from the producer and then sends another request when it completes the work
Producer has to block on a request for work until there is work (since each recv has to be paired with a send). This prevents workers from responding with work completion.
Could be fixed with a separate completion channel, but the producer still needs some polling mechanism to detect new work and stay on the same thread.
5) PAIR per worker
Each worker has its own PAIR connection allowing independent sending of work and receipt of results
Same problem as REQ/REP with requiring a thread per worker
As much as zeroMQ is non-blocking/async under the hood, I cannot find a pattern that allows my code to be asynchronous as well, rather than blocking in many many dedicated threads or polling spin-loops in fewer. Is this just not a good use case for zeroMQ?
Your problem is solved with the Load Balancing Pattern in the ZMQ Guide. It's all about flow control whilst also being able to send and receive messages. The producer will only send work requests to idle workers, whilst the workers are able to send and receive other messages at all times, e.g. abort, shutdown, etc.
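The core idea of that pattern (workers announce when they are idle, and the producer only hands work to a worker that has announced itself) can be sketched without ZeroMQ at all. The version below emulates it with standard-library threads and queues; it is not the Guide's code or the ZeroMQ API, and all names are illustrative.

```python
import queue
import threading

def worker(name, inbox, ready, results):
    """Announce idleness, take one job, report the result, repeat."""
    while True:
        ready.put(name)                  # "I am idle, send me something"
        job = inbox.get()                # exactly one unit of work at a time
        if job is None:                  # shutdown signal from the producer
            return
        results.put((name, job * job))   # stand-in for real work

def run(jobs, num_workers=3):
    ready, results = queue.Queue(), queue.Queue()
    inboxes = {f"worker-{i}": queue.Queue() for i in range(num_workers)}
    threads = [
        threading.Thread(target=worker, args=(name, inbox, ready, results))
        for name, inbox in inboxes.items()
    ]
    for t in threads:
        t.start()
    for job in jobs:
        idle = ready.get()               # blocks until at least one worker is idle
        inboxes[idle].put(job)           # work is only ever sent to an idle worker
    for _ in threads:                    # every worker reports idle once more
        inboxes[ready.get()].put(None)   # ... and is then told to stop
    for t in threads:
        t.join()
    return sorted(results.get()[1] for _ in range(len(jobs)))

print(run([1, 2, 3, 4, 5]))
```

The producer needs one thread and one wait point (the ready queue) regardless of the number of workers; in a real system it would poll the ready and results channels together.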
Push/Pull is your answer.
When you send a message in ZeroMQ, all that happens initially is that it sits in a queue waiting to be delivered to the destination(s). When it has been successfully transferred it is removed from the queue. The queue is limited in length, but can be set by changing a socket's high water mark.
There is a/some background thread(s) that manage all this on your behalf, and your calls to the ZeroMQ API are simply issuing instructions to that/those threads. The threads at either end of a socket connection are collaborating to marshal the transfer of messages, i.e. a sender won't send a message unless the recipient can receive it.
Consider what this means in a push/pull set up. Suppose one of your pull workers is falling behind. It won't then be accepting messages. That means that messages being sent to it start piling up until the high water mark is reached. ZeroMQ will no longer send messages to that pull worker. In fact AFAIK in ZeroMQ, a pull worker whose queue is more full than those of its peers will receive fewer messages, so the workload is evened out across all workers.
So What Does That Mean?
Just send the messages. Let 0MQ sort it out for you.
Whilst there's no explicit flag saying 'already busy', if messages can be sent at all then that means that some pull worker somewhere is able to receive it solely because it has kept up with the workload. It will therefore be best placed to process new messages.
There are limitations. If all the workers are full up then no messages are sent and you get blocked in the push when it tries to send another message. You can discover this only (it seems) by timing how long the zmq_send() took.
Don't Forget the Network
There's also the matter of network bandwidth to consider. Messages queued in the push will transfer at the rate at which they're consumed by the recipients, or at the speed of the network (whichever is slower). If your network is fundamentally too slow, then it's the Wrong Network for the job.
Of course, messages piling up in buffers represents latency. This can be restricted by setting the high water mark to be quite low.
This won't cure a high latency problem, but it will allow you to find out that you have one. If you have an inadequate number of pull workers, a low high water mark will result in message sending failing/blocking sooner.
Actually I think in ZeroMQ it blocks for push/pull; you'd have to measure elapsed time in the call to zmq_send() to discover whether things had got bottled up.
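A small emulation of the point about low high water marks, using a bounded standard-library queue as a stand-in for a socket buffer (this is not ZeroMQ, and the names are illustrative). With nobody draining the buffer, a non-blocking send fails as soon as the limit is hit, and a blocking send can be timed to detect the same condition.

```python
import queue
import time

HIGH_WATER_MARK = 2  # deliberately low so back-pressure shows up early

outbound = queue.Queue(maxsize=HIGH_WATER_MARK)

def try_send(message):
    """Non-blocking send: True if queued, False if the buffer is at its limit."""
    try:
        outbound.put_nowait(message)
        return True
    except queue.Full:
        return False

def timed_send(message, timeout):
    """Blocking send with a deadline; returns (accepted, seconds spent waiting)."""
    start = time.monotonic()
    try:
        outbound.put(message, timeout=timeout)
        accepted = True
    except queue.Full:
        accepted = False
    return accepted, time.monotonic() - start

results = [try_send(n) for n in range(4)]  # no consumer is draining the queue
print(results)
print(timed_send("late", timeout=0.05))
```

A long time spent inside the send call is the signal that things have got bottled up, which is the measurement suggested above.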
Thought about Nanomsg?
Nanomsg is a reboot of ZeroMQ; one of the same guys is involved. There are many things I prefer about it, and ultimately I think it will replace ZeroMQ. It has some fancier patterns which are more universally usable (PAIR works on all transports, unlike in ZeroMQ). Also the patterns are essentially a pluggable component in the source code, so it is far simpler for patterns to be developed and integrated than in ZeroMQ. There is a discussion on the differences here
Philosophical Discussion
Actor Model
ZeroMQ is definitely in the realms of Actor Model programming. Messages get stuffed into queues / channels / sockets, and at some undetermined point in time later they emerge at the recipient end to be processed.
The danger of this type of architecture is that it is possible to have the potential for deadlock without knowing it.
Suppose you have a system where messages pass both ways down a chain of processes, say instructions in one way and results in the other. It is possible that one of the processes will be trying to send a message whilst the recipient is actually also trying to send a message back to it.
That only works so long as the queues aren't full and can (temporarily) absorb the messages, allowing everyone to move on.
But suppose the network briefly became a little busy for some reason, and that delayed message transfer. The message send might then fail because the high water mark had been reached. Whoops! No one is then sending anything to anyone anymore!
A development of the Actor Model, called Communicating Sequential Processes, was invented to solve this problem. It has a restriction; there is no buffering of messages at all. No process can complete sending a message until the recipient has received all the data.
The theoretical consequence of this was that it was then possible to mathematically analyse a system design and pronounce it to be free of deadlock. The practical consequence is that if you've built a system that can deadlock, it will do so every time. That's actually not so bad; it'll show up in testing, not post-deployment.
Curiously this is hinted at in the documentation of Microsoft's Task Parallel library, where they advocate setting buffer lengths to zero in the interests of achieving a more robust application.
It'd be like setting the ZeroMQ high water mark to zero, but in zmq_setsockopt() a high water mark of 0 means no limit, not nought. The default is non-zero...
CSP is much more suited to real time applications. Any shortage of available workers immediately results in an inability to send messages (so your system knows it's failed to keep up with the real time demand) instead of resulting in an increased latency as data is absorbed by sockets, etc. (which is far harder to discover).
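To make the "no buffering" rule concrete, here is a hedged sketch of an unbuffered (rendezvous) channel built from standard-library primitives. The class is hypothetical and is not part of ZeroMQ or nanomsg. Strict CSP has no timeout on send; one is added here only so the caller can observe that no receiver was available.

```python
import queue
import threading

class Rendezvous:
    """Unbuffered channel: send() succeeds only once a receiver has taken the item."""

    def __init__(self):
        self._items = queue.Queue(maxsize=1)    # hand-over slot, never a backlog
        self._taken = threading.Semaphore(0)    # receiver signals "I have it"
        self._send_lock = threading.Lock()      # one sender in the slot at a time

    def send(self, item, timeout):
        with self._send_lock:
            self._items.put(item)
            if self._taken.acquire(timeout=timeout):
                return True
            try:
                self._items.get_nowait()        # nobody came: withdraw the item
                return False
            except queue.Empty:
                self._taken.acquire()           # receiver took it at the last moment
                return True

    def recv(self, timeout=None):
        item = self._items.get(timeout=timeout)
        self._taken.release()
        return item

ch = Rendezvous()
print(ch.send("tick", timeout=0.05))            # no receiver is waiting

received = []
t = threading.Thread(target=lambda: received.append(ch.recv(timeout=2.0)))
t.start()
print(ch.send("tick", timeout=1.0))             # a receiver is now available
t.join()
```

The failed send is the immediate "cannot keep up" signal described above, instead of latency quietly building up in a buffer.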
Unfortunately almost every communications technology we have (Ethernet, TCP/IP, ZeroMQ, nanomsg, etc) leans towards Actor Model. Everything has some sort of buffer somewhere, be it a packet buffer on a NIC or a socket buffer in an operating system.
Thus to implement CSP in the real world one has to implement flow control on top of the existing transports. This takes work, and it's slightly inefficient. But if a system needs it, it's definitely the way to go.
Personally I'd love to see 0MQ and Nanomsg adopt it as a behavioural option.

MassTransit selective consumers without round tripping

I am looking at using masstransit and have a need for selectively sending messages to consumers at the end of unreliable and slow network links (they are in the same WAN but use a slow and expensive cellular link).
I am expecting a fanout of 1 to 200, where the sites with the lowest volume of messages and the least reliable / most expensive links need to ignore the potentially high amount of message traffic other consumers will see.
I have looked at using the Selective consumer interface but this seems to imply that the message is always sent to all consumers, and then discarded if it doesn't match the predicate. This overhead is not acceptable.
Without using an endpoint factory and manually managing URI endpoints to do a Send(), is there a nice way to do this using subscriptions?
Simple answer: nope.
You do have a few options though. Is it just routing based upon load/processing? You could use competing consumers to do load balancing. All the endpoints read off the same queue (but they must be the same consumers on every process reading from the queue) and just pick up the next one. If you're slow, you just pick off fewer messages. (You can only use competing consumers with RabbitMQ).
For MSMQ there's a distributor that was built for load balancing. You could look at rebuilding that on top of RabbitMQ if that's your transport. It's not super complicated, but would take some effort to do.
Other than that, I think you're likely down to writing something from scratch. It's not really pub/sub any more. So it falls outside MT's wheelhouse.

Active MQ load balancing to achieve high throughput

Currently my ActiveMQ configuration (non-persistent messaging) allows me to achieve 2000 msgs/sec. There are four queues and four consumers consuming the messages. There's only one ActiveMQ broker in this configuration. I would like to achieve a higher throughput of about 5000 msgs/sec (by adding brokers). I'm pretty clueless on how to achieve this without splitting individual queues onto individual ActiveMQ instances. What are the topologies that support higher throughput than an individual instance without splitting the queues among instances?
Adding a network of brokers might help. That is if you have a decent number of consumers and a decent number of producers connecting to different brokers.
If you have a single producer or a single consumer, all traffic will still go over one of the brokers, making it the bottleneck in any case. So, your actual setup of the servers using the AMQ broker is important.
You will also need to check what's the bottleneck of your physical machines. Is it I/O? CPU? Memory usage/heap size? Even Linkspeed? Use OS tools together with visualvm to track this down. Then you at least know what kind of server you need next.
In any case, some semi-manual load balancing is always possible over several nodes, whether you are using a network of brokers or not. Just make sure messages are routed through certain brokers depending on their content or whatnot. If you cannot distinguish between different message types in any logical way, you can do things like finding some integer number in the message (be it client IP, yesterday's temperature in Celsius or whatever), and do a number modulo <num brokers>. Then route it to the destination you selected. Round robin is also an option. There is almost always a way to distribute the load in a logical way among several brokers.
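The modulo trick can be sketched in a few lines. The broker names are hypothetical, and the CRC32 fallback for non-integer keys is my own assumption, added so that string keys such as client IDs also map to a stable broker.

```python
import zlib

BROKERS = ["broker-a", "broker-b", "broker-c"]  # hypothetical broker names

def broker_for(key, brokers=BROKERS):
    """Pick a broker for a message by taking a number from it modulo the broker count."""
    if isinstance(key, int):
        number = key
    else:
        # Assumption: hash non-integer keys with CRC32 so the mapping is stable
        # across processes (Python's built-in hash() for str is randomised).
        number = zlib.crc32(str(key).encode("utf-8"))
    return brokers[number % len(brokers)]

print(broker_for(7), broker_for(9), broker_for("client-10.0.0.5"))
```

Producers and consumers must agree on the same function and broker list, otherwise messages end up on a broker nobody is reading from.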
