phoenix channels and their relation to sockets - performance

I need some advice about elixir/phoenix channels. I have an application that is related to venue changes and in order to reduce the amount of data sent to each client I only want each client to subscribe to the venues it cares about.
With this in mind I was thinking of going down the route of having a channel for "VenueChanges/*" and having each client subscribe to the channel several times with each venue id it cares about i.e. "VenueChanges/1", "VenueChanges/2" etc.
The venues that the client care about will change frequently which will mean a lot of joining and leaving channels.
My question is, what's the overhead of having a client join a channel lots of times. Am I correct in assuming that there would still only be one socket open and not a new socket for each of the channels joined?
Also any advice on managing the constant joining and leaving of channels from the client? Any other advice in general? If this is a bad idea what are better alternatives?

With respect to the socket question, you are correct in that you will still only have one socket per client (multiple channels are multiplexed over that one socket).
While not directly answering your consistent join/leave question, Chris McCord's post on Phoenix Channels vs Rails Action Cable has some really good data on performance best summarized by:
With Phoenix, we've shown that channels performance remains consistent
as notification demand increases, which is essential for handling
traffic spikes and avoiding overload
That said, your server hardware and deployment distribution strategy would also play a significant role in answering that concern.
Lastly, on the basis that you meant join/leaving channel topics (or "rooms" as it's termed in some places) as seen in the Chris's test with 55,000 connections:
It's important to note that Phoenix maintains the same responsiveness when broadcasting for both the 50 and 200 users per room tests.

Related

How to manage repeated requests on a cached server while the result arrives

In the context of a highly requested web service written in go language, I am considering to cache some computations. For that, I am thinking to use Redis.
My application is susceptible to receiving an avalanche of requests containing the same payload that triggers a costly computation. So a cache would reward and allow to compute once.
Consider the following figure extracted from here
I use this figure because I think it helps me illustrate the problem. The figure considers the general two cases: the book is in the cache, or this one is not in. However, the figure does not consider the transitory case when a book is being retrieved from the database and other "get-same-book" requests arrive. In this case, I would like to queue the repeated requests temporarily until the book is retrieved. Next, once the book has already arrived, the queued requests are replied with the result which would remain in the cache for fast retrieving of future requests.
So my question asks for approaches for implementing this requirement. I'm considering to use a kind of table on the server (repository) that writes the status of a query database (computing, ready), but this seems a little complicated, because I would need to handle some race conditions.
So I would like to know if anyone knows this pattern or if Redis itself implements it in some way (I have not found it in my consultations, but I suspect that using a Redis lock would be possible)
You can design it as you have described. But there is some things that are important.
Use a unique key
Use an unique key for each book, and if the book is ever changed, that key should also change. This design makes your step (6) save the book in Redis an idempotent operation (you can do it many times with the same result). So you will avoid any race condition with "get-same-book".
Idempotent requests OR asynchronous messages
I would like to queue the repeated requests temporarily until the book is retrieved. Next, once the book has already arrived, the queued requests are replied with the result
I would not recommend to queue requests as you describe it. If the request is a cache-miss, let it retrieve it from the database - but design it idempotent. Alternatively, you should handle all requests as asynchronous, and use a message queue e.g. nats, RabbitMQ or something, but the complexity grows with that solution.
Serializing requests
My problem is that while that second of computation where the result is not still gotten, too many repeated requests can arrive and due to the cost I need to avoid to repeat their computations. I need to find a way of retaining them while the result of the first request arrives.
It sounds like you want to have your computations serialized instead of doing them concurrently because you want to avoid doing the same computation twice. To solve this, you should let the requests initialize the computation, e.g. by putting the input on a queue and then do the computation in a serial order (but still possibly concurrently if they have a different key) and finally notify the client, or if the client is subscribing for updates (a better solution).
Redis do have support for PubSub but it depends on what requirements you have on the clients. I would recommend a solution without locks, for scalability.

Advice on pubsub topic division based on geohashes for ably websocket connection service

My question concerns the following use case:
Use case actors
User A: The user who sets a broadcast region and views stream with live posts.
User B: The first user who sends a broadcast message from within the broadcast region set by user A.
User C: The second user who sends a broadcast message from within the broadcast region set by user A.
Use case description
User A selects a broadcast region within which boundaries (radius) (s)he wants to receive live broadcast messages.
User A opens the livefeed and requests an initial set of livefeed items.
User B broadcasts a message from within the broadcast region of user A while user A’s livefeed is still open.
A label with 1 new livefeed item appears at the top of User A’s livefeed while it is open.
As user C publishes another livefeed post from within the selected broadcast region from user A, the label counter increments.
User A receives a notification similar to this example of Facebook:
The solution I thought to apply (and which I think Pubnub uses), is to create a topic per geohash.
In my case that would mean that for every user who broadcasted a message, it needs to be published to the geohash-topic, and clients (app / website users) would consume the geohash-topic through a websocket if it fell within the range of the defined area (radius). Ably seems to provide this kind of scalable service using web sockets.
I guess it would simplified be something like this:
So this means that a geohash needs to be extracted from the current location from where the broadcast message is sent. This geohash should have granular scale that is small enough so that the receiving user can set a broadcast region that is more or less accurate. (I.e. the geohash should have enough accuracy if we want to allow users to define a broadcast region within which to receive live messages, which means that one should expect a quite large amount of topics if we decided to scale).
Option 2 would be to create topics for a geohash that has a less specific granularity (covering a larger area), and let clients handle the accuracy based on latlng values that are sent along with the message.
The client would then decide whether or not to drop messages. However, this means more messages are sent (more overhead), and a higher cost.
I don't have experience with this kind of architecture, and question the viability / scalability of this approach.
Could you think of an alternate solution to this question to achieve the desired result or provide more insight on how to solve this kind of problem overall? (I also considered using regular req-res flow, but this means spamming the server, which also doesn't seem like a very good solution).
I actually checked.
Given a region of 161.4 km² (like region Brussels), the division of geohashes by length of the string is as follows:
1 ≤ 5,000km × 5,000km
2 ≤ 1,250km × 625km
3 ≤ 156km × 156km
4 ≤ 39.1km × 19.5km
5 ≤ 4.89km × 4.89km
6 ≤ 1.22km × 0.61km
7 ≤ 153m × 153m
8 ≤ 38.2m × 19.1m
9 ≤ 4.77m × 4.77m
10 ≤ 1.19m × 0.596m
11 ≤ 149mm × 149mm
12 ≤ 37.2mm × 18.6mm
Given that we would allow users to have a possible inaccuracy up to 153m (on the region to which users may want to subscribe to receive local broadcast messages), it would require an amount of topics that is definitely already too large to even only cover the entire region of Brussels.
So I'm still a bit stuck at this level currently.
1. PubNub
PubNub is currently the only service that offers an out of the box geohash pub-sub solution over websockets, but their pricing is extremely high (500 connected devices cost about 49$, 20k devices cost 799$) UPDATE: PubNub has updated price, now with unlimited devices. Website updates coming soon.
Pubnub is working on their pricing model because some of their customers were paying a lot for unexpected spikes in traffic.
However, it will not be a viable solution for a generic broadcasting messaging app that is meant to be open for everybody, and for which traffic is therefore very highly unpredictable.
This is a pity, since this service would have been the perfect solution for us otherwise.
2. Ably
Ably offers a pubsub system to stream data to clients over websockets for custom channels. Channels are created dynamically when a client attaches itself in order to either publish or subscribe to that channel.
The main problem here is that:
If we want high geohash accuracy, we need a high number of channels and hence we have to pay more;
If we go with low geohash accuracy, there will be a lot of redundant messaging:
Let's say that we take a channel that is represented by a geohash of 4 characters, spanning a geographical area of 39.1 x 19.5 km.
Any post that gets sent to that channel, would be multiplexed to everybody within that region who is currently listening.
However, let's say that our app allows for a maximum radius of 10km, and half of the connected users has its setting to a 1km radius.
This means that all posts outside of that 2km radius will be multiplexed to these users unnecessarily, and will just be dropped without having any further use.
We should also take into account the scalability of this approach. For every geohash that either producer or consumer needs, another channel will be created.
It is definitely more expensive to have an app that requires topics based on geohashes worldwide, than an app that requires only theme-based topics.
That is, on world-wide adoption, the number of topics increases dramatically, hence will the price.
Another consideration is that our app requires an additional number of channels:
By geohash and group: Our app allows the possibility to create geolocation based groups (which would be the equivalent of Twitter like #hashtags).
By place
By followed users (premium feature)
There are a few optimistic considerations to this approach despite:
Streaming is only required when the newsfeed is active:
when the user has a browser window open with our website +
when the user is on a mobile device, and actively has the related feed open
Further optimisation can be done, e.g. only start streaming as from 10
to 20 seconds after refresh of the feed
Streaming by place / followed users may have high traffic depending on current activity, but many place channels will be idle as well
A very important note in this regard is how Ably bills its consumers, which can be used to our full advantage:
A channel is opened when any of the following happens:
A message is published on the channel via REST
A realtime client attaches to the channel. The channel remains active for the entire time the client is attached to that channel, so
if you connect to Ably, attach to a channel, and publish a message but
never detach the channel, the channel will remain active for as long
as that connection remains open.
A channel that is open will automatically close when all of the
following conditions apply:
There are no more realtime clients attached to the channel At least
two minutes has passed since the last message was published. We keep
channels alive for two minutes to ensure that we can provide
continuity on the channel as part of our connection state recovery.
As an example, if you have 10,000 users, and at your busiest time of
the month there is a single spike where 500 customers establish a
realtime connection to Ably and each attach to one unique channel and
one global shared channel, the peak number of channels would be the
sum of the 500 unique channels per client and the one global shared
channel i.e. 501 peak channels. If throughout the month each of those
10,000 users connects and attaches to their own unique channel, but
not necessarily at the same time, then this does not affect your peak
channel count as peak channels is the concurrent number of channels
open at any point of time during that month.
Optimistic conclusion
The most important conclusion is that we should consider that this feature may not be as crucial as believe it is for a first version of the app.
Although Twitter, Facebook, etc offer this feature of receiving live updates (and users have grown to expect it), an initial beta of our app on a limited scale can work without, i.e. the user has to refresh in order to receive new updates.
During a first launch of the app, statistics can ba gathered to gain more insight into detailed user behaviour. This will enable us to build more solid infrastructural and financial reflections based on factual data.
Putting aside the question of Ably, Pubnub and a DIY solution, the core of the question is this:
Where is message filtering taking place?
There are three possible solution:
The Pub/Sub service.
The Server (WebSocket connection handler).
Client side (the client's device).
Since this is obviously a mobile oriented approach, client side message filtering is extremely rude, as it increases data consumption by the client while much of the data might be irrelevant.
Client side filtering will also increase battery consumption and will likely result in lower acceptance rates by clients.
This leaves pub/sub filtering (channel names / pattern matching) and server-side filtering.
Pub/Sub channel name filtering
A single pub/sub service serves a number of servers (if not all of them), making it a very expensive resource (relative to the resources we have at hand).
Using channel names to filter messages would be ideal - as long as the filtering is cheap (using exact matches with channel name hash mapping).
However, pattern matching (when subscribing to channels with inexact names, such as "users.*") is very expansive when compared to exact pattern matching.
This means that Pub/Sub channel name filtering can't be used to filter all the messages without overloading the pub/sub system.
Server side filtering
Since a server accepts WebSocket connections and bridges between the WebSocket and the pub/sub service, it's in an ideal position to filter the messages.
However, we don't want the server to process all the messages for all the clients for each connection, as this is an extreme duplication of effort.
Hybrid solution
A classic solution would divide the earth into manageable sections (1 sq. km per section will require 510.1 million unique channel names for full coverage... but I would suggest that the 70% ocean space should be neglected).
Busy sections might be subdivided (NYC might require a section per 250 sq meters rather than 1 sq kilometer).
This allows publishers to publish to exact channel names and subscribers to subscribe to exact channel names.
Publishers might need to publish to more than one channel and subscribers might need to subscribe to more than one channel, depending on their exact location and the grid's borders.
This filtering scheme will filter much, but not all.
The server node will need to look into the message, review it's exact geo-location and filter messages before deciding if they should be sent along the WebSocket connection to the client.
Why the Hybrid Solution?
This allows the system to scale with relative ease.
Since server nodes are (by design) cheaper than the pub/sub service, they could be used to handle the exact location filtering (the heavy work).
At the same time, the strength of the pub/sub system can be used to minimize the server's workload and filter the obvious mis-matches.
Pubnub vs. Ably?
I don't know. I didn't use either of them. I worked with Redis and implemented my own pub/sub solution.
I assume they are both great and it's really up to your needs.
Personally I prefer the DIY approach when it comes to customized or complex situations. IMHO, this seems like it would fall into the DIY category if I were to implement it.

ZeroMQ pattern for load balancing work across workers based on idleness

I have a single producer and n workers that I only want to give work to when they're not already processing a unit of work and I'm struggling to find a good zeroMQ pattern.
1) REQ/REP
The producer is the requestor and creates a connection to each worker. It tracks which worker is busy and round-robins to idle workers
Problem:
How to be notified of responses and still able to send new work to idle workers without dedicating a thread in the producer to each worker?
2) PUSH/PULL
Producer pushes into one socket that all workers feed off, and workers push into another socket that the producer listens to.
Problem:
Has no concept of worker idleness, i.e. work gets stuck behind long units of work
3) PUB/SUB
Non-starter, since there is no way to make sure work doesn't get lost
4) Reverse REQ/REP
Each worker is the REQ end and requests work from the producer and then sends another request when it completes the work
Problem:
Producer has to block on a request for work until there is work (since each recv has to be paired with a send ). This prevents workers to respond with work completion
Could be fixed with a separate completion channel, but the producer still needs some polling mechanism to detect new work and stay on the same thread.
5) PAIR per worker
Each worker has its own PAIR connection allowing independent sending of work and receipt of results
Problem:
Same problem as REQ/REP with requiring a thread per worker
As much as zeroMQ is non-blocking/async under the hood, I cannot find a pattern that allows my code to be asynchronous as well, rather than blocking in many many dedicated threads or polling spin-loops in fewer. Is this just not a good use case for zeroMQ?
Your problem is solved with the Load Balancing Pattern in the ZMQ Guide. It's all about flow control whilst also being able to send and receive messages. The producer will only send work requests to idle workers, whilst the workers are able to send and receive other messages at all times, e.g. abort, shutdown, etc.
Push/Pull is your answer.
When you send a message in ZeroMQ, all that happens initially is that it sits in a queue waiting to be delivered to the destination(s). When it has been successfully transferred it is removed from the queue. The queue is limited in length, but can be set by changing a socket's high water mark.
There is a/some background thread(s) that manage all this on your behalf, and your calls to the ZeroMQ API are simply issuing instructions to that/those threads. The threads at either end of a socket connection are collaborating to marshall the transfer of messages, i.e. a sender won't send a message unless the recipient can receive it.
Consider what this means in a push/pull set up. Suppose one of your pull workers is falling behind. It won't then be accepting messages. That means that messages being sent to it start piling up until the highwater mark is reached. ZeroMQ will no longer send messages to that pull worker. In fact AFAIK in ZeroMQ, a pull worker whose queue is more full than those of its peers will receive less messages, so the workload is evened out across all workers.
So What Does That Mean?
Just send the messages. Let 0MQ sort it out for you.
Whilst there's no explicit flag saying 'already busy', if messages can be sent at all then that means that some pull worker somewhere is able to receive it solely because it has kept up with the workload. It will therefore be best placed to process new messages.
There are limitations. If all the workers are full up then no messages are sent and you get blocked in the push when it tries to send another message. You can discover this only (it seems) by timing how long the zmq_send() took.
Don't Forget the Network
There's also the matter of network bandwidth to consider. Messages queued in the push will tranfer at the rate at which they're consumed by the recipients, or at the speed of the network (whichever is slower). If your network is fundamentally too slow, then it's the Wrong Network for the job.
Latency
Of course, messages piling up in buffers represents latency. This can be restricted by setting the high water mark to be quite low.
This won't cure a high latency problem, but it will allow you to find out that you have one. If you have an inadequate number of pull workers, a low high water mark will result in message sending failing/blocking sooner.
Actually I think in ZeroMQ it blocks for push/pull; you'd have to measure elapsed time in the call to zmq_send() to discover whether things had got bottled up.
Thought about Nanomsg?
Nanomsg is a reboot of ZeroMQ, one of the same guys is involved. There's many things I prefer about it, and ultimately I think it will replace ZeroMQ. It has some fancier patterns which are more universally usable (PAIR works on all transports, unlike in ZeroMQ). Also the patterns are essentially a plugable component in the source code, so it is far simpler for patterns to be developed and integrated than in ZeroMQ. There is a discussion on the differences here
Philisophical Discussion
Actor Model
ZeroMQ is definitely in the realms of Actor Model programming. Messages get stuffed into queues / channels / sockets, and at some undetermined point in time later they emerge at the recipient end to be processed.
The danger of this type of architecture is that it is possible to have the potential for deadlock without knowing it.
Suppose you have a system where messages pass both ways down a chain of processes, say instructions in one way and results in the other. It is possible that one of the processes will be trying to send a message whilst the recipient is actually also trying to send a message back to it.
That only works so long as the queues aren't full and can (temporarily) absorb the messages, allowing everyone to move on.
But suppose the network briefly became a little busy for some reason, and that delayed message transfer. The message send might then fail because the high water mark had been reached. Whoops! No one is then sending anything to anyone anymore!
CSP
A development of the Actor Model, called Communicating Sequential Processes, was invented to solve this problem. It has a restriction; there is no buffering of messages at all. No process can complete sending a message until the recipient has received all the data.
The theoretical consequence of this was that it was then possible to mathematically analyse a system design and pronounce it to be free of deadlock. The practical consequence is that if you've built a system that can deadlock, it will do so every time. That's actually not so bad; it'll show up in testing, not post-deployment.
Curiously this is hinted at in the documentation of Microsoft's Task Parallel library, where they advocate setting buffer lengths to zero in the intersts of achieving a more robust application.
It'd be like setting the ZeroMQ high water mark to zero, but in zmq_setsockopt() 0 means default, not nought. The default is non-zero...
CSP is much more suited to real time applications. Any shortage of available workers immediately results in an inability to send messages (so your system knows it's failed to keep up with the real time demand) instead of resulting in an increased latency as data is absorbed by sockets, etc. (which is far harder to discover).
Unfortunately almost every communications technology we have (Ethernet, TCP/IP, ZeroMQ, nanomsg, etc) leans towards Actor Model. Everything has some sort of buffer somewhere, be it a packet buffer on a NIC or a socket buffer in an operating system.
Thus to implement CSP in the real world one has to implement flow control on top of the existing transports. This takes work, and it's slightly inefficient. But if a system that needs it, it's definitely the way to go.
Personally I'd love to see 0MQ and Nanomsg to adopt it as a behavioural option.

MassTransit selective consumers without round tripping

I am looking at using masstransit and have a need for selectively sending messages to consumers at the end if unreliable and slow network links (they are in the same WAN but use a slow and expensive cellular link).
I am expecting a fanout of 1 to 200 where the sites with lowest volume of messages and least reliable / most expensive links need to ignore the potentially high amount of message traffic othe consumers will see
I have looked at using the Selective consumer interface but this seems to imply that the message is always sent to all consumers, and then discarded if it doesn't match the predicate. This overhead is not acceptable.
Without using endpoint factory and manually managing uri end points to do a Send(), is there a nice way to do thus using subscriptions?
Simple answer: nope.
You do have a few options though. Is it just routing based upon load/processing? You could use competing consumers to do load balancing. All the endpoints read off the same queue (but they must be the same consumers on every process reading from the queue) and just pick up the next one. If you're slow, you just pick off fewer messages. (You can only use competing consumers with RabbitMQ).
For MSMQ there's a distributor that was built for load balancing. You could look at rebuilding that on top of RabbitMQ that if that's your transport. It's not super complicated, but would take some effort to do.
Other than that, I think you're likely down to writing something from scratch. It's not really pub/sub any more. So it falls outside MT's wheelhouse.

Low-latency, large-scale message queuing

I'm going through a bit of a re-think of large-scale multiplayer games in the age of Facebook applications and cloud computing.
Suppose I were to build something on top of existing open protocols, and I want to serve 1,000,000 simultaneous players, just to scope the problem.
Suppose each player has an incoming message queue (for chat and whatnot), and on average one more incoming message queue (guilds, zones, instances, auction, ...) so we have 2,000,000 queues. A player will listen to 1-10 queues at a time. Each queue will have on average maybe 1 message per second, but certain queues will have much higher rate and higher number of listeners (say, a "entity location" queue for a level instance). Let's assume no more than 100 milliseconds of system queuing latency, which is OK for mildly action-oriented games (but not games like Quake or Unreal Tournament).
From other systems, I know that serving 10,000 users on a single 1U or blade box is a reasonable expectation (assuming there's nothing else expensive going on, like physics simulation or whatnot).
So, with a crossbar cluster system, where clients connect to connection gateways, which in turn connect to message queue servers, we'd get 10,000 users per gateway with 100 gateway machines, and 20,000 message queues per queue server with 100 queue machines. Again, just for general scoping. The number of connections on each MQ machine would be tiny: about 100, to talk to each of the gateways. The number of connections on the gateways would be alot higher: 10,100 for the clients + connections to all the queue servers. (On top of this, add some connections for game world simulation servers or whatnot, but I'm trying to keep that separate for now)
If I didn't want to build this from scratch, I'd have to use some messaging and/or queuing infrastructure that exists. The two open protocols I can find are AMQP and XMPP. The intended use of XMPP is a little more like what this game system would need, but the overhead is quite noticeable (XML, plus the verbose presence data, plus various other channels that have to be built on top). The actual data model of AMQP is closer to what I describe above, but all the users seem to be large, enterprise-type corporations, and the workloads seem to be workflow related, not real-time game update related.
Does anyone have any daytime experience with these technologies, or implementations thereof, that you can share?
#MSalters
Re 'message queue':
RabbitMQ's default operation is exactly what you describe: transient pubsub. But with TCP instead of UDP.
If you want guaranteed eventual delivery and other persistence and recovery features, then you CAN have that too - it's an option. That's the whole point of RabbitMQ and AMQP -- you can have lots of behaviours with just one message delivery system.
The model you describe is the DEFAULT behaviour, which is transient, "fire and forget", and routing messages to wherever the recipients are. People use RabbitMQ to do multicast discovery on EC2 for just that reason. You can get UDP type behaviours over unicast TCP pubsub. Neat, huh?
Re UDP:
I am not sure if UDP would be useful here. If you turn off Nagling then RabbitMQ single message roundtrip latency (client-broker-client) has been measured at 250-300 microseconds. See here for a comparison with Windows latency (which was a bit higher) http://old.nabble.com/High%28er%29-latency-with-1.5.1--p21663105.html
I cannot think of many multiplayer games that need roundtrip latency lower than 300 microseconds. You could get below 300us with TCP. TCP windowing is more expensive than raw UDP, but if you use UDP to go faster, and add a custom loss-recovery or seqno/ack/resend manager then that may slow you down again. It all depends on your use case. If you really really really need to use UDP and lazy acks and so on, then you could strip out RabbitMQ's TCP and probably pull that off.
I hope this helps clarify why I recommended RabbitMQ for Jon's use case.
I am building such a system now, actually.
I have done a fair amount of evaluation of several MQs, including RabbitMQ, Qpid, and ZeroMQ. The latency and throughput of any of those are more than adequate for this type of application. What is not good, however, is queue creation time in the midst of half a million queues or more. Qpid in particular degrades quite severely after a few thousand queues. To circumvent that problem, you will typically have to create your own routing mechanisms (smaller number of total queues, and consumers on those queues are getting messages that they don't have an interest in).
My current system will probably use ZeroMQ, but in a fairly limited way, inside the cluster. Connections from clients are handled with a custom sim. daemon that I built using libev and is entirely single-threaded (and is showing very good scaling -- it should be able to handle 50,000 connections on one box without any problems -- our sim. tick rate is quite low though, and there are no physics).
XML (and therefore XMPP) is very much not suited to this, as you'll peg the CPU processing XML long before you become bound on I/O, which isn't what you want. We're using Google Protocol Buffers, at the moment, and those seem well suited to our particular needs. We're also using TCP for the client connections. I have had experience using both UDP and TCP for this in the past, and as pointed out by others, UDP does have some advantage, but it's slightly more difficult to work with.
Hopefully when we're a little closer to launch, I'll be able to share more details.
Jon, this sounds like an ideal use case for AMQP and RabbitMQ.
I am not sure why you say that AMQP users are all large enterprise-type corporations. More than half of our customers are in the 'web' space ranging from huge to tiny companies. Lots of games, betting systems, chat systems, twittery type systems, and cloud computing infras have been built out of RabbitMQ. There are even mobile phone applications. Workflows are just one of many use cases.
We try to keep track of what is going on here:
http://www.rabbitmq.com/how.html (make sure you click through to the lists of use cases on del.icio.us too!)
Please do take a look. We are here to help. Feel free to email us at info#rabbitmq.com or hit me on twitter (#monadic).
My experience was with a non-open alternative, BizTalk. The most painful lesson we learnt is that these complex systems are NOT fast. And as you figured from the hardware requirements, that translates directly into significant costs.
For that reason, don't even go near XML for the core interfaces. Your server cluster will be parsing 2 million messages per second. That could easily be 2-20 GB/sec of XML! However, most messages will be for a few queues, while most queues are in fact low-traffic.
Therefore, design your architecture so that it's easy to start with COTS queue servers and then move each queue (type) to a custom queue server when a bottleneck is identified.
Also, for similar reasons, don't assume that a message queue architecture is the best for all comminication needs your application has. Take your "entity location in an instance" example. This is a classic case where you don't want guaranteed message delivery. The reason that you need to share this information is because it changes all the time. So, if a message is lost, you don't want to spend time recovering it. You'd only send the old locatiom of the affected entity. Instead, you'd want to send the current location of that entity. Technology-wise this means you want UDP, not TCP and a custom loss-recovery mechanism.
FWIW, for cases where intermediate results are not important (like positioning info) Qpid has a "last-value queue" that can deliver only the most recent value to a subscriber.

Resources