I plan to make a system for distributing VM images among several stations using BitTorrent protocol. Current system looks as follows:
|-[room with 20PCs]-
[srv_with_images]-->--[1Gbps-bottlneck]-->--|
|-[2nd room with 20PCs]-
All the PCs at once are downloading images through the 1Gbps bottleneck every night and it takes a lot of time. We plan to use BitTorrent to speed up the distribution of images using peer-to-peer exchange between all the PCs. However there is problem - when image appears on the origin server it starts to act as a single seed from whom all peers are downloading the file simultaneously. So we again fall into the trap of the bottleneck. To speed up the distribution we need to implement (at least we think that we need) an abstract high-level algorithm that:
Ensures the on the beggining when new image arrives only small portion of stations will be downloading the image from origin,
When the small portion will start seeding, rest of, or another bigger portion of PCs will start peering, or they will be peering only from the PCs in class, not from origin,
It shouldnt rely on "static" list of initial peers, as some computers may be offline during the day. We cant assume that any of the computers will always be up&running. A peer may also be turned off anytime.
Are there any specific algorithms that can help us desinging this? The most naive way would be to just keep active servers list somewhere and make some daemon that will be choosing initial peers for each torrent. But maybe there are some more elegant ways to do that kind of stuff??
Another option would be to ensure that only some peers ca download from origin, and rest of the peers do download from each other(but not from origin) - is it possible in BitTorrent protocol?
If you are using bittorrent no special coordination is necessary.
Peers behind the bottleneck can directly talk to each other and share the bandwidth. Using the rarest-first piece picking algorithm will mostly ensure that they download different pieces from the server and then share them with each other.
LSD may help to speed up lan-local discovery but it should work with a normal tracker too if there are no NAT shenanigans in play.
When performing AJAX requests, I have always tried to do as few as possible since there is an overhead to each request having to open the http connection to send the data. Since a websocket connection is constantly open, is there any cost outside of the obvious packet bandwidth to sending a request?
For example. Over the space of 1 minute, a client will send 100kb of data to the server. Assuming the client does not need a response to any of these requests, is there any advantage to queuing packets and sending them in one big burst vs sending them as they are ready?
In other words, is there an overhead to the stopping and starting data transfer for a connection that is constantly open?
I want to make a multiplayer browser game as real time as possible, but I don't want to find that 100s of tiny requests per minute compared to a larger consolidated request is causing the server additional stress. I understand that if the client needs a response it will be slower as there is a lot of waiting from the back and forth. I will consider this and only consolidate when it is appropriate. The more smaller requests per minute, the better user experience, but I don't know what toll it will have on the server.
You are correct that a webSocket message will have lower overhead for a given message transmission than sending the same message via an Ajax call because the webSocket connection is already established and because a webSocket message has lower overhead than an HTTP request.
First off, there's always less overhead in sending one larger transmission vs. sending lots of smaller transmissions. That's just the nature of TCP. Every TCP packet gets separately processed and acknowledged so sending more of them costs a bit more overhead. Whether that difference is relevant or significant and worth writing extra code for or worth sacrificing some element of your user experience (because of the delay for batching) depends entirely upon the specifics of a given situation.
Since you've described a situation where your client gets the best experience if there is no delay and no batching of packets, then it seems that what you should do is not implement the batching and test out how your server handles the load with lots of smaller packets when it gets pretty busy. If that works just fine, then stay with the better user experience. If you have issues keeping up with the load, then seriously profile your server and find out where the main bottleneck to performance is (you will probably be surprised about where the bottleneck actually is as it is often not where you think it will be - that's why you have to profile and measure to know where to concentrate your energy for improving the scalability).
FYI, due to the implementation of Nagel's algorithm in most implementations of TCP, the TCP stack itself does small amounts of batching for you if you are sending multiple requests fairly closely spaced in time or if sending over a slower link.
It's also possible to implement a dynamic system where as long as your server is able to keep up, you keep with the smaller and more responsive packets, but if your server starts to get busy, you start batching in order to reduce the number of separate transmissions.
This question is for an indication/hunch. I realize that it may have been discussed before and that there is no good, scientific answer; nevertheless i seek for experienced/qualified opinions, as there are no definite answers to be found. An indication will be valuable as a clue, hence i ask the community to allow a bit of fuzzyness.
Background:
Consider a very-large-area 3D simulation
with n participants (peers, people behind NAT) distributed over multiple cities.
where each participant is seen as one "moving object" in the simulation (hence each moving object is owned by a peer).
where each peer shall see all other moving object correctly (ie. positional updates are needed).
(The entire simulation is larger, so we now focus on one single blob, and consider it to be the entire "world").
Scale:
World/blob size 10x10 kilometers (almost flat world).
Object size: Length max 10 meters
(We omit things like occlusion, optmisations, balancing etc. Assume that all there is needs to be seen and updated).
The nature of "moving object":
it is physically/positionally restless (compare to a boat in big
waves).
it's movement must be sync'ed to all peers (but individual sync does not need to be simultaneus with other syncs).
if X sees one, but does not own it, it will behave well (deterministically, by X's local physics calculation) for maybe 1 second, but after that it will diverge (due to different frame rates) and needs a positional update (a UDP packet) from it's owner.
From a peers point of view:
He needs to update n-1 other peers
He need to receive updates from n-1 other peers
The positional updates are the critical ones, so focus only on those. One update is ca 20-30 doubles, ca. 200 bytes. Consider UDP only.
As i see it, there are two options. The first one is serverless, where everything works solely on peer2peer communication. The second one is having a server (one, for now) in the middle.
1. Serverless, p2p
Each peer must talk with many other peers. One problem is that "Nagle'ing" is useless. First reason is that all endpoints are different, Second is that the local data changes from frame to frame, and there is no point in accumulating multiple frames' data, to send in a larger packet, more sparsely. The oldest frames' data would be outdated. An advantage is however not being dependant on a server.
2. Server-supported
Each peer sends it's info to a high-performance, high-bandwidth server which is able to better receive and distribute to all peers, at a fast rate. Similarly, any peer would receive all peers' data from one endpoint only, the server.
Naurally, each peer runs a game loop.
Question: Hopefully based on some kind of experience, what would you throw as a maximum functional number of peers for case 1, case 2? Thx.
It is difficult to quantify, but for such a level of all-to-all synchronization I would recomend centralized control.
In p2p mode each peer would send n-1 and receive n-1 packets each pseudo-round. In centralized mode they would receive n-1 packets, but would send only 1, spending less time in this task. So centralized mode seems to be more scallable.
A server can check if update messages are consistent before delivering them. In p2p, each peer would have to deal with instable or disconnected peers, which could be better managed by a server.
In centralized mode update-time has to be better carefully choosen, because clients are more susceptible to experience higher latencies, as each packet has to travel towards the server, and then back to the clients. Choosing the best server for clients is one thing to consider.
Combining packets could make the information traverse the network faster, but as outdating of data is an issue, try to make sure each packet is as small as possible, transmission time is smaller in this case.
assume distributed systems network. Each system measures a value. There is a correct decision to be made in consensus by all systems depending on all values. communication links may drop. Is there a voting and synchronization algorithm for this case?
Examples of voting algorithm in distributed systems:
Bully algorithm (http://en.wikipedia.org/wiki/Bully_algorithm)
Chang and Roberts algorithm (http://en.wikipedia.org/wiki/Chang_and_Roberts_algorithm)
I have solved a similar problem. It is a failure detection scheme, so I'll describe it in those terms instead of the generic terms of the OP.
Clients ping our servers periodically, and after some time of no pings the client is considered dead or behind a network partition. (They are the same to us.) Because the clients can pick an arbitrary server to connect to, different servers have different views on if the client is dead or alive.
Our servers use a gossip/epidemic protocol to exchange their view of the clients with each other. This is where the logic comes that one server's data is better than another's. The nice thing about an epidemic protocol is that it is light on the network, yet will still converge.
When a decision is made (in our case, declaring that a client is dead) any of the servers has a tolerably up-to-date table of all the client's heartbeats. Any server is free to make the decision, which we do via a consensus protocol amongst themselves (Paxos or Raft). Note that the server may be wrong with its decision—but it's unlikely that it doesn't have a somewhat up-to-date table but still run a successful Paxos round.
I'm going through a bit of a re-think of large-scale multiplayer games in the age of Facebook applications and cloud computing.
Suppose I were to build something on top of existing open protocols, and I want to serve 1,000,000 simultaneous players, just to scope the problem.
Suppose each player has an incoming message queue (for chat and whatnot), and on average one more incoming message queue (guilds, zones, instances, auction, ...) so we have 2,000,000 queues. A player will listen to 1-10 queues at a time. Each queue will have on average maybe 1 message per second, but certain queues will have much higher rate and higher number of listeners (say, a "entity location" queue for a level instance). Let's assume no more than 100 milliseconds of system queuing latency, which is OK for mildly action-oriented games (but not games like Quake or Unreal Tournament).
From other systems, I know that serving 10,000 users on a single 1U or blade box is a reasonable expectation (assuming there's nothing else expensive going on, like physics simulation or whatnot).
So, with a crossbar cluster system, where clients connect to connection gateways, which in turn connect to message queue servers, we'd get 10,000 users per gateway with 100 gateway machines, and 20,000 message queues per queue server with 100 queue machines. Again, just for general scoping. The number of connections on each MQ machine would be tiny: about 100, to talk to each of the gateways. The number of connections on the gateways would be alot higher: 10,100 for the clients + connections to all the queue servers. (On top of this, add some connections for game world simulation servers or whatnot, but I'm trying to keep that separate for now)
If I didn't want to build this from scratch, I'd have to use some messaging and/or queuing infrastructure that exists. The two open protocols I can find are AMQP and XMPP. The intended use of XMPP is a little more like what this game system would need, but the overhead is quite noticeable (XML, plus the verbose presence data, plus various other channels that have to be built on top). The actual data model of AMQP is closer to what I describe above, but all the users seem to be large, enterprise-type corporations, and the workloads seem to be workflow related, not real-time game update related.
Does anyone have any daytime experience with these technologies, or implementations thereof, that you can share?
#MSalters
Re 'message queue':
RabbitMQ's default operation is exactly what you describe: transient pubsub. But with TCP instead of UDP.
If you want guaranteed eventual delivery and other persistence and recovery features, then you CAN have that too - it's an option. That's the whole point of RabbitMQ and AMQP -- you can have lots of behaviours with just one message delivery system.
The model you describe is the DEFAULT behaviour, which is transient, "fire and forget", and routing messages to wherever the recipients are. People use RabbitMQ to do multicast discovery on EC2 for just that reason. You can get UDP type behaviours over unicast TCP pubsub. Neat, huh?
Re UDP:
I am not sure if UDP would be useful here. If you turn off Nagling then RabbitMQ single message roundtrip latency (client-broker-client) has been measured at 250-300 microseconds. See here for a comparison with Windows latency (which was a bit higher) http://old.nabble.com/High%28er%29-latency-with-1.5.1--p21663105.html
I cannot think of many multiplayer games that need roundtrip latency lower than 300 microseconds. You could get below 300us with TCP. TCP windowing is more expensive than raw UDP, but if you use UDP to go faster, and add a custom loss-recovery or seqno/ack/resend manager then that may slow you down again. It all depends on your use case. If you really really really need to use UDP and lazy acks and so on, then you could strip out RabbitMQ's TCP and probably pull that off.
I hope this helps clarify why I recommended RabbitMQ for Jon's use case.
I am building such a system now, actually.
I have done a fair amount of evaluation of several MQs, including RabbitMQ, Qpid, and ZeroMQ. The latency and throughput of any of those are more than adequate for this type of application. What is not good, however, is queue creation time in the midst of half a million queues or more. Qpid in particular degrades quite severely after a few thousand queues. To circumvent that problem, you will typically have to create your own routing mechanisms (smaller number of total queues, and consumers on those queues are getting messages that they don't have an interest in).
My current system will probably use ZeroMQ, but in a fairly limited way, inside the cluster. Connections from clients are handled with a custom sim. daemon that I built using libev and is entirely single-threaded (and is showing very good scaling -- it should be able to handle 50,000 connections on one box without any problems -- our sim. tick rate is quite low though, and there are no physics).
XML (and therefore XMPP) is very much not suited to this, as you'll peg the CPU processing XML long before you become bound on I/O, which isn't what you want. We're using Google Protocol Buffers, at the moment, and those seem well suited to our particular needs. We're also using TCP for the client connections. I have had experience using both UDP and TCP for this in the past, and as pointed out by others, UDP does have some advantage, but it's slightly more difficult to work with.
Hopefully when we're a little closer to launch, I'll be able to share more details.
Jon, this sounds like an ideal use case for AMQP and RabbitMQ.
I am not sure why you say that AMQP users are all large enterprise-type corporations. More than half of our customers are in the 'web' space ranging from huge to tiny companies. Lots of games, betting systems, chat systems, twittery type systems, and cloud computing infras have been built out of RabbitMQ. There are even mobile phone applications. Workflows are just one of many use cases.
We try to keep track of what is going on here:
http://www.rabbitmq.com/how.html (make sure you click through to the lists of use cases on del.icio.us too!)
Please do take a look. We are here to help. Feel free to email us at info#rabbitmq.com or hit me on twitter (#monadic).
My experience was with a non-open alternative, BizTalk. The most painful lesson we learnt is that these complex systems are NOT fast. And as you figured from the hardware requirements, that translates directly into significant costs.
For that reason, don't even go near XML for the core interfaces. Your server cluster will be parsing 2 million messages per second. That could easily be 2-20 GB/sec of XML! However, most messages will be for a few queues, while most queues are in fact low-traffic.
Therefore, design your architecture so that it's easy to start with COTS queue servers and then move each queue (type) to a custom queue server when a bottleneck is identified.
Also, for similar reasons, don't assume that a message queue architecture is the best for all comminication needs your application has. Take your "entity location in an instance" example. This is a classic case where you don't want guaranteed message delivery. The reason that you need to share this information is because it changes all the time. So, if a message is lost, you don't want to spend time recovering it. You'd only send the old locatiom of the affected entity. Instead, you'd want to send the current location of that entity. Technology-wise this means you want UDP, not TCP and a custom loss-recovery mechanism.
FWIW, for cases where intermediate results are not important (like positioning info) Qpid has a "last-value queue" that can deliver only the most recent value to a subscriber.