Qualified estimate of the number of peers in a very-large-area simulation? - performance

This question asks for an indication/hunch. I realize that it may have been discussed before and that there is no good, scientific answer; nevertheless I seek experienced/qualified opinions, as there are no definite answers to be found. An indication will be valuable as a clue, so I ask the community to allow a bit of fuzziness.
Background:
Consider a very-large-area 3D simulation
with n participants (peers, people behind NAT) distributed over multiple cities.
where each participant is seen as one "moving object" in the simulation (hence each moving object is owned by a peer).
where each peer shall see all other moving objects correctly (i.e. positional updates are needed).
(The entire simulation is larger, so we now focus on one single blob, and consider it to be the entire "world").
Scale:
World/blob size 10x10 kilometers (almost flat world).
Object size: Length max 10 meters
(We omit things like occlusion, optimisations, balancing, etc. Assume that everything there is needs to be seen and updated.)
The nature of "moving object":
it is physically/positionally restless (compare to a boat in big
waves).
its movement must be synced to all peers (but individual syncs do not need to be simultaneous with other syncs).
if X sees one, but does not own it, it will behave well (deterministically, by X's local physics calculation) for maybe 1 second, but after that it will diverge (due to different frame rates) and need a positional update (a UDP packet) from its owner.
From a peer's point of view:
He needs to update n-1 other peers.
He needs to receive updates from n-1 other peers.
The positional updates are the critical ones, so focus only on those. One update is ca 20-30 doubles, ca. 200 bytes. Consider UDP only.
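To make that number concrete, here is a minimal Python sketch of what one such update could look like when packed for UDP. The field layout is purely an assumption for illustration; only the "20-30 doubles, ca. 200 bytes" figure comes from the question.

```python
import struct

# Hypothetical positional update: the field layout is an assumption, not the
# asker's actual format. 25 doubles = 200 bytes of payload, matching the
# "20-30 doubles, ca. 200 bytes" estimate above.
UPDATE_FORMAT = "<25d"  # little-endian, 25 IEEE-754 doubles

def pack_update(object_id, position, velocity, orientation, angular_velocity, timestamp):
    """Pack one moving object's state into a fixed-size UDP payload."""
    fields = (
        [float(object_id), timestamp]
        + list(position)          # x, y, z            (3 doubles)
        + list(velocity)          # vx, vy, vz         (3 doubles)
        + list(orientation)       # quaternion w,x,y,z (4 doubles)
        + list(angular_velocity)  # wx, wy, wz         (3 doubles)
    )
    fields += [0.0] * (25 - len(fields))  # padding / room for extra state
    return struct.pack(UPDATE_FORMAT, *fields)

payload = pack_update(42, (1.0, 2.0, 3.0), (0.1, 0.0, 0.0),
                      (1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.01), 1234.5)
print(len(payload))  # 200 bytes, before the 28-byte UDP/IPv4 header overhead
```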
As I see it, there are two options. The first is serverless, where everything works solely via peer-to-peer communication. The second is having a server (one, for now) in the middle.
1. Serverless, p2p
Each peer must talk with many other peers. One problem is that "Nagle'ing" is useless: first, because all endpoints are different; second, because the local data changes from frame to frame, and there is no point in accumulating multiple frames' data to send in a larger packet more sparsely, since the oldest frames' data would be outdated. An advantage, however, is not being dependent on a server.
2. Server-supported
Each peer sends its info to a high-performance, high-bandwidth server, which is better able to receive and distribute to all peers at a fast rate. Similarly, any peer would receive all peers' data from one endpoint only: the server.
Naturally, each peer runs a game loop.
Question: hopefully based on some kind of experience, what number would you throw out as a maximum functional number of peers for case 1 and for case 2? Thanks.

It is difficult to quantify, but for such a level of all-to-all synchronization I would recommend centralized control.
In p2p mode each peer would send n-1 and receive n-1 packets each pseudo-round. In centralized mode they would receive n-1 packets, but would send only 1, spending less time on this task. So centralized mode seems to be more scalable.
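A rough back-of-the-envelope sketch of that asymmetry, using the ~200-byte updates from the question and an assumed 20 Hz update rate (the rate is my assumption, not the asker's):

```python
# Back-of-the-envelope bandwidth per peer; the 20 Hz update rate is an
# assumption, the 200-byte update size comes from the question.
UPDATE_BYTES = 200
RATE_HZ = 20

def per_peer_bandwidth(n_peers, mode):
    """Return (upload, download) in bytes/sec for one peer."""
    others = n_peers - 1
    if mode == "p2p":
        up = others * UPDATE_BYTES * RATE_HZ    # send own state to every other peer
    else:  # server-supported
        up = UPDATE_BYTES * RATE_HZ             # send own state once, to the server
    down = others * UPDATE_BYTES * RATE_HZ      # receive everyone else's state
    return up, down

for n in (10, 50, 200):
    for mode in ("p2p", "server"):
        up, down = per_peer_bandwidth(n, mode)
        print(f"{mode:>6} n={n:>3}: up {up/1e6:6.2f} MB/s, down {down/1e6:6.2f} MB/s")
```

Note that the download side is the same in both modes; centralization mainly relieves each peer's upload path and packet-sending overhead.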
A server can check whether update messages are consistent before delivering them. In p2p, each peer would have to deal with unstable or disconnected peers, which could be better managed by a server.
In centralized mode the update interval has to be chosen more carefully, because clients are more likely to experience higher latencies, as each packet has to travel to the server and then back to the clients. Choosing the best server for the clients is one thing to consider.
Combining packets could make the information traverse the network more efficiently, but since data going stale is an issue, try to make sure each packet is as small as possible; transmission time is smaller in that case.

Related

which consensus algorithm is synchronous in nature?

There are different consensus algorithms used in permissioned blockchains, such as:
PAXOS
RAFT
Byzantine General Model
Which of these consensus algorithms are synchronous and which are asynchronous, and why? Please explain in detail.
Thanks
I am not an expert on distributed systems, but I will still try to answer your question.
In distributed systems, people use an underlying model that assumes some properties about time (“how long will it take for this message to arrive?”) and some properties about the types of faults (“how can nodes in the protocol do the wrong thing?”).
There are three main types of timing models usually used for distributed systems: the synchronous model, the asynchronous model, and the partially synchronous model. Each of these models makes some guarantees about the length of time (“latency”) that can occur between the exchange of messages amongst nodes in a given round of the protocol execution. This categorization is important because in the distributed setting a single node cannot distinguish between a peer node that has failed and a peer node that is just taking a long time to respond.
In the synchronous model, there is some maximum value (“upper bound”) T on the time between when a node sends a message and when you can be certain that the receiving node hears the message. You also have an upper bound P on the relative difference in speed between nodes (so you can account for machines with slow processors).
In the asynchronous model, we remove both upper bounds T and P. Messages can take arbitrarily long to reach peers and each node can take an arbitrary amount of time to respond. When we say arbitrary, we include “infinity” meaning that it takes forever for some event to occur.
The partially synchronous model is a mix of the two: upper bounds exist for T and P, but the protocol designer does not know them, and the task is designing mechanisms that still come to consensus in light of this fact. In practice, protocol implementers can achieve systems resembling this model given the realistic characteristics of modern networks/machines (messages usually get where they are going) and the use of tactics like timeouts to indicate when a node should retry sending a message.
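As a minimal illustration of that timeout tactic (my sketch, not from any protocol spec; the host, port, and timeout values are placeholders), a sender under partial synchrony can only guess a delay bound and back off when the guess turns out to be too small:

```python
import socket
import time

def send_with_retries(payload, addr=("10.0.0.2", 9000), retries=5, timeout=0.2):
    """Send over UDP and wait for an acknowledgement, retrying with backoff."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for attempt in range(retries):
        sock.settimeout(timeout * (2 ** attempt))  # exponential backoff on the guessed bound
        sock.sendto(payload, addr)
        try:
            reply, _ = sock.recvfrom(4096)         # wait for an acknowledgement
            return reply
        except socket.timeout:
            continue                               # slow or crashed? we cannot tell
    return None  # give up; the peer is only *suspected* to have failed
```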
Keeping the above facts in mind, both Paxos and Raft belong to the partially synchronous model.
The Byzantine Generals’ Problem is a classic problem faced by any distributed computer network. The aim is to maintain the same state on all participating nodes in the presence of malicious nodes.
In distributed systems, there is a collection of hard problems that you constantly need to deal with.
Things fail. You can never count on anything being reliable. Even if you have
perfectly bug-free software, and hardware that never breaks, you’ve still got
to deal with the fact that network connections can break, or messages within a
network can get lost, or that some bozo might sever your network connection
with a bulldozer. (That really happened while I was at Google!)
Given (1), you can never rely on one copy of anything, because that copy might
become unavailable due to a failure. So you need to keep multiple copies, and
those copies need to be consistent – meaning that at any time, all of the
copies agree about their contents.
There’s no way to maintain a single completely consistent view of time between
multiple computers. Due to inconsistencies in individual machine performance,
and variable network delays, variable storage latency, and several other
factors, there’s no canonical way of saying that for two events X and Y, “X
happened before Y”. What that means is that when you try to maintain a consistent set of data, you can’t just say “Run all of the events in order”, because while one server maintaining one copy might “know” that X happened before Y, another server maintaining another copy might be just as certain that Y happened before X.
In short: everything can fail at any time; after a failure, participants can recover and rejoin the system; and no part of the system acts in an actively adversarial way (Byzantine failures, which may be caused by malware, are the case where that last assumption does not hold).
To solve this problem we have consensus algorithms, whose aim is to make all participants agree on the same state.
Consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers is available.
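For concreteness, the usual majority-quorum arithmetic (a generic sketch, not tied to any particular algorithm): with n = 2f + 1 servers, any two majorities intersect, so the system tolerates f crashed servers and still makes progress.

```python
def quorum_size(n):
    return n // 2 + 1          # smallest majority of n servers

def max_crash_faults(n):
    return (n - 1) // 2        # crashes tolerated while a majority survives

for n in (3, 5, 7):
    print(f"n={n}: quorum={quorum_size(n)}, tolerated crashes={max_crash_faults(n)}")
# n=3: quorum=2, tolerated crashes=1
# n=5: quorum=3, tolerated crashes=2
# n=7: quorum=4, tolerated crashes=3
```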
Paxos and Raft are consensus algorithms that solve consensus under crash (non-Byzantine) faults in distributed networks, public or private; they do not by themselves solve the Byzantine Generals problem.

Advice on pubsub topic division based on geohashes for ably websocket connection service

My question concerns the following use case:
Use case actors
User A: The user who sets a broadcast region and views stream with live posts.
User B: The first user who sends a broadcast message from within the broadcast region set by user A.
User C: The second user who sends a broadcast message from within the broadcast region set by user A.
Use case description
User A selects a broadcast region within whose boundaries (radius) (s)he wants to receive live broadcast messages.
User A opens the livefeed and requests an initial set of livefeed items.
User B broadcasts a message from within the broadcast region of user A while user A’s livefeed is still open.
A label with 1 new livefeed item appears at the top of User A’s livefeed while it is open.
As user C publishes another livefeed post from within the selected broadcast region from user A, the label counter increments.
User A receives a notification similar to this example from Facebook (screenshot omitted).
The solution I thought of applying (and which I think PubNub uses) is to create a topic per geohash.
In my case that would mean that every broadcast message needs to be published to the topic of the geohash it was sent from, and clients (app / website users) would consume that geohash topic through a websocket if it fell within the range of the defined area (radius). Ably seems to provide this kind of scalable service using websockets.
Simplified, I guess it would be something like this (diagram omitted).
So this means that a geohash needs to be extracted from the location the broadcast message is sent from. This geohash should be granular enough that the receiving user can set a broadcast region that is more or less accurate. (I.e. the geohash should have enough accuracy if we want to allow users to define a broadcast region within which to receive live messages, which means one should expect quite a large number of topics if we decided to scale.)
Option 2 would be to create topics for a geohash that has a less specific granularity (covering a larger area), and let clients handle the accuracy based on latlng values that are sent along with the message.
The client would then decide whether or not to drop messages. However, this means more messages are sent (more overhead), and a higher cost.
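A minimal sketch of what that client-side filtering could look like (the message fields, coordinates, and function names are made up for illustration; the distance check is a plain haversine):

```python
import math

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def should_display(message, user_lat, user_lng, radius_m):
    """Drop broadcasts that fall outside the user's chosen broadcast region."""
    return haversine_m(message["lat"], message["lng"], user_lat, user_lng) <= radius_m

msg = {"lat": 50.8467, "lng": 4.3525, "text": "hello from Brussels"}
print(should_display(msg, 50.8503, 4.3517, radius_m=1_000))  # True (~400 m away)
```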
I don't have experience with this kind of architecture, and question the viability / scalability of this approach.
Can you think of an alternative solution to achieve the desired result, or provide more insight into how to solve this kind of problem overall? (I also considered using a regular request-response flow, but this means spamming the server, which also doesn't seem like a very good solution.)
I actually checked.
Given a region of 161.4 km² (like the Brussels region), geohash cell sizes by string length are as follows:
1 ≤ 5,000km × 5,000km
2 ≤ 1,250km × 625km
3 ≤ 156km × 156km
4 ≤ 39.1km × 19.5km
5 ≤ 4.89km × 4.89km
6 ≤ 1.22km × 0.61km
7 ≤ 153m × 153m
8 ≤ 38.2m × 19.1m
9 ≤ 4.77m × 4.77m
10 ≤ 1.19m × 0.596m
11 ≤ 149mm × 149mm
12 ≤ 37.2mm × 18.6mm
Given that we would allow users an inaccuracy of up to 153 m (on the region to which they may want to subscribe to receive local broadcast messages), this would already require a number of topics that is far too large even to cover just the Brussels region (see the rough count below).
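A rough order-of-magnitude count (boundary cells ignored), based on the 7-character cell size from the table above:

```python
# How many ~153 m x 153 m geohash cells (length 7) tile a 161.4 km^2 region?
region_km2 = 161.4
cell_km2 = 0.153 * 0.153
print(round(region_km2 / cell_km2))  # ~6,900 cells, i.e. ~6,900 potential topics for Brussels alone
```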
So I'm still a bit stuck at this level currently.
1. PubNub
PubNub is currently the only service that offers an out-of-the-box geohash pub-sub solution over websockets, but their pricing is extremely high (500 connected devices cost about $49, 20k devices cost $799). UPDATE: PubNub has updated its pricing, now with unlimited devices; website updates are coming soon.
Pubnub is working on their pricing model because some of their customers were paying a lot for unexpected spikes in traffic.
However, it will not be a viable solution for a generic broadcasting messaging app that is meant to be open for everybody, and for which traffic is therefore very highly unpredictable.
This is a pity, since this service would have been the perfect solution for us otherwise.
2. Ably
Ably offers a pubsub system to stream data to clients over websockets for custom channels. Channels are created dynamically when a client attaches itself in order to either publish or subscribe to that channel.
The main problem here is that:
If we want high geohash accuracy, we need a high number of channels and hence we have to pay more;
If we go with low geohash accuracy, there will be a lot of redundant messaging:
Let's say that we take a channel that is represented by a geohash of 4 characters, spanning a geographical area of 39.1 x 19.5 km.
Any post that gets sent to that channel, would be multiplexed to everybody within that region who is currently listening.
However, let's say that our app allows for a maximum radius of 10 km, and half of the connected users have their setting at a 1 km radius.
This means that all posts outside that 1 km radius will be multiplexed to these users unnecessarily, and will just be dropped without having any further use.
We should also take into account the scalability of this approach. For every geohash that either producer or consumer needs, another channel will be created.
It is definitely more expensive to have an app that requires topics based on geohashes worldwide, than an app that requires only theme-based topics.
That is, with worldwide adoption the number of topics increases dramatically, and hence so will the price.
Another consideration is that our app requires an additional number of channels:
By geohash and group: Our app allows the possibility to create geolocation based groups (which would be the equivalent of Twitter like #hashtags).
By place
By followed users (premium feature)
Despite this, there are a few optimistic considerations for this approach:
Streaming is only required when the newsfeed is active:
when the user has a browser window open with our website +
when the user is on a mobile device, and actively has the related feed open
Further optimisation can be done, e.g. only start streaming from 10 to 20 seconds after the feed is refreshed
Streaming by place / followed users may have high traffic depending on current activity, but many place channels will be idle as well
A very important note in this regard is how Ably bills its consumers, which can be used to our full advantage:
A channel is opened when any of the following happens:
A message is published on the channel via REST
A realtime client attaches to the channel. The channel remains active for the entire time the client is attached to that channel, so
if you connect to Ably, attach to a channel, and publish a message but
never detach the channel, the channel will remain active for as long
as that connection remains open.
A channel that is open will automatically close when all of the
following conditions apply:
There are no more realtime clients attached to the channel, and at least
two minutes have passed since the last message was published. We keep
channels alive for two minutes to ensure that we can provide
continuity on the channel as part of our connection state recovery.
As an example, if you have 10,000 users, and at your busiest time of
the month there is a single spike where 500 customers establish a
realtime connection to Ably and each attach to one unique channel and
one global shared channel, the peak number of channels would be the
sum of the 500 unique channels per client and the one global shared
channel i.e. 501 peak channels. If throughout the month each of those
10,000 users connects and attaches to their own unique channel, but
not necessarily at the same time, then this does not affect your peak
channel count as peak channels is the concurrent number of channels
open at any point of time during that month.
Optimistic conclusion
The most important conclusion is that we should consider that this feature may not be as crucial as we believe it is for a first version of the app.
Although Twitter, Facebook, etc. offer this feature of receiving live updates (and users have grown to expect it), an initial beta of our app on a limited scale can work without it, i.e. the user has to refresh in order to receive new updates.
During a first launch of the app, statistics can be gathered to gain more insight into detailed user behaviour. This will enable us to build more solid infrastructural and financial reflections based on factual data.
Putting aside the question of Ably, Pubnub and a DIY solution, the core of the question is this:
Where is message filtering taking place?
There are three possible solutions:
The Pub/Sub service.
The Server (WebSocket connection handler).
Client side (the client's device).
Since this is obviously a mobile-oriented approach, client-side message filtering is extremely crude, as it increases data consumption by the client while much of the data might be irrelevant.
Client side filtering will also increase battery consumption and will likely result in lower acceptance rates by clients.
This leaves pub/sub filtering (channel names / pattern matching) and server-side filtering.
Pub/Sub channel name filtering
A single pub/sub service serves a number of servers (if not all of them), making it a very expensive resource (relative to the resources we have at hand).
Using channel names to filter messages would be ideal - as long as the filtering is cheap (using exact matches with channel name hash mapping).
However, pattern matching (when subscribing to channels with inexact names, such as "users.*") is very expensive compared to exact matching.
This means that Pub/Sub channel name filtering can't be used to filter all the messages without overloading the pub/sub system.
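As an illustration using Redis as the broker (I mention my Redis-based setup further down; the channel names here are made up), the cheap path is an exact subscription, while a pattern subscription forces the broker to test every published message against every registered pattern:

```python
import redis  # assumes the redis-py client and a local Redis instance

r = redis.Redis(host="localhost", port=6379)
p = r.pubsub(ignore_subscribe_messages=True)

# Exact channel name: a cheap lookup on the broker for every publish.
p.subscribe("geo:u151p")     # a specific geohash cell, name is illustrative

# Pattern subscription: every published message is matched against every
# registered pattern, which is the expensive path described above.
p.psubscribe("geo:u151*")

r.publish("geo:u151p", "new broadcast in this cell")
for _ in range(2):
    print(p.get_message(timeout=1))  # the message arrives via both subscriptions
```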
Server side filtering
Since a server accepts WebSocket connections and bridges between the WebSocket and the pub/sub service, it's in an ideal position to filter the messages.
However, we don't want the server to process all the messages for all the clients for each connection, as this is an extreme duplication of effort.
Hybrid solution
A classic solution would divide the earth into manageable sections (1 sq. km per section will require 510.1 million unique channel names for full coverage... but I would suggest that the 70% ocean space should be neglected).
Busy sections might be subdivided (NYC might require a section per 250 sq meters rather than 1 sq kilometer).
This allows publishers to publish to exact channel names and subscribers to subscribe to exact channel names.
Publishers might need to publish to more than one channel and subscribers might need to subscribe to more than one channel, depending on their exact location and the grid's borders.
This filtering scheme will filter much, but not all.
The server node will need to look into the message, review its exact geolocation, and filter messages before deciding if they should be sent along the WebSocket connection to the client.
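A hedged sketch of that hybrid scheme (the grid size, channel naming, and flat-earth distance check are all simplifying assumptions of mine, not a prescribed design):

```python
import math

CELL_DEG = 0.01                 # grid step in degrees
CELL_M = 111_000 * CELL_DEG     # ~1.1 km of latitude per cell

def cell_channel(lat, lng):
    """Exact channel name for the grid cell containing (lat, lng)."""
    return f"geo:{math.floor(lat / CELL_DEG)}:{math.floor(lng / CELL_DEG)}"

def cells_for_radius(lat, lng, radius_m):
    """All cells a subscriber must attach to so its radius is fully covered."""
    span = math.ceil(radius_m / CELL_M)
    base_lat, base_lng = math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG)
    return {f"geo:{base_lat + i}:{base_lng + j}"
            for i in range(-span, span + 1) for j in range(-span, span + 1)}

def server_should_forward(msg_lat, msg_lng, sub_lat, sub_lng, radius_m):
    """Final, exact filter done on the WebSocket server, not the pub/sub broker."""
    dlat = (msg_lat - sub_lat) * 111_000
    dlng = (msg_lng - sub_lng) * 111_000 * math.cos(math.radians(sub_lat))
    return math.hypot(dlat, dlng) <= radius_m

print(cell_channel(50.8467, 4.3525))                   # 'geo:5084:435'
print(len(cells_for_radius(50.8467, 4.3525, 2_000)))   # 25 cells for a 2 km radius
```

The pub/sub layer only ever sees exact channel names (cheap), while the per-subscriber radius check stays on the comparatively cheap server nodes.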
Why the Hybrid Solution?
This allows the system to scale with relative ease.
Since server nodes are (by design) cheaper than the pub/sub service, they could be used to handle the exact location filtering (the heavy work).
At the same time, the strength of the pub/sub system can be used to minimize the server's workload and filter the obvious mis-matches.
Pubnub vs. Ably?
I don't know. I didn't use either of them. I worked with Redis and implemented my own pub/sub solution.
I assume they are both great and it's really up to your needs.
Personally I prefer the DIY approach when it comes to customized or complex situations. IMHO, this seems like it would fall into the DIY category if I were to implement it.

Optimal retransmission algorithm for a broadcast channel

I have a number of microcontrollers which can communicate over a broadcast medium (in this case, IR). Each node wants to announce its presence to all the other nodes on a regular basis, but since it's a broadcast medium, two nodes transmitting at the same time will create a collision, and neither message will be transmitted correctly.
To further complicate things, when a node receives an invalid message, it doesn't seem practical to reliably determine if this is due to a collision or due to background noise or weak signal. Also, while node visibility will usually be reflexive (A seeing B means B can see A), it will usually be the case that not all nodes can see all other nodes.
Leaving aside external interference for a moment, it seems like a rational approach is to create timeslots, with each node transmitting in each time slot with a small (and ideally similar between nodes) probability. If there are n nodes that each transmit with probability p, then no message is transmitted (1 - p)^n of the time; exactly one message is transmitted n * p * (1 - p)^(n-1) of the time, and the remainder of the time there is a collision. The maximum incidence of successful transmissions occurs when each of the n nodes transmits 1/nth of the time, and this results in a fairly stable ~37% successful transmissions and ~26% collisions; these values change only slightly with increasing numbers of nodes.
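Evaluating those expressions at p = 1/n (a quick sketch; both the idle and the success probability tend towards 1/e as n grows):

```python
def slot_outcome_probs(n, p):
    """Per-slot probabilities of: nobody transmits, exactly one transmits, collision."""
    idle = (1 - p) ** n
    success = n * p * (1 - p) ** (n - 1)
    return idle, success, 1 - idle - success

for n in (2, 5, 10, 50):
    idle, success, collision = slot_outcome_probs(n, 1 / n)
    print(f"n={n:>2}: idle={idle:.2f} success={success:.2f} collision={collision:.2f}")
```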
Given that, it would seem we could observe the rate of successful transmissions and/or collisions, and adjust our own transmission rate to try and force them towards their expected values. What I'm not sure about is what the best feedback mechanism to achieve this is, such that everyone ends up with similar probabilities. This also doesn't account for external interference, which will cause us to keep decreasing our transmission rate in a doomed effort to avoid collisions that don't exist.
What is the optimal algorithm - either a refinement of the above or a totally different approach - to maximize the proportion of announcement messages each node can receive without collision?
What you want is actually a wireless mesh network, which is a whole research field by itself. Apart from that, acknowledgement messages are a good idea. You can piggyback them on the announcements themselves if packet overhead is a concern. They allow you to distinguish between collisions and background noise by observing how many neighbors answered, and they allow basic rate limiting by waiting until (most) neighbors acknowledge each transmission (stop-and-go). This of course gets even more complicated if the controllers are very mobile.
It will be very hard (and probably not desirable) to maintain an equal update rate over the whole network, because transmission quality can vary wildly in different areas and at different points in time.

When to use Paxos (real practical use cases)?

Could someone give me a list of real use cases of Paxos? That is, real problems that require consensus as part of a bigger problem.
Is the following a use case of Paxos?
Suppose there are two clients playing poker against each other on a poker server. The poker server is replicated. My understanding of Paxos is that it could be used to maintain consistency of the in-memory data structures that represent the current hand of poker. That is, ensure that all replicas have the exact same in-memory state of the hand.
But why is Paxos necessary? Suppose a new card needs to be dealt. Each replica running the same code will generate the same card if everything went correctly. Why can't the clients just request the latest state from all the replicated servers and choose the card that appears the most? That way, if one server had an error, the client would still get the correct state by just choosing the majority.
You assume all the servers are in sync with each other (i.e., have the same state), so when a server needs to select the next card, each of the servers will select the exact same card (assuming your code is deterministic).
However, your servers' state also depends on the user's actions. For example, if a user decides to raise by $50, your server needs to store that info somewhere. Now, suppose that your server replied "ok" to the web client (I'm assuming a web-based poker game), and then the server crashed. Your other servers might not have the information regarding the $50 raise, and your system will be inconsistent (in the sense that the client thinks that the $50 raise was made, while the surviving servers are oblivious to it).
Notice that majority won't help here, since the data is lost. Moreover, suppose that instead of the main server crashing, the main server plus another one got the $50 raise data. In this case, using majority could be even worse: if you get a response from the two servers with the data, you'll think the $50 raise was performed. But if one of them fails, then you won't have a majority, and you'll think that the raise wasn't performed.
In general, Paxos can be used to replicate a state machine, in a fault tolerant manner. Where "state machine" can be thought of as an algorithm that has some initial state, and it advances the state deterministically according to messages received from the outside (i.e., the web-client).
More properly, Paxos should be considered a distributed log; you can read more about it here: Understanding Paxos – Part 1
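To make the "replicated state machine" idea concrete, here is a toy sketch (the poker-flavoured commands are purely illustrative): as long as every replica applies the same log of commands in the same order, which is exactly the agreement Paxos provides, the replicas end up in identical states.

```python
class PokerTable:
    """A deterministic state machine: state advances only via apply()."""
    def __init__(self):
        self.pot = 0
        self.folded = set()

    def apply(self, command):
        kind, player, amount = command
        if kind == "raise":
            self.pot += amount
        elif kind == "fold":
            self.folded.add(player)

# The agreed, totally ordered log (what a consensus layer would produce).
log = [("raise", "alice", 50), ("raise", "bob", 50), ("fold", "carol", 0)]

replica_a, replica_b = PokerTable(), PokerTable()
for cmd in log:
    replica_a.apply(cmd)
    replica_b.apply(cmd)

assert replica_a.pot == replica_b.pot == 100   # identical state on every replica
```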
Update 2018:
MySQL high availability uses Paxos: https://mysqlhighavailability.com/the-king-is-dead-long-live-the-king-our-homegrown-paxos-based-consensus/
Real world example:
Cassandra uses Paxos to ensure that clients connected to different cluster nodes can safely perform write operations by adding "IF NOT EXISTS" to them. Cassandra has no master node, so two conflicting operations can be issued concurrently at multiple nodes. When the if-not-exists syntax is used, the Paxos algorithm orders operations between machines to ensure only one succeeds. This can then be used by clients to store authoritative data with an expiration lease. It will work as long as a majority of Cassandra nodes are up: if you define the replication factor of your keyspace to be 3, then 1 node can fail; if 5, then 2 can fail; and so on.
For normal writes Cassandra allows multiple conflicting writes to be accepted by different nodes, which may be temporarily unable to communicate. In that case it doesn't use Paxos, so it can lose data when two writes occur at the same time for the same key. There are special insert-only data structures built into Cassandra that won't lose data.
Poker and Paxos:
As other answers note, poker is turn-based and has rules. If you allow one master and many replicas, then the master arbitrates the next action. Let's say a user first clicks the "check" button and then changes their mind and clicks "fold". Those are conflicting commands; only the first should be accepted. The browser should not let them press the second button: it should disable it once they have pressed the first. Since money is involved, the master server should also enforce the rules and only allow one action per player per turn. The problem comes when the master crashes during the game: which replica can become master, and how do you enforce that only one replica becomes master?
One way to handle choosing a new master is to use an external strongly consistent service. We can use Cassandra to create a lease for the master node. The replicas can time out on the master and attempt to take the lease. As Cassandra uses Paxos it is fault-tolerant; you can still read or update the lease even if some Cassandra nodes crash.
In the above example the poker master and replicas are eventually consistent. The master can send heartbeats so the replicas know that they are still connected to the master. That is fast, as messages flow in one direction. When the master crashes there may be race conditions between replicas trying to become the master. Using Paxos at that point gives you strong consistency on the outcome of which node is now the master. This requires additional messages between nodes to ensure a consensus outcome of a single master.
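A hedged sketch of that lease idea, again with the DataStax Python driver; the table name, TTL, and timings are illustrative, and production code would also need to renew the lease while the master is alive:

```python
import time
from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("demo")   # keyspace name is assumed
NODE_ID = "replica-2"
LEASE_SECONDS = 10

def try_become_master():
    # Replicas race to insert the lease row; IF NOT EXISTS runs Paxos under
    # the hood, and the TTL lets a crashed master's lease expire on its own.
    result = session.execute(
        f"INSERT INTO master_lease (game_id, owner) VALUES (%s, %s) "
        f"IF NOT EXISTS USING TTL {LEASE_SECONDS}",
        ("table-7", NODE_ID),
    )
    return result.one()[0]          # [applied]: did we win the lease?

while not try_become_master():
    time.sleep(LEASE_SECONDS / 2)   # wait for the current lease to expire
print(f"{NODE_ID} is now the master")
```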
Real life use cases:
The Chubby lock service for loosely-coupled distributed systems
Apache ZooKeeper
Paxos is used for WAN-based replication of Subversion repositories and high availability of the Hadoop NameNode by the company I work for (WANdisco plc.)
In the case you describe, you're right: Paxos isn't really necessary. A single central authority can generate a permutation for the deck and distribute it to everyone at the beginning of the hand. In fact, for a poker game in general, where there's a strict turn order and a single active player, I can't see a sensible situation in which you might need to use Paxos, except perhaps to elect the central authority that shuffles the decks.
A better example might be a game with simultaneous moves, such as Jeopardy. Paxos in this situation would allow all the servers to decide together what sequence a series of closely timed events (such as buzzer presses) occurred in, such that all the servers come to the same conclusion.

Determining the fastest BitTorrent peer connection

How does one determine which of the peers you are connected to has the fastest connection (upload rate)?
Does the peer's actual connection dominate who is fastest, or will the peer who needs the most chunks upload the fastest because fewer people are downloading from him?
I want to write an algorithm which takes all the peers in the peer list returned from the tracker and determines which peers are closer, using a ping and timing the response, or some other way.
Thanks
A ping (ICMP echo request/reply) will give you the latency of a peer, but not the available bandwidth the peer has. You want the bandwidth since TCP is good at doing bandwidth*delay products and figuring out how to make a connection fast, even if it roundtrips a satellite.
What you do is connect to all of them; having 40 peers connected is not uncommon. Then you decide which to unchoke based on their current rates towards you (until you become a seeder). It also has to be fairly dynamic, since available bandwidth changes over time. The best advice I can give is to read
http://www.bittorrent.org/bittorrentecon.pdf
which gives the general idea of how to implement the economics. But many clients do different things than the paper, so reading code is another option.
So: you want to measure bandwidth, not latency; hence, ping is the wrong tool for the job. Measuring bandwidth is most easily done by tracking the rate at which you exchange data with each peer.
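A small sketch of such rate tracking (window length, peer ids, and sample sizes are arbitrary; this is my illustration, not the algorithm from the BitTorrent paper):

```python
import time
from collections import defaultdict, deque

WINDOW_S = 20.0   # rolling window over which rates are averaged

class RateTracker:
    def __init__(self):
        self.samples = defaultdict(deque)      # peer -> deque of (timestamp, bytes)

    def record(self, peer, nbytes):
        """Call whenever a block of data arrives from (or is sent to) a peer."""
        self.samples[peer].append((time.monotonic(), nbytes))

    def rate(self, peer):
        """Average bytes/sec over the rolling window."""
        now = time.monotonic()
        window = self.samples[peer]
        while window and now - window[0][0] > WINDOW_S:
            window.popleft()                    # drop samples older than the window
        return sum(n for _, n in window) / WINDOW_S

    def best_peers(self, k=4):
        """Candidates to unchoke: the peers with the best recent rates."""
        return sorted(self.samples, key=self.rate, reverse=True)[:k]

tracker = RateTracker()
tracker.record("peer-a", 16_384)
tracker.record("peer-b", 262_144)
print(tracker.best_peers(k=1))                 # ['peer-b']
```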
I think that the choking/unchoking algorithm and selecting peers to unchoke is one of the hardest parts to get right in a client. It is best solved with pen, paper and brain, not by sitting in front of the computer writing code.
