BFT vs. PBFT vs. BA consensus algorithms

I've been digging into some of the most widely used consensus algorithms in permissioned blockchains and I've gotten stuck.
I understood that BFT (Byzantine Fault Tolerance) is a property of some algorithms and pBFT is an algorithm itself. Is that right?
This rule that 2/3 of the nodes in the network are enough to reach consensus, is it for all BFT algorithms or just pBFT?
Also what is the difference between Byzantine Agreement and BFT?
If you could provide a reliable source of information, I'd be thankful.

I understood that BFT (Byzantine Fault Tolerance) is a property of some algorithms and pBFT is an algorithm itself. Is that right?
Yes.
This rule that 2/3 of the nodes in the network are enough to reach consensus, is it for all BFT algorithms or just pBFT?
Algorithms for Byzantine agreement can tolerate at most f failures among 3f+1 nodes, and a given algorithm may not even tolerate that many. The reasoning is that if the f Byzantine nodes stop participating, the remaining n - f nodes must still be able to reach consensus; but if message delays temporarily hide f good nodes, the good nodes that remain reachable must still outnumber the Byzantine ones, so n - f >= 2f + 1, and therefore n >= 3f + 1.
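As a quick illustration of that arithmetic, here is a small sketch (the function names are mine, not from any particular protocol):

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3

def quorum_size(n: int) -> int:
    """n - f nodes must be able to proceed on their own; with n = 3f + 1
    this equals 2f + 1, i.e. the good nodes outnumber the faulty ones."""
    return n - max_byzantine_faults(n)

for n in (4, 7, 10, 100):
    print(n, max_byzantine_faults(n), quorum_size(n))
# 4 -> f=1, quorum 3; 7 -> f=2, quorum 5; 10 -> f=3, quorum 7; 100 -> f=33, quorum 67
```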
Also what is the difference between Byzantine Agreement and BFT?
The former is a distributed computing problem, more often referred to as consensus. The latter is a property of a protocol.

Related

Leader election process with Timestamps

I am trying to create a leader election protocol to my own specification, but I'm failing. Let me describe it in more detail.
Let's imagine we have 5 nodes A, B, C, D, E, and let's also assume that all of them compete to solve a puzzle, something like PoW (the details are irrelevant for this scenario). Let's also assume that C and D solve the puzzle simultaneously and send their proofs to the other nodes for verification.
A[C2,D3]
B[C4,D2]
E[C3,D6]
Nodes A, B, E receive messages from C and D. For example, node A receives a message from C at timestamp 2 and from D at timestamp 3, and it also verifies the proofs that C and D solved the puzzle successfully.
What I am searching for now is a way to pick the fastest node among the winners C and D and recognize it as the leader of the protocol. So all nodes exchange their messages and average the times to find the winner with the smallest time. If we pick node B, it will calculate the time for C as (2+4+3)/3 = 3 and for D as (3+2+6)/3 ≈ 3.67.
Hence every node will calculate time 3 for C and time ≈3.67 for D, and finally they choose C, with the lowest average timestamp, as the leader.
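A tiny sketch of that averaging step with the example values above (the data structure and names are hypothetical, just for illustration):

```python
# observations[node][candidate] = local timestamp at which the proof arrived
observations = {
    "A": {"C": 2, "D": 3},
    "B": {"C": 4, "D": 2},
    "E": {"C": 3, "D": 6},
}

def elect_leader(observations):
    candidates = {c for obs in observations.values() for c in obs}
    # Average each candidate's receive timestamps across all observers...
    averages = {
        c: sum(obs[c] for obs in observations.values()) / len(observations)
        for c in candidates
    }
    # ...and pick the candidate with the smallest average.
    return min(averages, key=averages.get), averages

print(elect_leader(observations))  # C wins: average 3.0 vs. roughly 3.67 for D
```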
Am I correct so far? Does this process make sense?
My big problem now is: what happens if node B is malicious and tries to trick the protocol by sending different values to A and E? This would confuse all nodes so they cannot reach consensus. How can we fix this and get past this step?
Can anyone give me any idea?
"Falsehoods programmers believe about time" applies here, particularly the bits about trusting the system clock.
The first good algorithm for leader election was Paxos. A number of others have emerged since. See here for a partial list.
Getting distributed programming right is hard. In the past decade, Jepsen has analyzed a lot of systems, and in the overwhelming majority of cases the guarantees the authors thought they could validly give about how their systems would fail were wrong. You shouldn't expect to beat that average. Therefore I strongly recommend that you delegate leader election to proven software such as ZooKeeper rather than trying to roll your own implementation.

How to calculate the number of messages in BFT?

Do you guys know of any mathematical equation for calculating the total number of messages needed to agree on one value in a BFT (Byzantine Fault Tolerance) algorithm?
Let's say that N = 7 and f = 2
So, how can we calculate how many messages will be sent between nodes before agreement on one value (consensus) is reached?
It depends on the algorithm. Take a look at the PBFT paper: in the normal case it needs O(n²) messages to agree on an operation, because the prepare and commit phases are all-to-all.
Today there are a lot of BFT algorithms with different optimisations, and some of them reduce the number of messages; Tendermint and SBFT are good examples.
Disclaimer: I work on the SBFT algorithm implementation in VMWare.
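As a rough illustration for PBFT's normal case with the asker's N = 7: the exact total depends on bookkeeping choices (whether the primary messages itself, whether the client's request and replies are counted), so treat this as a sketch rather than the paper's exact accounting:

```python
def pbft_normal_case_messages(n: int) -> dict:
    """Approximate point-to-point message counts for one PBFT agreement
    among n replicas, excluding the client's request and the replies."""
    pre_prepare = n - 1          # primary -> each backup
    prepare = (n - 1) * (n - 1)  # each backup -> every other replica
    commit = n * (n - 1)         # every replica -> every other replica
    total = pre_prepare + prepare + commit
    return {"pre-prepare": pre_prepare, "prepare": prepare,
            "commit": commit, "total": total}

print(pbft_normal_case_messages(7))
# {'pre-prepare': 6, 'prepare': 36, 'commit': 42, 'total': 84}
```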

What is an execution of a distributed algorithm from a mathematical point of view?

I have a distributed algorithm for mobile ad hoc networks. How can I express the execution of the algorithm from a mathematical point of view? Is it a tuple? Or is it a set of time instants during which the algorithm is executed?
I doubt there is a universally accepted or a "one-size-fits-all" model for the execution of distributed algorithms. Below are some possible starting points (but only that: just starting points) for exploring the issue:
H. J. Boehm, Reordering Constraints for Pthread-Style Locks, 2005
The page on Temporal logic from Wikipedia
Regarding the issue of what an instant means for distributed algorithms, you may want to read L. Lamport, Time, clocks, and ordering of events in a distributed system, 1978.
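For what it's worth, one common formalization (the one used, for example, in Lynch-style models of distributed algorithms) treats an execution not as a set of time instants but as an alternating sequence of configurations and events:

```latex
% An execution \alpha is an alternating (finite or infinite) sequence
% of configurations C_i and events e_i:
\alpha = C_0, e_1, C_1, e_2, C_2, \ldots
% where C_0 is an initial configuration (roughly, a tuple of all process
% states plus channel contents), and each step
% C_{i-1} \xrightarrow{e_i} C_i is a legal transition of the algorithm.
```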

Is there a non-deterministic leader election algorithm?

I was wondering whether a non-deterministic leader election algorithm exists (in a unidirectional ring) that ensures termination. I cannot think of one, nor can I find one that is non-deterministic. Some that I've found are:
Select the node with the highest process ID as the leader; this is deterministic and terminates.
Randomly decide whether a node wants to be a leader; if there is another node in the ring that wants to be a leader, restart the whole process. This one is not guaranteed to terminate, but terminates with some probability.
Can someone give me some hints on how to create a distributed non-deterministic leader election algorithm? And maybe explain it in layman's terms.
Thanks for everything!
There exists no probabilistic (anonymous) leader election algorithm with both guaranteed termination and guaranteed correctness (exactly one leader). If I remember right, you will find a proof in N. Lynch's book on distributed systems.
However, there exist algorithms whose probability of non-termination goes to zero as the runtime grows. In addition, the expected runtime is rather short (AFAIR, on the order of ln(k) rounds for k initiators).
The main idea for such an algorithm follows your second approach. However, do not simply restart the whole process when several candidates remain; instead, only allow the winners of the last round to become candidates in the next round.
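A centralized simulation of that round structure (just a sketch of the idea, not a message-passing implementation; the names and parameters are mine):

```python
import random

def probabilistic_election(n_candidates: int, p: float = 0.5):
    """Each round, every remaining candidate flips a coin. If at least
    one flips heads, only the heads survive to the next round; if nobody
    does, everyone stays and we retry. Repeat until one candidate remains."""
    candidates = list(range(n_candidates))
    rounds = 0
    while len(candidates) > 1:
        rounds += 1
        winners = [c for c in candidates if random.random() < p]
        if winners:               # somebody volunteered: losers drop out for good
            candidates = winners
    return candidates[0], rounds

leader, rounds = probabilistic_election(8)
print(f"leader {leader} elected after {rounds} rounds")  # typically ~log2(8) rounds
```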
EDIT 1-3:
If you are asking about non-anonymous leader election, there are several probabilistic algorithms that guarantee termination. E.g., take a normal ring algorithm and delay messages with a certain probability: the smaller the ID, the greater the chance of delay. This way, messages with a low chance of winning are dropped earlier, resulting in fewer messages overall.
If you want the outcome itself to be random for non-anonymous members, you can e.g. use a two-phase algorithm:
Perform a classical leader election => the node A with the highest ID wins.
Let A roll a die to determine the actual leader.
The element of chance could also be distributed:
Every node knows the set of identities S (if not, use a flooding algorithm to distribute it).
All nodes pick an ID out of S at random and send it to every other node.
The ID that is named most often wins. If there is more than one such ID, select among them in a deterministic way, e.g., the median.
Termination and a non-deterministic outcome are guaranteed for both algorithms. However, the first has a lower average message complexity (n log n vs. n²; the worst-case complexity is the same) and is fairer (i.e., the probability that an ID wins is uniformly distributed, which is not true for the second algorithm). I'm pretty sure that at least the last disadvantage could be eliminated by a more sophisticated algorithm, but the question here was about the general existence of such an algorithm.
From what I understand from the question, you're not actually looking for an election algorithm, just a distributed algorithm to fairly and randomly choose one client as "leader," where no subset of clients can work together to cheat.
This is actually pretty easy. Treat each client as a card in a deck of cards, then use the mental poker algorithm (which is a distributed algorithm to fairly and randomly shuffle a deck of cards) to shuffle it. Then just take the first card as the leader.
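Mental poker itself takes a fair amount of machinery. As a much simpler sketch of the same trust model, here is a commit-reveal scheme (my own example, not mental poker): each node commits to a random value, reveals it only after seeing all commitments, and the leader index is derived from the combined randomness, so no proper subset controls the outcome once committed. One known weakness to be aware of: a node that dislikes the combined result can still bias things by refusing to reveal, which needs extra handling.

```python
import hashlib, os

nodes = ["A", "B", "C", "D", "E"]

# Phase 1: every node picks a secret and publishes only its hash (commitment).
secrets = {n: os.urandom(16) for n in nodes}
commitments = {n: hashlib.sha256(s).hexdigest() for n, s in secrets.items()}

# Phase 2: once all commitments are public, nodes reveal their secrets,
# and everyone verifies each reveal against the earlier commitment.
for n in nodes:
    assert hashlib.sha256(secrets[n]).hexdigest() == commitments[n]

# The leader index is derived from all secrets together.
combined = hashlib.sha256(b"".join(secrets[n] for n in sorted(nodes))).digest()
leader = sorted(nodes)[int.from_bytes(combined, "big") % len(nodes)]
print("leader:", leader)
```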

Determining the best k for a k nearest neighbour

I need to do some cluster analysis on a set of 2-dimensional data (I may add extra dimensions along the way).
The analysis itself will form part of the data being fed into a visualisation, rather than the inputs into another process (e.g. Radial Basis Function Networks).
To this end, I'd like to find a set of clusters which primarily "looks right", rather than elucidating some hidden patterns.
My intuition is that k-means would be a good starting place for this, but that finding the right number of clusters to run the algorithm with would be problematic.
The problem I'm coming to is this:
How to determine the 'best' value for k such that the clusters formed are stable and visually verifiable?
Questions:
Assuming that this isn't NP-complete, what is the time complexity of finding a good k? (Probably measured in the number of times the k-means algorithm has to be run.)
Is k-means a good starting point for this type of problem? If not, what other approaches would you recommend? A specific example, backed by an anecdote/experience, would be ideal.
What shortcuts/approximations would you recommend to improve performance?
For problems with an unknown number of clusters, agglomerative hierarchical clustering is often a better route than k-means.
Agglomerative clustering produces a tree structure, where the closer you are to the trunk, the fewer the clusters, so it's easy to scan through all numbers of clusters. The algorithm starts by assigning each point to its own cluster and then repeatedly merges the two closest clusters. Keeping track of the merge sequence gives an instant snapshot for any number of possible clusters. It's therefore often preferable to k-means when you don't know how many groups you'll want.
There are other hierarchical clustering methods (see the paper suggested in Imran's comments). The primary advantage of an agglomerative approach is that there are many implementations out there, ready-made for your use.
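For example, a minimal sketch with SciPy (assumes scipy and numpy are installed; the random data is a stand-in for your 2-D points):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))          # stand-in for your 2-D data

Z = linkage(data, method="ward")          # build the full merge tree once
# ...then cut it at any number of clusters without re-running anything:
for k in (2, 3, 4, 5):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, np.bincount(labels)[1:])     # cluster sizes for each k
```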
In order to use k-means, you should know how many clusters there are. You can't use naive meta-optimisation, because the more clusters you add (up to one cluster per data point), the closer you get to over-fitting. You may look for cluster validation methods and optimize the k hyperparameter with them, but in my experience it rarely works well, and it's very costly too.
If I were you, I would run a PCA, possibly in a polynomial feature space (mind your available compute time), depending on what you know about your input, and cluster along the most representative components, as in the sketch below.
More information about your data set would be very helpful for a more precise answer.
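A short sketch of that suggestion (assumes scikit-learn; the random data and the choice of 2 components and 3 clusters are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))             # stand-in for your data

# Reduce to the most representative components, then cluster there.
X_reduced = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(np.bincount(labels))                # cluster sizes
```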
Here's my approximate solution:
Start with k=2.
For a number of tries:
Run the k-means algorithm to find k clusters.
Find the mean square distance from the origin to the cluster centroids.
Repeat steps 2-3 to find the standard deviation of the distances. This is a proxy for the stability of the clusters.
If the stability of the clusters for k is worse than the stability for k - 1, return k - 1.
Increment k by 1.
The thesis behind this algorithm is that the number of sets of k clusters is small for "good" values of k.
If we can find a local optimum for this stability, or an optimal delta for the stability, then we can find a good set of clusters which cannot be improved by adding more clusters.
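A rough runnable translation of the steps above (my own rendering, using scikit-learn; the stability proxy is the standard deviation of the centroid-to-origin mean square distance across restarts, with lower meaning more stable):

```python
import numpy as np
from sklearn.cluster import KMeans

def stability(X, k, tries=10):
    """Std-dev of the mean squared centroid-to-origin distance over
    several k-means runs; lower = more stable, per the heuristic above."""
    dists = []
    for seed in range(tries):
        centers = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X).cluster_centers_
        dists.append(np.mean(np.sum(centers ** 2, axis=1)))
    return np.std(dists)

def pick_k(X, k_max=10):
    prev = stability(X, 2)
    for k in range(3, k_max + 1):
        cur = stability(X, k)
        if cur > prev:            # stability got worse: the previous k wins
            return k - 1
        prev = cur
    return k_max

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))])
print(pick_k(X))
```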
In a previous answer, I explained how Self-Organizing Maps (SOM) can be used in visual clustering.
Otherwise, there exists a variation of the k-means algorithm called X-Means which is able to find the number of clusters by optimizing the Bayesian Information Criterion (BIC); it also addresses scalability by using KD-trees.
Weka includes an implementation of X-Means along with many other clustering algorithms, all in an easy-to-use GUI tool.
Finally, you might refer to this page, which discusses the Elbow Method among other techniques for determining the number of clusters in a dataset.
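A minimal sketch of the elbow method mentioned above (assumes scikit-learn; you print or plot the inertia for each k and look for the point where it stops dropping sharply):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ((0, 0), (6, 0), (3, 5))])

for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))   # inertia drops sharply until the true k (=3 here)
```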
You might look at papers on cluster validation. Here's one that is cited in papers that involve microarray analysis, which involves clustering genes with related expression levels.
One such technique is the Silhouette measure that evaluates how closely a labeled point is to its centroid. The general idea is that, if a point is assigned to one centroid but is still close to others, perhaps it was assigned to the wrong centroid. By counting these events across training sets and looking across various k-means clusterings, one looks for the k such that the labeled points overall fall into the "best" or minimally ambiguous arrangement.
It should be said that clustering is more of a data visualization and exploration technique. It can be difficult to elucidate with certainty that one clustering explains the data correctly, above all others. It's best to merge your clusterings with other relevant information. Is there something functional or otherwise informative about your data, such that you know some clusterings are impossible? This can reduce your solution space considerably.
From your Wikipedia link:
Regarding computational complexity, the k-means clustering problem is:
NP-hard in general Euclidean space d, even for 2 clusters;
NP-hard for a general number of clusters k, even in the plane.
If k and d are fixed, the problem can be exactly solved in time O(n^(dk+1) log n), where n is the number of entities to be clustered. Thus, a variety of heuristic algorithms are generally used.
That said, finding a good value of k is usually a heuristic process (i.e. you try a few and select the best).
I think k-means is a good starting point; it is simple and easy to implement (or copy). Only look further if you have serious performance problems.
If the set of points you want to cluster is exceptionally large, a first-order optimisation would be to randomly select a small subset and use that subset to find your k-means.
Choosing the best K can be seen as a model selection problem. One possible approach is Minimum Description Length, which in this context means: you could store a table with all the points (in which case K = N); at the other extreme, you have K = 1, and all the points are stored as their distances from a single centroid. This section from Introduction to Information Retrieval by Manning and Schütze suggests minimising the Akaike Information Criterion as a heuristic for an optimal K.
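A sketch of that AIC heuristic as I read it (the simplified penalty form RSS + 2·M·K, with M the dimensionality, follows the IIR book's model; treat the exact form as an assumption of this sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

def aic_for_k(X, k):
    """Simplified AIC: residual sum of squares plus a 2*M*K penalty,
    where M is the dimensionality and K the number of clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    rss = km.inertia_                      # within-cluster sum of squares
    return rss + 2 * X.shape[1] * k        # penalise extra clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(60, 2)) for c in ((0, 0), (7, 0), (3, 6))])
best_k = min(range(1, 10), key=lambda k: aic_for_k(X, k))
print(best_k)
```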
This problem belongs to the "internal evaluation" class of clustering optimisation problems, for which the current state of the art seems to use the Silhouette coefficient, as stated here:
https://en.wikipedia.org/wiki/Cluster_analysis#Applications
and here:
https://en.wikipedia.org/wiki/Silhouette_(clustering) :
"silhouette plots and averages may be used to determine the natural number of clusters within a dataset"
scikit-learn provides a sample implementation of this methodology here:
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
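A condensed version of the kind of analysis in that linked example, picking the k with the highest mean silhouette score (random data is a stand-in):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(60, 2)) for c in ((0, 0), (6, 0), (3, 5))])

scores = {}
for k in range(2, 8):                      # silhouette needs k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(max(scores, key=scores.get), scores) # the k with the best average silhouette
```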
