Submitted: 8 July 2022
Major revision: 3 October 2022
Reviewer 1: Author Should Prepare A Minor Revision
Reviewer 2: Author Should Prepare A Minor Revision
Reviewer 3: Author Should Prepare A Major Revision For A Second Review
Reviewer 4: Author Should Prepare A Major Revision For A Second Review
The reviewers appreciate the manuscript and provide feedback on several points for further improvement. I would like to encourage the authors to address these points in the manuscript, especially those from reviewers 3 and 4.
Author Should Prepare A Minor Revision
Please rate the manuscript. Please explain your choice.: Good
Author Should Prepare A Minor Revision
The paper addresses a relevant problem, offers an interesting solution, is well structured, easy to follow and a pleasure to read.
Following the textual description and the pen-and-paper proof, I am confident that the protocol is sound (to the extent that humans are able to verify non-machine-checked proofs). However, I think the pseudocode in Algorithm 1 contains two typos that need to be fixed: unless I misunderstood the protocol, before a vote reaches confirmation, a replica (having voted already) should support the winning value of the previous round and relay with its vote for round r+1 all previous votes. That is,
Line 23: should be value' <- WINNINGVALUE(votes[r])
and
Line 25: votes[r+1] <- {vote} \cup votes[r+1]
If the algorithm is to remain as presented, please indicate how replicas are prevented from continuing to support different values when the decisions on which they base their selection of the winning vote are not communicated.
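To make the suggested correction concrete, here is a minimal sketch of the round-advance step under my reading of the protocol. The names follow the paper's notation (WINNINGVALUE, votes, round index r), but the helper implementation and vote structure are my own assumptions, not the paper's code:

```python
# Sketch (my reading, not the paper's code) of the corrected round-advance
# step for Algorithm 1, lines 23 and 25. The vote structure and the
# winning-value helper are assumptions based on the paper's notation.

def winning_value(round_votes):
    # Tally votes per value; among the most-voted values, break ties
    # deterministically (the paper uses lexicographic order of the vote
    # hashes; plain string order stands in for that here).
    tally = {}
    for vote in round_votes:
        tally[vote["value"]] = tally.get(vote["value"], 0) + 1
    top = max(tally.values())
    return min(v for v, c in tally.items() if c == top)

def advance_round(votes, r, replica_id):
    # Suggested line 23: support the winning value of the PREVIOUS round r.
    value = winning_value(votes[r])
    vote = {"replica": replica_id, "round": r + 1, "value": value}
    # Suggested line 25: relay the new vote together with all votes already
    # collected for round r+1, so the basis of the selection is communicated.
    votes[r + 1] = [vote] + votes.get(r + 1, [])
    return vote
```

The point of the second change is that the relayed set lets other replicas reproduce the same winning-value selection instead of diverging silently.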
Also, although the article claims to deal with the case that "not all replicas are connected to each other" (p. 2, 2nd paragraph), the protocol itself collects n-f proposals from different replicas before any progress is made, where f is the fault threshold.
Please clarify whether, in the fault model, a temporarily disconnected but otherwise healthy replica is subsumed by f, or explain how the protocol achieves progress in this case. In particular, why is it not enough to operate on a safe and live quorum with fewer than n-f replicas (e.g., for n > 3f+1, the conditions 2|Q| - n >= f+1 and |Q| <= n-f allow for additional replicas to be late without being faulty)?
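For illustration, a small numeric check (my own arithmetic, not taken from the paper) of the two quorum conditions just mentioned:

```python
# My own arithmetic, not from the paper: enumerate quorum sizes |Q| that
# satisfy both the intersection condition 2|Q| - n >= f + 1 (any two
# quorums share at least f + 1 replicas, hence a correct one) and the
# availability condition |Q| <= n - f.

def valid_quorum_sizes(n, f):
    smallest = -(-(n + f + 1) // 2)  # ceil((n + f + 1) / 2), from 2q - n >= f + 1
    largest = n - f                  # a quorum this size survives f faults
    return list(range(smallest, largest + 1))

# For n = 3f + 1 (e.g. n = 10, f = 3) the only valid size is n - f itself:
assert valid_quorum_sizes(10, 3) == [7]
# For n > 3f + 1 (e.g. n = 12, f = 3) a quorum of 8 < n - f = 9 is also
# valid, leaving a slot for a late-but-correct replica:
assert valid_quorum_sizes(12, 3) == [8, 9]
```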
It would have been nice if the latency in Fig. 8 also showed the comparison to Tendermint.
I recommend acceptance after a minor revision.
Please rate the manuscript. Please explain your choice.: Excellent
Author Should Prepare A Major Revision For A Second Review
This paper proposes a browser-based platform for client-centric, community-driven applications. To this end, the authors also devise a new leaderless Byzantine fault-tolerant consensus protocol and a state-based synchronization mechanism. However, there are some points of confusion and drawbacks, as follows:
Please rate the manuscript. Please explain your choice.: Fair
Author Should Prepare A Major Revision For A Second Review
The paper presents a novel platform relying on a leaderless BFT consensus protocol, which compared to existing protocols, behaves well in networks with a high rate of network failures. This is shown experimentally by comparing BeauForT with BFT-SMaRt and Tendermint, two state-of-the-art protocols. The framework combines a novel leaderless BFT protocol, with a reliable gossip protocol to synchronize the states of the replicas (which seems to be adapted from the OWebSync framework). The paper overall reads well, and encouraging results are provided in the evaluation section. It is however unclear to which extent the paper advances the state-of-the-art as highlighted below. Here are my concerns with the current version of the paper.
- Unclear system model. The authors do not specify in section 3.1 any constraint on the connectivity of the network. However, the network needs to be sufficiently connected for liveness. Do you require the standard 2f+1-connectivity constraint?
- Missing related work. Many relevant papers on leaderless consensus are missing from the related-work section, such as
The paper claims a novel leaderless consensus protocol, but this related work is not discussed. The new algorithm should be compared to state-of-the-art leaderless protocols.
The paper claims in section 3.1 that "In such leader-based protocols, the failure of a leader leads to a long delay before consensus can be reached". While this is true for protocols with dedicated view-change subprotocols such as PBFT, it is not necessarily true for protocols with rotating leaders such as HotStuff or Streamlet.
- Integration of the gossip protocol. Some pseudo-code is provided for the consensus protocol. However, the gossip protocol is only discussed at a high level in section 3.2.4. It is not clear from the text how this gossip protocol works and how it is integrated into the consensus protocol (for example, what is an invalid state?), which appears to rely on the gossip protocol at least to relay the initial state of the replicas. This should be clarified.
Section 3.2.4 mentions that this gossip protocol is similar for example to OWebSync. How does it differ from it?
According to the text in section 3.2.2 on page 4, it seems that the gossip protocol is used to gossip the entire state of the replicas. However, the text mentions that this is used to guarantee that votes[0] will become the same across the replicas. Why not just gossip votes[0]?
It sounds like this gossip protocol could potentially lead to large messages or to a large number of exchanged messages. Could you discuss the message complexity of your protocol?
Section 3.2.2 also mentions that the lexicographic order of the hash values of the votes is used to break ties. Wouldn't this prevent most client requests from being accepted if a malicious node keeps on proposing a request with a low hash value?
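To make this concern concrete, here is a toy sketch (hypothetical; the hash and message formats are my own, not the paper's) of how a malicious proposer could grind for a low hash value and thereby win every lexicographic tie-break:

```python
import hashlib

# Hypothetical grinding attack against a lowest-hash tie-break rule: the
# attacker varies a nonce in its request until the vote hash is
# lexicographically small enough to beat almost any honest hash.

def vote_hash(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def grind_low_hash(base: bytes, attempts: int = 10_000):
    best_payload, best_hash = base, vote_hash(base)
    for nonce in range(attempts):
        payload = base + nonce.to_bytes(8, "big")
        h = vote_hash(payload)
        if h < best_hash:
            best_payload, best_hash = payload, h
    return best_payload, best_hash

# With roughly 10^4 attempts, the best hash typically gains about three
# leading zero hex digits, far below an average honest (unground) hash.
```

Grinding is cheap (one hash per attempt), so unless vote hashes bind to something the attacker cannot vary, the tie-break would seem to favor the adversary systematically.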
- Use case. It is not clear how suitable this protocol is for the use case detailed in section 5.1. Quorums need to be formed to validate requests. Do you envision that users will be willing to keep running this application to help requests from other users to be accepted? What will be the incentive?
- Algorithm 1. Some information is missing to follow the pseudo-code presented in Algorithm 1, and some notation is left undefined there. For example, VOTE seems to send a vote message, and VOTESFORVALUE appears to compute the votes for a given value in a list of votes; however, these are not explicitly defined.
The non-proposing replicas "wait for any value in votes" line 10. How does votes get populated for non-proposing replicas prior to starting the pre-commit phase?
I did not understand what lines 28-30 of the code do. Could this be clarified?
In general, adding pointers to the code in the text would help follow the algorithm.
- Example. While the example provided in section 3.2.5 is very useful to follow what the protocol does, an example involving corner cases such as multiple initial votes would have been more illuminating. Could you add such an example?
- Signatures. What is sigma^ti_i in section 4.2? Is it related to sigma_i?
- Experiments. 10 repeats does not appear to be much, especially given the tail latency reported in figure 5. Do you get similar results with more repeats?
Please specify the connectivity between replicas, and whether it changes from one run to another.
How much of the increased latency reported in figure 5 is due to the multi-hop connections between nodes, and how much is due to the expensive BLS scheme (at least for small systems)?
Could you explain the sudden latency increase for BFT-SMaRt in figure 7, when going from 20 replicas to 40 replicas?
The text mentions the median latency of BFT-SMaRt in figure 7. Where is the median latency?
- Proofs in appendix A. The proof of lemma 3 refers to a quorum certificate being accepted, which presumably refers to the fact some replica created such a certificate, but this is not explained. Does this refer to line 32? Could this be clarified?
I didn't quite follow the proof of lemma 5. How can one replica observing n-f votes for value1 lead to n-f votes being cast for some quorum certificate? What if this replica is faulty? Doesn't this lemma require a majority of replicas observing these votes?
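For reference, this is the standard quorum-intersection counting I would expect the proof to rely on (my own arithmetic, not taken from the paper):

```python
# My own sanity check, not from the paper: any two sets of n - f replicas
# intersect in at least n - 2f replicas, which is at least f + 1 when
# n >= 3f + 1, so the intersection always contains a correct replica.
# Note this argues about two quorums, not about a single (possibly
# faulty) observer, which is the gap I am asking about.

def min_overlap(n, f):
    # Inclusion-exclusion lower bound on |A ∩ B| for |A| = |B| = n - f.
    return 2 * (n - f) - n

for f in range(1, 6):
    n = 3 * f + 1
    assert min_overlap(n, f) == f + 1  # at least one non-Byzantine replica
```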
What does "known" mean in "At least n - f votes are known"? Does it mean known by n-f replicas or just by one replica?
I didn't quite get this sentence: "either an unknown vote stays unknown, or it becomes known" in the proof of lemma 6. Shouldn't it be: "either all unknown votes stay unknown or one becomes known"?
The proof of lemma 7 mentions Byzantine replicas being excluded. Is that required for the protocol to be live? In that case, could this exclusion mechanism be described in the main body of the paper?
Please rate the manuscript. Please explain your choice.: Good