TPDS-2022-07-0416 Reviews and Comments

TPDS-2022-07-0416 BeauForT: Robust Byzantine Fault Tolerance for Client-centric Mobile Web Applications

Submitted: 8 July 2022
Major revision: 3 October 2022

Reviewer 1: Author Should Prepare A Minor revision
Reviewer 2: Author Should Prepare A Minor Revision
Reviewer 3: Author Should Prepare A Major Revision For A Second Review
Reviewer 4: Author Should Prepare A Major Revision For A Second Review

Editor Comments

Comments to the Author

Reviewers appreciate the manuscript and provide feedback on several points for further improvement. I would like to encourage authors to address them, especially from reviewer 3 and 4, in the manuscript.

Reviewer 1

Recommendation

Author Should Prepare A Minor Revision

Comments

Additional Questions:

  1. Please explain how this manuscript advances this field of research and/or contributes something new to the literature.: This research proposes a novel, optimistic, leaderless, gossip-based consensus protocol, tolerating Byzantine replicas, combined with a robust and efficient state-based synchronization protocol.
  2. Is the manuscript technically sound? Please explain your answer under Public Comments below.: Yes
  1. Which category describes this manuscript?: Research/Technology
  2. How relevant is this manuscript to the readers of this periodical? Please explain your rating under Public Comments below.: Relevant
  1. Are the title, abstract, and keywords appropriate? Please explain under Public Comments below.: Yes
  2. Does the manuscript contain sufficient and appropriate references? Please explain under Public Comments below.: Number of references are excessive
  3. Does the introduction state the objectives of the manuscript in terms that encourage the reader to read on? Please explain your answer under Public Comments below.: Yes
  4. How would you rate the organization of the manuscript? Is it focused? Is the length appropriate for the topic? Please explain under Public Comments below.: Satisfactory
  5. Please rate the readability of the manuscript. Explain your rating under Public Comments below.: Easy to read
  6. Should the supplemental material be included? (Click on the Supplementary Files icon to view files): Does not apply, no supplementary files included
  7. If yes to 6, should it be accepted: As is
  8. Would you recommend adding the code/data associated with this paper to help address your concerns and/or strengthen the paper?: No

Please rate the manuscript. Please explain your choice.: Good

Reviewer 2

Recommendation

Author Should Prepare A Minor Revision

Comments

The paper addresses a relevant problem, offers an interesting solution, is well structured, easy to follow and a pleasure to read.

Following the textual description and the pen-and-paper proof, I am confident that the protocol is sound (to the extend humans are able to verify non-machine-checked proofs). However, I think the pseudocode in Algorithm 1 contains two typos that need to be fixed: Unless I misunderstood the protocol, before a vote reaches confirmation, (having voted already) should support the winning value of the previous round and relay with his vote for round r+1 all previous votes. That is,
Line 23: should be value' <- WINNINGVALUE(votes[r])
and
Line 25: votes[r+1] <- {vote} \cup votes[r+1]

If the algorithm should remain as presented, please indicate how replicas are prevented from continuing to support different values if the decision on which they base their selection of the winning vote are not communicated.

Also, although the article claims to deal with the case that "not all replicas are connected to each other" pg 2 2nd paragraph, the protocol itself collects n-f proposals from different replicas before any progress is made, where f is the fault threshold.

Please clarify the fault model that a temporarily disconnected but otherwise healthy replica is subsumed by f or explain how the protocol achieves this progress. In particular, why is it not enough to operate on a safe and life quorum with less than n-f replicas (e.g., for n > 3f+1, the conditions 2 |Q| - n >= f+1 and |Q| <= n-f, allow for additional replicas to be late that may not be faulty)?

It would have been nice if the latency in Fig.8 would as well show the comparisson to tendermint.

I recommend acceptance after a minor revision.

Additional Questions:

  1. Please explain how this manuscript advances this field of research and/or contributes something new to the literature.: This manuscript proposes a Byzantine fault tolerant protocol for web-based applications to agree on a value (e.g., of a key in a key-value store). The proposed protocol aims to address the case where not all replicas are available and closely connected, by advancing in a leaderless manner.
  2. Is the manuscript technically sound? Please explain your answer under Public Comments below.: Partially
  1. Which category describes this manuscript?: Research/Technology
  2. How relevant is this manuscript to the readers of this periodical? Please explain your rating under Public Comments below.: Relevant
  1. Are the title, abstract, and keywords appropriate? Please explain under Public Comments below.: Yes
  2. Does the manuscript contain sufficient and appropriate references? Please explain under Public Comments below.: References are sufficient and appropriate
  3. Does the introduction state the objectives of the manuscript in terms that encourage the reader to read on? Please explain your answer under Public Comments below.: Yes
  4. How would you rate the organization of the manuscript? Is it focused? Is the length appropriate for the topic? Please explain under Public Comments below.: Satisfactory
  5. Please rate the readability of the manuscript. Explain your rating under Public Comments below.: Easy to read
  6. Should the supplemental material be included? (Click on the Supplementary Files icon to view files): Does not apply, no supplementary files included
  7. If yes to 6, should it be accepted: After revisions. Please include explanation under Public Comments below.
  8. Would you recommend adding the code/data associated with this paper to help address your concerns and/or strengthen the paper?: Yes

Please rate the manuscript. Please explain your choice.: Excellent

Reviewer 3

Recommendation

Author Should Prepare A Major Revision For A Second Review

Comments

This paper tries to propose a browser-based platform for client-centric community-driven applications. Towards this, the authors also devise a new leaderless Byzantine fault tolerant consensus and a state-based synchronization mechanism. However, there are some confusions or drawbacks as follows:

  1. It seems that the application scenario of client-centric mobile web applications have a varying and uncertain number of replicas. In other words, the environment is permissionless. However, the proposed BFT consensus is permissioned.
  2. The leaderless BFT consensus may lose liveness property. To be more specific, if there are four replicas (A, B, C, and D), where D is Byzantine. In the first round, A, B, and D send the vote to A for the value (m) and C send the vote for the value (n), A gets 3 votes for m in PRE-COMMIT phase and enters COMMIT phase. However, both B and C does not get 3 votes for either m or n, since D does not send votes to them. Therefore, both B and C enters the next round. Since A does not enter the next round and D is Byzantine, none of B or C can proceed and liveness is compromised.
  3. It seems that the client needs to send the values to all the replicas, whose communication overhead can be quite high.
  4. Some statements are not accurate. For example, in Section 1, the authors state that Algorand assumes every node is connected to all other nodes. In fact, this is not true since Algorand works in a permissionless environment.
  5. Some typos. For example: In Fig. 1, when the replica chooses to enter the new round or commit the view?

Additional Questions:

  1. Please explain how this manuscript advances this field of research and/or contributes something new to the literature.: The authors devise a new leaderless Byzantine fault tolerant consensus and a state-based synchronization mechanism for client-centric mobile web applications.
  2. Is the manuscript technically sound? Please explain your answer under Public Comments below.: Appears to be - but didn't check completely
  1. Which category describes this manuscript?: Research/Technology
  2. How relevant is this manuscript to the readers of this periodical? Please explain your rating under Public Comments below.: Relevant
  1. Are the title, abstract, and keywords appropriate? Please explain under Public Comments below.: No
  2. Does the manuscript contain sufficient and appropriate references? Please explain under Public Comments below.: References are sufficient and appropriate
  3. Does the introduction state the objectives of the manuscript in terms that encourage the reader to read on? Please explain your answer under Public Comments below.: Yes
  4. How would you rate the organization of the manuscript? Is it focused? Is the length appropriate for the topic? Please explain under Public Comments below.: Could be improved
  5. Please rate the readability of the manuscript. Explain your rating under Public Comments below.: Readable - but requires some effort to understand
  6. Should the supplemental material be included? (Click on the Supplementary Files icon to view files): Does not apply, no supplementary files included
  7. If yes to 6, should it be accepted:
  8. Would you recommend adding the code/data associated with this paper to help address your concerns and/or strengthen the paper?: No

Please rate the manuscript. Please explain your choice.: Fair

Reviewer 4

Recommendation

Author Should Prepare A Major Revision For A Second Review

Comments

The paper presents a novel platform relying on a leaderless BFT consensus protocol, which compared to existing protocols, behaves well in networks with a high rate of network failures. This is shown experimentally by comparing BeauForT with BFT-SMaRt and Tendermint, two state-of-the-art protocols. The framework combines a novel leaderless BFT protocol, with a reliable gossip protocol to synchronize the states of the replicas (which seems to be adapted from the OWebSync framework). The paper overall reads well, and encouraging results are provided in the evaluation section. It is however unclear to which extent the paper advances the state-of-the-art as highlighted below. Here are my concerns with the current version of the paper.

- Unclear system model. The authors do not specify in section 3.1 any constraint on the connectivity of the network. However, the network needs to be sufficiently connected for liveness. Do you require the standard 2f+1-connectivity constraint?

- Missing related work. Many relevant papers on leaderless consensus are missing in the related work section such as

The paper claims a novel leaderless consensus protocol but the related work is not discussed. This new algorithm should be compared to state-of-the-art leaderless protocols.

The paper claims in section 3.1 that "In such leader-based protocols, the failure of a leader leads to a long delay before consensus can be reached". While this is true for protocols with dedicated view-change protocols such a PBFT, it is not necessarily true for protocols with rotating leaders such as HotStuff or StreamLet.

- Integration of the gossip protocol. Some pseudo-code is provided for the consensus protocol. However the gossip protocol is only discussed at a high level in section 3.2.4. It is not clear from the text how this gossip protocol works and how it is integrated in the consensus protocol (for example, what is an invalid state?), which appears to rely on the gossip protocol to at least relay the initial state of the replicas. This should be clarified.

Section 3.2.4 mentions that this gossip protocol is similar for example to OWebSync. How does it differ from it?

According to the text in section 3.2.2 on page 4, it seems that the gossip protocol is used to gossip the entire state of the replicas. However, the text mentions that this is used to guarantee that votes[0] will become the same across the replicas. Why not just gossip votes[0]?

It sounds like this gossip protocol could potentially lead to large messages or to a large number of exchanged messages. Could you discuss the message complexity of your protocol?

Section 3.2.2 also mentions that the lexicographic order of the hash values of the votes is used to break ties. Wouldn't this prevent most client requests from being accepted if a malicious node keeps on proposing a request with a low hash value?

- Use case. It is not clear how suitable this protocol is for the use case detailed in section 5.1. Quorums need to be formed to validate requests. Do you envision that users will be willing to keep running this application to help requests from other users to be accepted? What will be the incentive?

- Algorithm 1. Some information is missing to follow the pseudo-code presented in Algorithm 1. Some notation is left undefined there. For example, VOTE seems to send a vote messages, and VOTESFORVALUE appears to compute the votes for a given value in a list of votes, however, those are not explicitly defined.

The non-proposing replicas "wait for any value in votes" line 10. How does votes get populated for non-proposing replicas prior to starting the pre-commit phase?

I did not get what the code line 28-30 does. Could this be clarified?

In general, adding pointers to the code in the text would help follow the algorithm.

- Example. While the example provided in section 3.2.5 is very useful to follow what the protocol does, an example involving corner cases such as multiple initial votes would have been more illuminating. Could you add such an example?

- Signatures. What is sigma^ti_i in section 4.2? Is it related to sigma_i?

- Experiments. 10 repeats does not appear to be much especially given the tail latency reported in figure 5. Do you get similar results with more repeats?

Please specify the connectivity between replicas, and whether it changes from one run to another.

How much is the increased latency reported in figure 5 due to the multi-hop connections between nodes, and how much is it due to the expansive BLS scheme (at least for small systems)?

Could you explain the sudden latency increase for BFT-SMaRt in figure 7, when going from 20 replicas to 40 replicas?

The text mentions the median latency of BFT-SMaRt in figure 7. Where is the median latency?

- Proofs in appendix A. The proof of lemma 3 refers to a quorum certificate being accepted, which presumably refers to the fact some replica created such a certificate, but this is not explained. Does this refer to line 32? Could this be clarified?

I didn't quite follow the proof of lemma 5. How can one replica observing n-f votes for value1 lead to n-f votes cast for some quorum certificate. What if this replica is faulty? Doesn't this lemma require a majority of replicas observing these votes?

What does "known" means in "At least n - f votes are known"? Does it mean known by n-f replicas or just by 1 replica?

I didn't quite get this sentence: "either an unknown vote stays unknown, or it becomes known" in the proof of lemma 6. Shouldn't it be: "either all unknown votes stay unknown or one becomes known"?

The proof of lemma 7 mentions Byzantine replicas being excluded. Is that required for the protocol to be live? In that case, could this exclusion mechanism be described in the main body of the paper?

Additional Questions:

  1. Please explain how this manuscript advances this field of research and/or contributes something new to the literature.: The paper presents a novel platform relying on a leaderless BFT consensus protocol, which compared to existing protocols, behaves well in networks with a high rate of network failures. This is shown experimentally by comparing BeauForT with BFT-SMaRt and Tendermint. The framework combines a novel leaderless BFT protocol, with a reliable gossip protocol to synchronize the states of the replicas (which seems to be adapted from the OWebSync framework). It is however unclear to which extent the paper advances the state-of-the-art as highlighted below.
  2. Is the manuscript technically sound? Please explain your answer under Public Comments below.: Partially
  1. Which category describes this manuscript?: Research/Technology
  2. How relevant is this manuscript to the readers of this periodical? Please explain your rating under Public Comments below.: Very Relevant
  1. Are the title, abstract, and keywords appropriate? Please explain under Public Comments below.: Yes
  2. Does the manuscript contain sufficient and appropriate references? Please explain under Public Comments below.: Important references are missing; more references are needed
  3. Does the introduction state the objectives of the manuscript in terms that encourage the reader to read on? Please explain your answer under Public Comments below.: Yes
  4. How would you rate the organization of the manuscript? Is it focused? Is the length appropriate for the topic? Please explain under Public Comments below.: Satisfactory
  5. Please rate the readability of the manuscript. Explain your rating under Public Comments below.: Readable - but requires some effort to understand
  6. Should the supplemental material be included? (Click on the Supplementary Files icon to view files): Does not apply, no supplementary files included
  7. If yes to 6, should it be accepted:
  8. Would you recommend adding the code/data associated with this paper to help address your concerns and/or strengthen the paper?: Yes

Please rate the manuscript. Please explain your choice.: Good