EuroSys '19 Paper #552 Reviews and Comments

Paper #552 OWebSync: A web middleware with state-based replicated data types and Merkle-trees for the synchronization of distributed web clients

Submitted: 1 October 2018
Rejected: 20 December 2018

Review A: 3. Weak accept
Review B: 2. Weak reject
Review C: 3. Weak accept
Review D: 3. Weak accept
Review E: 3. Weak accept
Review F: 3. Weak accept

Review A

Paper summary

This paper proposes OWebSync, a web middleware with state-based replicated data types and Merkle trees for the swift synchronization of distributed web clients. The evaluation covers both online and offline synchronization, with the network available and after a network disruption; the comparison is against ShareDB and Yjs, both operation-based systems.

Reasons to accept

What changes would make you value the paper higher?

Exciting or thought-provoking?

2. Somewhat, I learned a few things.

Overall merit

3. Weak accept

Detailed comments for authors

The synchronization performance achieved by OWebSync seems to be due to message batching and Merkle trees. However, there is no breakdown of how much each contributes to the results presented.

The evaluation is done against ShareDB and Yjs, with mostly positive results. It seems that OWebSync does a better job of synchronizing objects when the number of clients is higher. Does that mean the benefits of Merkle trees only become visible once there is more communication? It would seem so, but it would be nice to have this properly broken down.

The results for ShareDB in Table 1 (online updates) are very strange. The time to synchronize 100 objects over 8 clients is on average 9x higher than the time to synchronize 1000 objects across the same number of clients. I see the standard deviation is huge, but it is still unexpected. Can the authors venture some hypotheses as to why that is? This kind of discussion of the results, beyond reciting the numbers already included in the tables and figures, is mostly missing. What is it in ShareDB and Yjs that produces these huge standard deviations?

Figure 7 shows the network usage across all 3 compared systems, and OWebSync has by far the largest usage; yet for 1000 objects it still outperforms the other two systems. Could the authors explain, and better yet measure, how much time is spent on network transfers versus the actual synchronization?

The results reported in Table 2 for ShareDB actually make much more sense, for the offline scenario. Why is that, and why is it not the case for online updates (Table 1)?

Review B

Paper summary

Introduces a middleware for enterprise workflow applications, which run in a browser and permit disconnected operation. The implementation is based on state-based CRDTs with Merkle trees to facilitate delta compression in state transfers.

Long presentation of benchmark results and timings from two representative applications.

Reasons to accept

Interesting applications and promising technology.

Novel idea for enhancing state based CRDT systems with focus on disconnected client web operations.

What changes would make you value the paper higher?

Little novelty past the basic idea of combining state-based CRDT with efficient state transfer.

Inconclusive evaluation; the long tables present measurements that should be shown graphically.

Exciting or thought-provoking?

2. Somewhat, I learned a few things.

Overall merit

2. Weak reject

Detailed comments for authors

Introduces a web middleware for synchronizing JSON objects between clients' web browsers, targeting 20-30 concurrent users on one document. The idea is to enable Google-Docs-like collaboration features, but with many offline clients.

The implementation uses state-based CRDTs and efficient state transfer. All of the data type implementations are borrowed from [15]; however, there are only very brief explanations of the basic objects used. The paper relies on [15] in many places and therefore feels a bit "thin".

I missed details on the library's API and programming model; this could have been a more interesting contribution.

The library in Section 3 and the system implementation in Section 4 are described at a high level, although sometimes unnecessary details are given. For example, for Figure 2 it would have been helpful to illustrate some of the protocols that operate on the data, not only the JSON representation of an arbitrary object.

The benchmarks in Section 5 are too long. The presentation format, with the two large tables on pages 8 and 10, should be changed to something that is easier to understand. You could still publish the raw data online for specialists.

Review C

Paper summary

Presents and evaluates OWebSync, a web middleware that supports online and offline concurrent editing of shared documents. It consists of a client-side JS library and a server-side process, which cooperate to cache and synchronize JSON data structures built from two CRDTs: ORMaps and LWW-registers. Merkle trees are used internally to aid efficient synchronization. An experimental evaluation shows that, relative to existing work, OWebSync has higher latency and network overhead for small numbers of objects and concurrent editors, but scales better to large data sets and more editors, and handles offline users more gracefully.

Reasons to accept

The paper addresses a relevant problem. The paper is very well written, it explains the design and relevant background clearly. The experimental evaluation is careful and explained in detail.

What changes would make you value the paper higher?

A clearer statement of technical novelty and contribution. A better sense of the extent to which OWebSync's better scalability to large numbers of objects and editors, and its more graceful offline operation, matter in practice.

Exciting or thought-provoking?

1. No, it has little originality.

Overall merit

3. Weak accept

Detailed comments for authors

I enjoyed reading this paper. The writing is very clear and the authors made an effort to explain all the relevant background material clearly. I also appreciate the effort that has gone into building and evaluating a working system.

The paper is not very clear on its contribution. Beyond combining CvRDTs with Merkle trees, how does the system differ from existing work? Or do the authors see the main contribution in studying different synchronization protocols in this particular context?

Thank you for the careful experimental evaluation and for being very clear about both OWebSync's benefits and performance limits. Given that OWebSync shines with large numbers of objects, many concurrent editors, and offline operation, I was wondering what the authors' thoughts were on how important these aspects are in practice, relative to the conditions under which the existing work seems to do better.

Review D

Paper summary

In order to support availability and disconnected operation for line-of-business applications, the paper uses state-based CRDTs. The key contribution is the use of Merkle trees to compare replicas and derive optimal update messages (deltas). The paper compares performance with two operation-based approaches, one using CRDTs and the other OT. Experiments show that the new approach propagates updates faster, with more consistent performance, when a disconnection lasts a long time or when the number of users is large.

Reasons to accept

What changes would make you value the paper higher?

Exciting or thought-provoking?

3. Yes, I can't stop thinking about it.

Overall merit

3. Weak accept

Detailed comments for authors

I find this work very interesting. It is one of the first full-blown accounts of using CRDTs in a production environment, and of the challenges of using and implementing CRDTs. This has great value. The comparison with two existing systems with different design choices (one operation-based CRDT, one OT) is particularly useful. The experiments include a number of pertinent metrics.

However the comparison stops at the numbers, and does not explain them. I have the intuition that deltas are a big win for catching up after a long disconnect, but that the high setup cost of Merkle trees is not always worth it. Is my intuition correct? I'm sure the authors could say, but they don't.

Furthermore, why does state+Merkle perform better than the other approaches, experimentally, for high numbers of users? I have no intuition for why this would be the case, and I fear it might just be an implementation artefact. Please analyse and explain.

The experiments in the paper do not compare with other (non-Merkle) approaches to transmitting deltas. I am aware of at least two: several studies by Baquero's group [JPDC 2018, arXiv:1803.02750; SRDS 2017, arXiv:1603.01529; etc.], and the work of van der Linde [reference 17]. An experimental comparison with Kleppmann's JSON CRDT would also be pertinent [reference 7].


Review E

Paper summary

The paper presents a web middleware that supports synchronization of both online and offline clients concurrently editing shared data sets. The data model is a state-based Convergent Replicated Data Type for the replication of JSON data structures, with Merkle trees used to find data changes (a data model better suited to semi-structured data than to pure text editing). An experimental evaluation shows that, compared with approaches based on operational transformation or operation-based replicated data types (used by ShareDB and Yjs, respectively, both popular JavaScript technologies for web-based data synchronization), the proposed approach scales better to tens of concurrent editors on a single document (up to 24 clients tested) and recovers better from offline situations.

Reasons to accept

What changes would make you value the paper higher?

Exciting or thought-provoking?

2. Somewhat, I learned a few things.

Overall merit

3. Weak accept

Detailed comments for authors

An interesting paper to read. My comments are listed below in no particular order of importance:

The paper's evaluation shows that the proposed system scales better to a larger number of concurrent clients (up to 24) and a larger number of edited objects before hitting the maximum latency users will tolerate before getting annoyed (3-5 seconds). One thing that was not clear is why the authors did not compare with Google Docs, which is presumably the most popular online collaborative tool. Given the claim that Google Docs supports up to 100 concurrent users, it would be interesting to see the latencies Google Docs incurs alongside those of the proposed OWebSync.

Please change the title. There's got to be a more concise title representing the work described in the paper! :-)

The eWorkforce case study is unnecessary for this paper. The authors list it as a motivating scenario and then ignore it in the evaluation section, stating that "technicians typically work on their own data island and the data contains less objects with less frequent changes". I suggest removing this case study since the authors do not evaluate it.

Section 3 states that a timestamp is kept in the "Last-Write-Wins" register. It is not clear whether clocks are assumed to be synchronized across clients, or even whether timestamps are necessary at all, since all updates are serialized at the server in order of arrival. This was a grey area and needs to be clarified.
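To make the question concrete: in the standard state-based LWW-register construction from the CRDT literature (a minimal sketch below; the names are illustrative, not OWebSync's actual API), the timestamp decides the winner when two replicas merge, which is precisely the situation where server arrival order is unavailable, e.g. when both writes were made offline:

    // Minimal sketch of a state-based LWW-register, assuming the standard
    // construction; illustrative only, not OWebSync's code.
    class LWWRegister {
      constructor(value, timestamp, replicaId) {
        this.value = value;
        this.timestamp = timestamp; // physical or logical clock? (the grey area)
        this.replicaId = replicaId; // tie-breaker for equal timestamps
      }
      // Commutative, associative, idempotent join: the write with the
      // larger (timestamp, replicaId) pair wins, regardless of arrival order.
      merge(other) {
        if (other.timestamp > this.timestamp ||
            (other.timestamp === this.timestamp &&
             other.replicaId > this.replicaId)) {
          return new LWWRegister(other.value, other.timestamp, other.replicaId);
        }
        return this;
      }
    }

Whether that timestamp is a physical clock (requiring synchronization) or a logical one is exactly what the paper leaves open.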

Citation [12] is missing venue/publication source.

It was unclear what the timeline figures 9 and 10 are showing. The caption of Figure 10a states that "the peak at 120 seconds" is due to the 1-minute failure, but the peak in synchronization time actually occurs between minutes 3 and 4 on the x-axis timeline. In general, this part of the evaluation was unclear. What is the point the authors are trying to make here?

The evaluation section does not explain why OWebSync breaks down where it does, nor why it outperforms the other approaches. It would be nice to understand the fundamental reason for the superior performance and to what it can be attributed.

What happens when multiple clients work in disconnected mode? Does this favor OWebSync over the other two approaches in the comparison? It would seem so.

Review F

Paper summary

The paper presents a toolkit for synchronizing state among instances of a web application. The state is synchronized by Merkle-tree comparison (a la rsync) rather than by communicating a log of updates. The authors argue that state sync works well when participants can go offline for long periods, which in the alternative scheme would suffer from log truncation.
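For concreteness, the comparison works roughly as follows (a minimal sketch under my own assumptions about the data layout; in the real protocol the two trees live on different machines and hashes are exchanged level by level over the network):

    // Sketch of Merkle-tree diffing: recurse only into subtrees whose
    // hashes differ, so traffic is proportional to the divergence, not
    // to the full state. Node shape {hash, children?} is assumed here.
    function diff(local, remote, path = [], changed = []) {
      if (local.hash === remote.hash) return changed; // identical: prune
      if (!local.children || !remote.children) {
        changed.push(path.join("."));                 // differing leaf value
        return changed;
      }
      const keys = new Set([...Object.keys(local.children),
                            ...Object.keys(remote.children)]);
      for (const key of keys) {
        const l = local.children[key], r = remote.children[key];
        if (!l || !r) changed.push([...path, key].join(".")); // added/removed
        else diff(l, r, [...path, key], changed);
      }
      return changed;
    }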

The state model must be a CRDT (such as last-writer wins), so that the toolkit can promise conflict freedom even when writes arrive very late.

The evaluation shows that the approach scales better to many clients than existing update-log-based frameworks.

Reasons to accept

I'd probably be willing to try this toolkit in an application.

The evaluation is against real frameworks.

What changes would make you value the paper higher?

The ideas in the paper aren't particularly novel.

I'm concerned that the conflict-free application model would be incompatible with many applications, and I can't tell where the boundary is.

Releasing the toolkit as open source might add value as a "tools paper".

Exciting or thought-provoking?

2. Somewhat, I learned a few things.

Overall merit

3. Weak accept

Detailed comments for authors

[Update after the PC discussion: consider submitting this paper to a tools-oriented venue, such as USENIX ATC.]

The ideas in the paper aren't particularly novel.

The paper repeatedly uses the term "fluent" as an essential benefit, but never defines or measures fluency.

Why is the browser relevant here? Is the main novelty the implementation of known ideas in JavaScript? As a toolkit, JavaScript is a nice place to deploy it, but is there something special about using these techniques in the browser?

The system is designed to handle the particular case of a participant coming online after being gone for a month, while accepting that participant's (old) updates without producing any user-mediated conflicts. The paper should dive a little deeper into the consequences of such a system and into which applications can tolerate it. That conflicts aren't brought to the user's attention might be a feature when the application is overlapping geometric shapes. But it's clearly less sensible for other applications, such as source code or collaboration on a legal contract, where the late-arriving writes may derive from information in the document that has since become obsolete.

p4 "The current data model is of course best suited for semi-structured data that is produced and edited by concurrent users, like the data items in the case studies: graphical templates, a set of tasks or used materials for a task. This data model is less suited for applications like online banking or pure text editing."

This paragraph is trying to address my concern above, but it doesn't clearly articulate the properties of "semi-structured". I could certainly imagine that "used materials for a task" could suffer from silent data loss in the same way as the examples I describe above. I'm uncomfortable with the model and the lack of clarity around its limits.

The paper should relate its object-synchronization framework to Terry's data replication work. A nice place to start is: "Cimbiosys: A Platform for Content-based Partial Replication", by Venugopalan Ramasubramanian, Thomas L. Rodeheffer, Douglas B. Terry, Meg Walraed-Sullivan, Ted Wobber, Catherine C. Marshall, and Amin Vahdat.

p4 "To join two ORMaps, the union of the respective observed and removed set is taken."

That implies that once a key is removed, it can never again be inserted, since it will forever appear in the removed set.

p4 "The conflict resolution of the ORMap boils down to an add-wins resolution."

No it doesn't; see previous remark.
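To spell out the objection: under a literal reading of "union of the respective observed and removed set", the join behaves like a two-phase set, where removal is permanent. The standard Observed-Remove construction avoids this by tagging every insert with a fresh unique identifier, so a remove only covers the tags it has observed. A sketch of the two readings (my illustration, not the paper's code):

    // Literal reading: plain key sets. Once a key is in `removed`,
    // lookup(k) is false forever -- re-insertion is impossible.
    function joinNaive(a, b) {
      const observed = new Set([...a.observed, ...b.observed]);
      const removed  = new Set([...a.removed,  ...b.removed]);
      const lookup = (k) => observed.has(k) && !removed.has(k);
      return { observed, removed, lookup };
    }

    // Observed-Remove reading: entries are tagged "key@uniqueTag", and
    // `removed` holds only the tagged entries seen at remove time. A later
    // re-insert carries a fresh tag that no remove has covered: add-wins.
    function joinOR(a, b) {
      const observed = new Set([...a.observed, ...b.observed]);
      const removed  = new Set([...a.removed,  ...b.removed]);
      const lookup = (k) =>
        [...observed].some((e) => e.startsWith(k + "@") && !removed.has(e));
      return { observed, removed, lookup };
    }

If the paper's ORMap uses tags in this way, the text should say so; as written, the description implies the naive variant.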

p5 "let d1 = await OWebSync.get("drawings.drawing1");
d1.object36.color = "#f00";"

What is the difference between levels of hierarchy above (drawings.drawing1) and below (.object36.color) the get call?

p6 "To solve this problem, the messages in one batch are processed in parallel."

In JavaScript!? Tell me more!
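Presumably (the paper doesn't say; this is my guess at what is meant) "parallel" refers to interleaved asynchronous processing on JavaScript's single thread, i.e. dispatching all handlers in a batch so their I/O overlaps, rather than true multi-threaded parallelism, which would require Web Workers:

    // Assumed meaning of "processed in parallel" in single-threaded JS:
    // start every handler without awaiting each in turn, so network and
    // storage I/O overlap, then wait for the whole batch to settle.
    async function processBatch(messages, handle) {
      const pending = messages.map((msg) => handle(msg));
      return Promise.all(pending);
    }

If that is the intent, the paper should say "concurrently" and explain what work actually overlaps.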

p8, Table 1: Why isn't this information presented graphically? This is a visually impenetrable matrix of raw data.

p7,9 "However, the update times for Yjs and shareDB increase significantly when going beyond 8 clients or when synchronizing 1000 objects."

Not ShareDB with 8 clients.

p9 "Even in the benchmark with 24 clients and 1000 objects, the used bandwidth is less then 800 kbit/s per client. This is much less than the available bandwidth which is on average 27 Mbit/s on a mobile network in the US [27]."

Tell me about your mobile carrier, please!

Figure 7 is a bit distressing, but this is also a somewhat exaggerated workload. Would batching perhaps help mitigate the bandwidth cost for real workloads with locality?

Comment @A1 by Reviewer F

PC meeting summary

The committee thought that an object synchronization tool would be useful. However, we found that the proposed system design does not bring much in the way of unexpected insights, and the evaluation does not provide great insight either.