1. Introduction
State Machine Replication (SMR) is a fundamental technique for providing available and consistent services in fault-tolerant distributed systems [1]. Imagine a system where multiple replicas of a state machine are maintained across different processes. To implement SMR, all correct processes execute the same set of requests in the same order, leading to identical states across all correct processes in the system. SMR ensures that even if some processes fail, the system remains consistent and can process requests issued by clients. SMR is foundational in the design and implementation of fault-tolerant distributed systems.
At the core of SMR is a consensus protocol that reaches an agreement on requests [2], even in the presence of failures such as crashed processes or network problems. Consensus protocols, also known as consensus algorithms, are the backbone of many modern distributed databases and other critical infrastructures, such as distributed file systems. Consensus serves as a middleware between applications and storage layers, thereby ensuring stable data management [3,4,5,6].
Figure 1 shows the role of consensus in distributed systems. Processes first receive application requests from clients and rely on a consensus protocol to place each request in a shared log. Once the order is determined, the log entries are translated into read and write operations on the underlying storage layer for persistent storage in databases.
Among existing consensus protocols, Paxos [7,8] and Raft [9] are widely used in industry. Paxos relies on a leader to make a proposal and coordinate consensus, while the other processes, usually called followers, negotiate and learn the proposal. This architecture provides foundational solutions to the consensus problem and has inspired several variations [10,11,12]. However, its complexity often makes it challenging for system developers to fully understand and implement in practical systems [9,13]. On the other hand, Raft, introduced by Ongaro and Ousterhout in 2014, is designed to be more understandable while retaining the desired properties, and it has become a popular alternative in recent years for building consensus-based distributed storage services such as Redis [14], CockroachDB [6], and etcd [15]. Raft retains the central role of a stable leader; even so, its implementation still involves many details that affect data consistency and demand careful attention, such as handling split-brain conditions.
The above two consensus protocols, along with others such as Viewstamped Replication [16], are designed in the partially synchronous network model, which relies on a known timing bound on message transmission. In this model, the safety property of strong consistency holds irrespective of network conditions, while liveness hinges on the network eventually behaving synchronously. In other words, these protocols guarantee that all the correct processes agree on the same order of commands, but they can make progress only while the network is partially synchronous. Protocols designed for the partially synchronous model usually rely on a special leader process [9,13]. Processes elect a single leader that coordinates consensus. When the leader is considered failed, the correct processes try to elect another process as the new leader, which then executes a fail-over protocol to recover the most recent state. Multi-leader [17,18] or leaderless [12,19] variants diverge from the above protocols by letting processes take turns acting as the leader for different consensus instances, or by ordering requests with conflicting semantics through a coordinator. However, these variants are still bound by the timing assumptions mentioned above and still need fail-over protocols. As a result, system developers must manually configure the relevant parameters to ensure the liveness of the system [15].
In contrast, asynchronous protocols [20,21] do not need to hold elections to decide who owns the leadership; in other words, they are leaderless. The absence of such a crucial process helps maintain liveness under asynchronous networks and simplifies the design and implementation of systems. For example, systems based on asynchronous consensus protocols run normally even when a minority of processes fail, and recovering processes can rejoin the consensus cluster without elaborate recovery protocols. It should be noted that, according to the FLP impossibility theorem [22], it is impossible to design a deterministic algorithm that ensures consensus among all the correct processes in the presence of even a single failed process in an asynchronous network. Therefore, all asynchronous consensus protocols need to incorporate randomization, which means that under some circumstances the termination of a consensus round is probabilistic rather than fixed. This may increase the delay in completing consensus on a specific request, which has caused asynchronous consensus protocols to be considered less efficient over the past few decades. Asynchronous consensus already has mature applications under the Byzantine Fault Tolerance (BFT) model [23,24], so asynchronous consensus in the Crash Fault Tolerance (CFT) model is worth exploring. Yet it was not until 2021 that Rabia [25] became the first asynchronous consensus protocol in recent years to fully design and implement SMR in the CFT model.
If we look back at Figure 1, we find that executing a read request does not change the state of the database storage. Therefore, a read request does not need to go through consensus and the state machine. The client can read directly from the processes, which reduces the latency and the resource consumption of running consensus instances, thereby achieving higher system throughput. In many systems, read operations vastly outnumber writes [26]; the performance of read operations can therefore dominate overall system performance. How to optimize read operations is thus a crucial topic, especially under read-heavy workloads.
However, it should be noted that although the read operation does not go through consensus, it is still necessary to ensure the linearizability [27] of the read and write operations: once a write operation is completed, each following read must see that state or some later state. In other words, every read must obtain the most recently updated value instead of a stale one.
Read optimization for leader-based consensus falls into two types: read lease [28,29] and read quorum [30,31]. However, the read lease method is not suitable for leaderless consensus protocols, including asynchronous consensus, because a leader process is generally required to serve as the leaseholder. Similarly, the existing quorum read method cannot be directly applied to asynchronous consensus either, because it cannot be ensured that each process runs the consensus instance in the same slot synchronously under asynchronous networks, which we discuss in detail in Section 3. To date, no read optimization method designed for asynchronous consensus has been proposed.
In the following sections, we first provide models of partially synchronous and asynchronous consensus protocols, then briefly introduce the current read optimization methods for partially synchronous consensus and analyze why they cannot be applied to existing asynchronous consensus. Next, we introduce ACQR (Asynchronous Consensus Quorum Read), a read optimization mechanism for asynchronous consensus protocols such as Rabia; to our knowledge, it is the first work to optimize read operations for asynchronous consensus. We implement and evaluate ACQR and compare its performance with the original implementation of Rabia.
In summary, the highlights of this work include the following:
- ACQR is the first proposed read optimization method for asynchronous consensus.
- Read operations can be completed simply by reading from the quorum without any special role.
- The implementation and evaluation of ACQR in Rabia, the state-of-the-art asynchronous consensus protocol. ACQR improves the throughput of Rabia by up to 1.7× and reduces the optimal latency by 40%.
2. Models
Firstly, we provide abstractions for the consensus problem. To provide fault-tolerant service to clients, processes are typically configured in a multi-machine set-up, presenting outwardly as a cluster. We consider a distributed system comprising a total of N processes. It is important to note that within this cluster, a maximum of f processes may experience failures, such as crash faults or hardware malfunctions. This redundancy is established to ensure consistent service delivery to clients, as each of the N processes in the cluster maintains identical data backups.
Suppose that the entire cluster has a fixed arrangement that satisfies N = 2f + 1. To clarify, the precise formulation should be N ≥ 2f + 1, which stipulates the minimum number of processes required in a cluster to maintain fault-tolerant consensus in the presence of f possible process failures. This threshold is pivotal for the integrity of the consensus process. In real-world distributed system deployments, N can indeed be any integer that satisfies this inequality. Conventionally, N is chosen to be equal to 2f + 1 in order to facilitate the design of the protocol and reduce redundancy: even under the failure of f processes, the consensus protocol can run effectively with the remaining f + 1 processes. Selecting a value of N that exceeds 2f + 1 could lead to excess resource allocation and operational costs. In practical process cluster configurations, especially those under the Crash Fault Tolerance (CFT) model in cloud computing, it is common to use a configuration of five processes [4] (accommodating the failure of up to two processes). This strikes a balance between cost-effectiveness and reliability, preventing unnecessary overhead while implementing fault tolerance.
So, for the convenience of discussion, we assume N = 2f + 1. In this case, any quorum will contain f + 1 processes. The concept of a quorum is integral to consensus in distributed systems, representing the minimum number of processes that must concur to form a consensus. In Crash Fault Tolerant (CFT) models of consensus, the requisite quorum size is typically ⌊N/2⌋ + 1, which, for N = 2f + 1, equates to f + 1. This threshold ensures that the agreement of f + 1 processes achieves a determined consensus, precluding the concurrent formation of a different consensus by any other quorum. Invoking the pigeonhole principle [32], if a system were to have two disjoint quorums within a cluster of N processes, the sum of their sizes would be at most N, contradicting the fact that two quorums together contain 2(f + 1) = 2f + 2 > N processes. Consequently, any two quorums must overlap in at least one process, thereby ensuring that any decision endorsed by a quorum is indeed validated by an adequate number of processes.
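The counting argument can be restated compactly (this is just the reasoning above, written out for two quorums Q1 and Q2 of size f + 1 in a cluster of N = 2f + 1 processes):

```latex
% If Q_1 and Q_2 were disjoint, their combined size could be at most N, yet
|Q_1| + |Q_2| \;=\; 2(f+1) \;=\; 2f + 2 \;>\; 2f + 1 \;=\; N,
% hence any two quorums intersect in at least one process:
|Q_1 \cap Q_2| \;\ge\; |Q_1| + |Q_2| - N \;=\; 1 .
```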
It is noteworthy that all the issues we discuss are set in the CFT model, so Byzantine faults do not occur. Communications between all the correct processes are reliable, meaning that any message transmitted between them will eventually be delivered.
The client sends a request with a unique identifier to the process cluster; the identifier can be constructed from a client ID and a timestamp. Consensus aims to allow each process to execute the client requests in the same order while satisfying the following properties:
- Safety: If two processes commit requests r1 and r2 at the same position, then r1 and r2 must be the same request.
- Liveness: Any request made by a correct client will eventually be delivered and executed by every correct process.
On top of this model, partially synchronous and asynchronous consensus each add their own timing assumptions.
As shown in Figure 2, partially synchronous consensus usually has a stable leader, and the other processes negotiate whether to accept the leader's proposal. Because the upper bound on message transmission delay in this type of network is known, each process can determine in time whether the consensus messages of each round arrive within the specified period, and thus all processes exhibit consistent behavior. In contrast, in the asynchronous consensus shown in Figure 3, processes exchange proposals with each other in separate consensus instances. If a process receives a majority of matching proposals, it votes "1" for approval; otherwise it votes "0" for opposition. The protocol then enters a randomized binary consensus stage, essentially the same as the randomized algorithm proposed by Ben-Or [33]: if a process collects enough "1" votes, it decides the proposal as the result of the consensus instance; if it cannot collect enough approving votes, it votes "?". If the process is still unsure in the next stage, it adopts a random "1" or "0" at the beginning of the next round from a global coin, which generates the same "0" or "1" value for all processes in a given round. A particular feature of this design is that if one process decides to output "1", other processes may not decide in the same round; however, as subsequent rounds proceed, every process eventually makes the same decision, and the probability of outputting "1" converges to 1. Therefore, asynchronous consensus does not advance through a coordinated action; instead, each process advances the consensus separately according to the votes it receives, and the design of the algorithm must be proven to guarantee the safety of the consensus result.
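To make this vote-driven structure more concrete, the following Go sketch shows how a single process could derive its votes from the messages it has collected. It is a minimal illustration of the description above, not Rabia's actual code; the function names, the two-stage split, and the toy coin are our own simplifications.

```go
package main

import "fmt"

// Votes exchanged in the randomized binary consensus stage.
const (
	VoteZero     = "0"
	VoteOne      = "1"
	VoteQuestion = "?"
)

// firstStageVote mirrors the first exchange described above: a process votes
// "1" only if more than half of the proposals it received are identical.
func firstStageVote(matchingProposals, n int) string {
	if matchingProposals > n/2 {
		return VoteOne
	}
	return VoteZero
}

// secondStage follows the Ben-Or-style rule: with at least f+1 "1" votes the
// proposal is decided; otherwise the process votes "?" for this stage.
func secondStage(oneVotes, f int) (decided bool, vote string) {
	if oneVotes >= f+1 {
		return true, VoteOne
	}
	return false, VoteQuestion
}

// nextRoundValue is used by a process that is still unsure after the "?"
// stage: it adopts the common coin, which yields the same "0"/"1" for every
// process in a given round.
func nextRoundValue(stillUnsure bool, commonCoin func(round int) string, round int) string {
	if stillUnsure {
		return commonCoin(round)
	}
	return VoteOne
}

func main() {
	coin := func(round int) string { // toy coin, for illustration only
		if round%2 == 0 {
			return VoteZero
		}
		return VoteOne
	}
	// N = 3, f = 1: two matching proposals out of three yield a "1" vote,
	// and collecting two "1" votes (>= f+1) is enough to decide.
	fmt.Println(firstStageVote(2, 3))          // "1"
	fmt.Println(secondStage(2, 1))             // true "1"
	fmt.Println(nextRoundValue(true, coin, 4)) // an unsure process follows the coin
}
```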
4. Details of ACQR
In this section, we give the details of the read optimization method for asynchronous consensus protocols, termed ACQR (Asynchronous Consensus Quorum Reads). The assumption of the ACQR algorithm is that client interactions with the replicated key-value database are characterized by read and write requests targeting single keys. Furthermore, we assume that asynchronous consensus is committed and completed in sequential slot order.
4.1. Algorithm Description
The pseudocode for ACQR is shown in Algorithms 1 and 2. The client broadcasts a request, represented as ⟨type, key, value, c⟩, to a quorum and waits for responses from all the members of the quorum. Depending on whether the request pertains to a read or a write operation on some value, we employ two distinct types, namely read and write. Each request encapsulates the specifics of the request, encompassing the request type along with the information of a single key. c designates the client ID, which serves as a means for the processes to identify the original sender of the request; this facilitates the delivery of results to the designated client. In the Crash Fault Tolerance (CFT) model, where processes do not engage in deceitful behavior, the client ID is both unique and accurate.
The processes deal with both read and write requests. Each key has two associated sequence numbers in each process, which we denote highSeq and commitSeq, both initialized to 0. These arrays help trace the order in which each key has been accessed: commitSeq is the highest sequence number of the slot in which requests on the key have been committed in the consensus phase, while highSeq records the slot most recently assigned to a write on the key. When a write request is received from a client, the asynchronous consensus, such as Rabia, produces a consensus instance to process it, updating the sequence number for the key that the request accesses. Upon receiving a read request from a client, the process waits until the commitSeq of the key reaches its highSeq. This ensures that the client's read request receives the latest committed data. Once the sequences match, the committed sequence and the result are returned to the client.
Upon receiving the responses, the client retrieves the result associated with the maximum commitSeq. By fetching the result with the highest sequence number, the client ensures it receives the most recent committed data modification.
Algorithm 1 Read Optimization in Asynchronous Consensus for client C.
1: broadcast ⟨read, key, C⟩ to a quorum and wait for all replies
2: return the value whose commitSeq is the maximum one

Algorithm 2 Read Optimization in Asynchronous Consensus for process P_i.
Init: highSeq[key] ← 0, commitSeq[key] ← 0 for every key
1: upon receiving write request ⟨write, key, value, c⟩ from client c
2:   highSeq[key] ← AsyncConsensus(⟨write, key, value, c⟩)
3: upon receiving read request ⟨read, key, c⟩ from client c
4:   s ← highSeq[key]
5:   wait until commitSeq[key] ≥ s
6:   send ⟨commitSeq[key], value[key]⟩ to c
7: upon ⟨write, key, value, c⟩ is committed in consensus instance s
8:   commitSeq[key] ← s; value[key] ← value
9: function AsyncConsensus(req)
10:   return a consensus slot assigned to req
11: end function
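To illustrate how Algorithm 2 could be realized, the Go sketch below keeps the per-key highSeq/commitSeq bookkeeping and blocks reads until the pending writes for that key have committed. It is a simplified, single-node sketch under the naming assumptions above (Process, keyState, slotOf, PickLatest are ours); the integration with Rabia's consensus module is abstracted behind the slotOf callback.

```go
package acqr

import "sync"

// keyState tracks, for one key, the highest consensus slot assigned to a
// write (highSeq) and the highest slot whose write has committed (commitSeq),
// together with the last committed value.
type keyState struct {
	highSeq   uint64
	commitSeq uint64
	value     []byte
	committed *sync.Cond
}

// Process is the per-replica state used by the read path.
type Process struct {
	mu   sync.Mutex
	keys map[string]*keyState
}

func NewProcess() *Process {
	return &Process{keys: make(map[string]*keyState)}
}

// state returns (creating if needed) the entry for a key; callers hold p.mu.
func (p *Process) state(key string) *keyState {
	ks, ok := p.keys[key]
	if !ok {
		ks = &keyState{committed: sync.NewCond(&p.mu)}
		p.keys[key] = ks
	}
	return ks
}

// OnWrite hands a write to the asynchronous consensus (abstracted by slotOf,
// standing in for AsyncConsensus in Algorithm 2) and records its slot.
func (p *Process) OnWrite(key string, value []byte, slotOf func(key string, value []byte) uint64) {
	slot := slotOf(key, value)
	p.mu.Lock()
	defer p.mu.Unlock()
	ks := p.state(key)
	if slot > ks.highSeq {
		ks.highSeq = slot
	}
}

// OnCommit is invoked when a write to key commits in consensus slot `slot`.
func (p *Process) OnCommit(key string, value []byte, slot uint64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	ks := p.state(key)
	if slot > ks.commitSeq {
		ks.commitSeq = slot
		ks.value = value
	}
	ks.committed.Broadcast()
}

// OnRead blocks until every write already assigned a slot for this key has
// committed, then returns the committed sequence number and value.
func (p *Process) OnRead(key string) (uint64, []byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	ks := p.state(key)
	target := ks.highSeq
	for ks.commitSeq < target {
		ks.committed.Wait()
	}
	return ks.commitSeq, ks.value
}

// PickLatest is the client-side rule from Algorithm 1: keep the quorum reply
// carrying the largest committed sequence number (assumes at least one reply).
func PickLatest(seqs []uint64, values [][]byte) []byte {
	if len(seqs) == 0 {
		return nil
	}
	best := 0
	for i := range seqs {
		if seqs[i] > seqs[best] {
			best = i
		}
	}
	return values[best]
}
```

A client would call OnRead on each member of a quorum and keep the reply chosen by PickLatest, matching the two lines of Algorithm 1.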
4.2. ACQR vs. Rabia
This subsection delineates the distinctions between ACQR and the previously proposed asynchronous consensus algorithm Rabia [25]. As mentioned in Section 1, by default, consensus algorithms like Rabia order all the read and write operations through the same (normal) path to ensure linearizability [27], guaranteeing that read operations initiated after new writes do not return stale results. In Rabia, both read and write operations need to be handled by consensus, coordinating all the processes to reach a consistent order for client requests. However, this can be unnecessary; reads of one key and writes to another do not conflict, allowing processes to satisfy linearizability without a consistent order for such requests.
The superiority of ACQR lies in its ability to ascertain the state of a key without subjecting read operations to the consensus process. Nevertheless, to uphold linearizability, ACQR mandates that read operations be informed of the most recent write operation's placement within the latest consensus slot. ACQR tailors the interaction of read operations with the Rabia consensus, bypassing the multi-process coordination, while conventional write operations still go through consensus following Rabia's standard procedure.
Furthermore, we compare the theoretical round complexity of both algorithms in Section 4.4 to elucidate the improvement of our algorithm more clearly.
4.3. Discussion on Correctness
In this section, we provide a concise discussion of the correctness of ACQR.
Lemma 1. Let R be a read request for a specific key k initiated at time t_r, and let W_c be a concurrent write request for the same key; assume no other unfinished write requests exist at that moment. Let W_p be a prior write request completed at time t_w (where t_w < t_r), with no other writes completed between t_w and t_r. Then R is guaranteed to read the value of either W_p or W_c.
Proof. For a single process handling R, there are two possible cases: first, the commit conditions for W_p are satisfied, hence the local commitSeq of key k corresponds to the consensus slot of W_p; second, the process is still terminating the consensus for W_p, and its commitSeq has not yet advanced to the slot of W_p. In either case, a quorum of processes has a highSeq for k that has reached the slot number of W_p, as the termination of consensus for W_p requires the participation of a majority of processes. Consequently, R can at least ensure that the largest commitSeq returned by the quorum will not be smaller than the slot of any write request completed prior to t_r. As for reading the value written by the concurrent request W_c, it depends on whether the quorum contacted by the read contains processes that are reaching consensus for W_c. If a majority of processes have begun the consensus for W_c, then, since their highSeq will be greater than the slot number of W_p, the processing of R will wait and return the result of W_c upon the completion of its consensus.
To summarize, R will read the value of the most recently completed write request W_p or of the concurrent write request W_c, instead of a stale value. □
Lemma 2. If R_1 and R_2 are read requests for the same key k, with R_1 completed at t_1 and R_2 starting at t_2 (satisfying t_1 < t_2), then R_2 will not read a value older than the one read by R_1.
Proof. According to Lemma 1, if R_1 has read the value of the most recently completed write W_p, then at that moment the commitSeq of a quorum of processes has reached the slot number of W_p, ensuring that any subsequent read from a quorum will obtain a commitSeq equal to or greater than the slot number of W_p, thereby preventing R_2 from reading a write older than W_p. Alternatively, if R_1 reads a value from a concurrent write request W_c, Lemma 1 indicates that the completion of R_1 must occur after W_c's commit. In this case, the largest commitSeq of a majority of processes is at least aligned with W_c's slot number, so R_2 will read the value of W_c or of an even more recent write; R_1 and R_2 may then obtain different results, with R_2 accessing a value from a more recently ongoing write operation. □
Combining these two lemmas, we can show that the design of ACQR satisfies the linearizability of read and write operations.
4.4. Performance Analysis
We analyze the complexity of Rabia and our proposed ACQR solution to illustrate that ACQR can achieve better performance than Rabia for read requests.
Table 1 compares the number of rounds for the two methods. A simple analysis is as follows: in Rabia, when a client sends a read or write request to any process in the cluster, the process that receives the request begins the asynchronous consensus stage. As described in the Rabia paper [25], the expected number of rounds for the randomized consensus is five. Adding the two rounds of communication between the client and the process cluster, Rabia's expected number of rounds for processing a request is 7, regardless of whether it is a read or a write.
As for ACQR, in the best case where all the requests are read operations, no regular consensus stage between processes is needed, so only the two rounds of request–response message exchange between clients and processes are required.
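The entries in Table 1 thus follow from simple addition, restating the counts given above (five expected randomized-consensus rounds plus two client–cluster rounds for Rabia, versus the quorum request and response alone for an ACQR read):

```latex
E[\mathrm{rounds}_{\text{Rabia, read or write}}] = 2 + 5 = 7,
\qquad
\mathrm{rounds}_{\text{ACQR read (best case)}} = 2 .
```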
5. Evaluation
5.1. Implementation
We implement and evaluate the ACQR protocol in Golang 1.13 using the open-source framework of Rabia [25], so that we can compare the performance of the two methods with minimal change to the original implementation.
In the implementation of many modules, we do not need to modify Rabia's code. For instance, in the communication module, the interactions between clients and processes are all facilitated through established TCP connections. Variables such as flag fields (e.g., request IDs) are defined within the protocol buffers (proto) file. Our modifications are twofold: firstly, for read operations, clients establish TCP connections with a majority of the processes and broadcast their requests; secondly, for write operations, we retain Rabia's implementation, where the client initially broadcasts to the designated process, which then proposes the request, triggering the consensus stage.
5.2. Experimental Setup
We evaluate our ACQR prototype on the AWS EC2 platform, with Rabia and ACQR clients and processes deployed on t3.2xlarge instances with 8 vCPUs, 32 GB RAM, and 5 Gbps network burst bandwidth, running Ubuntu 20.04.3 LTS. The workload generated for our benchmark is a simple key-value store implementation containing 1000 distinct objects. Each operation either updates (writes) or reads a 10-byte value. These two kinds of operations are generated with equal probability (i.e., 50% reads and 50% writes).
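For concreteness, a workload with these characteristics could be generated as in the following Go sketch. The constants mirror the setup above, while the function and type names are ours rather than part of Rabia's benchmark code.

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	numKeys   = 1000 // distinct objects in the key-value store
	valueSize = 10   // bytes written per update
	readRatio = 0.5  // 50% reads, 50% writes
)

type op struct {
	isRead bool
	key    string
	value  []byte
}

// nextOp draws one operation: a uniformly chosen key, read or write with
// equal probability, and a fixed-size random value for writes.
func nextOp(rng *rand.Rand) op {
	o := op{
		isRead: rng.Float64() < readRatio,
		key:    fmt.Sprintf("key-%d", rng.Intn(numKeys)),
	}
	if !o.isRead {
		o.value = make([]byte, valueSize)
		rng.Read(o.value)
	}
	return o
}

func main() {
	rng := rand.New(rand.NewSource(42))
	for i := 0; i < 5; i++ {
		fmt.Printf("%+v\n", nextOp(rng))
	}
}
```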
In the evaluations of Rabia, each client sends its requests to a designated process (the one located in the same LAN for the geo-replication test), which then proposes them in certain consensus instances. For ACQR, the clients send their requests to a fixed quorum of processes.
We conduct each test over 1 min and compute the median and 99th-percentile client-perceived latency (in ms) experienced during that interval. We also compute the throughput (req/s) during each experiment.
5.3. Latency
Initially, we assess both protocols in the context of wide-area replication. For the N = 3 configuration, the processes and clients are situated in California, Tokyo, and Sydney; for the N = 5 scenario, we additionally incorporate locations in Virginia and Frankfurt. At each site, 10 clients are co-located with each process. Each client directs its requests to the process within the same LAN. We do not use pipelining mechanisms for wide-area network testing. We carry out closed-loop testing under the above-mentioned conditions.
As illustrated in Figure 4 and Figure 5, in both scenarios, read optimization reduces the median latency to 60% of Rabia's. Readers may notice that Rabia's median latency exceeds 1000 milliseconds, which is significantly high under wide-area network conditions. A brief explanation is as follows: in Rabia, pipelining optimization is not enabled, meaning that only one request is processed at a time, and the processing of other requests must wait until the consensus for the current request completes. Under such conditions, since each write operation still needs to be encapsulated as a consensus instance, and Rabia consensus within a wide-area network requires multiple rounds of communication, the latency to complete a write operation is on the level of hundreds of milliseconds. Moreover, the current write request also hinders the progress of subsequent write requests, as the server side can handle only one consensus instance at a time. As a result, the median latency of Rabia in a wide-area network exceeds 1000 milliseconds. Of course, the Rabia work itself mentions that its design makes it more suitable for consensus within a single data center, so such performance is understandable. What we primarily aim to show is that this kind of read optimization indeed significantly lowers the median latency within the wide-area network, and if the proportion of write operations continues to decrease, the performance improvement becomes even more noticeable.
5.4. Throughput
Next, we configure Rabia and ACQR to measure their peak throughput. We conduct two sets of experiments within a data center in Ohio, deploying configurations of three clients with three processes and five clients with five processes, respectively. For a fair comparison, we employ the same batching strategy for both protocols: each client sends a batch of a fixed number of requests to the designated processes, and each process collects all the requests sent by clients until the batch is full or a timeout occurs, at which point the process proposes. We adopt the parameter values that optimize Rabia's performance: a client-side batch size of 1000 and a process-side batch size of 10. If the required batch size is unmet, the system uses a 2 ms timeout to batch the requests.
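A minimal Go sketch of this process-side batching rule is shown below. The batch size and timeout follow the values above; the channel-based structure and the names are our illustration, not Rabia's implementation.

```go
package main

import "time"

type request struct{ payload []byte }

// collectBatch gathers client requests until the batch is full or the
// timeout fires, mirroring the "batch of 10 or 2 ms" proposal rule above.
func collectBatch(in <-chan request, size int, timeout time.Duration) []request {
	batch := make([]request, 0, size)
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for len(batch) < size {
		select {
		case req, ok := <-in:
			if !ok {
				return batch // input closed: propose whatever was collected
			}
			batch = append(batch, req)
		case <-timer.C:
			return batch // timeout: propose a partial batch
		}
	}
	return batch
}

func main() {
	in := make(chan request, 16)
	for i := 0; i < 4; i++ {
		in <- request{payload: []byte{byte(i)}}
	}
	batch := collectBatch(in, 10, 2*time.Millisecond)
	_ = batch // with only 4 pending requests, the 2 ms timeout closes the batch
}
```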
Figure 6 and Figure 7 show the throughput and latency numbers as the number of concurrent closed-loop clients increases (3–650 for N = 3, 5–1000 for N = 5). It is evident that a substantial portion of read operations are expedited by bypassing consensus, resulting in overall lower latency and higher peak throughput for the system.
When N = 5, due to Rabia's quadratic message complexity, Rabia's throughput decreases, while ACQR, having a moderate proportion of read operations, is not severely impacted by the message complexity.
5.5. Varying Read–Write Ratio
We configure workloads with varying read–write ratios and deploy three clients and three processes within a data center in Ohio. As in the peak throughput tests above, we examine the peak throughput of Rabia and ACQR under these conditions by increasing the number of clients.
As shown in Figure 8, we observe that as the proportion of read operations increases, the performance of ACQR improves, with a maximum enhancement of up to 1.75 times. We had expected the performance improvement to be even more significant, since in a scenario of 100% read operations a read can be completed in one round-trip time (1 RTT) without going through any consensus instance. A potential reason why it is not is that a client needs to establish connections with multiple processes, thus consuming a certain amount of network resources. Additionally, upon reaching peak throughput, the CPU utilization on the process side exceeded 95%, indicating that the process side's capacity for handling network requests might be a limiting factor, preventing a further increase in the performance improvement.
5.6. Latency with High-Contention Workloads
To evaluate the impact of hot-spot data access on algorithm performance, an experimental setup with N = 3 is employed, entailing the deployment of three processes within a local area network in Ohio. Under equal read–write ratios, we vary the number of accessed keys, which represent the objects of read–write operations; a smaller number of keys indicates higher contention for hot-spot data. As Figure 9 shows, Rabia's performance is unaffected by hot-spot data contention. This is attributed to its design, where all the read–write accesses to hot-spot data undergo consensus to establish a consistent order before execution. In contrast, while ACQR optimizes read operations in high-contention scenarios, determining the order of a read operation that follows write operations still requires those write operations to be executed, potentially prolonging waiting times. However, even in this worst-case scenario, ACQR outperforms Rabia. Moreover, in settings with lower contention, ACQR consistently demonstrates superior performance.
6. Conclusions
We propose and implement a read optimization method, ACQR, for asynchronous consensus protocols. This method enhances throughput and latency in both single-data-center and cross-data-center scenarios. Through the evaluation, ACQR proves its ability to perform well under read-intensive workloads.
We also highlight the adaptability of ACQR under various workloads and scenarios, making it a general solution that can be integrated with other asynchronous consensus-based protocols. Future work could focus on refining ACQR and exploring its applicability in more diverse and demanding environments.
Additionally, the potential application of ACQR in real-world scenarios such as cloud computing and big data processing is promising. With the development of asynchronous consensus represented by Rabia, we hope that ACQR can be applied in the industry to improve the performance of asynchronous consensus.