Abstract

With the development of mobile technology, mobile virtual worlds have attracted massive numbers of users. To improve scalability, peer-to-peer (P2P) architectures allow a virtual world to accommodate more users without increasing hardware investment. In mobile settings, however, existing P2P solutions are not applicable due to the unreliability of mobile devices and the instability of mobile networks. To address this issue, a novel infrastructure model, called Virtual Net, is proposed to provide fault-tolerance in managing user content and object state. In this paper, the key problem, namely, object state update, is resolved to maintain state consistency and high interaction responsiveness. This work is an important step toward implementing a scalable mobile virtual world.

1. Introduction

Virtual worlds, including multiplayer online games and virtual social worlds, allow users to inhabit virtual environments, create their own content, and interact with each other. Mobile virtual worlds allow users to access these simulated environments through mobile devices, making it possible to play anywhere. Driven by the development of mobile devices, mobile virtual worlds have attracted a large number of users and become an important market and revenue source for the game industry. For example, in 2018, Fortnite earned $1,996,917 in gross daily revenue [1] and reported 3.4 million concurrent players [2]. The success and expansion of mobile virtual worlds raise new challenges in infrastructure development; one of them is the scalability problem. In virtual worlds, interaction is implemented by sending events to servers for processing and receiving updates from the servers for rendering and state synchronization. As the number of concurrent online users increases, more computing load is imposed on game infrastructures. Servers have to process and respond to more client requests within a short period to maintain high responsiveness. Network bandwidth consumption also increases, since multiple game states must be packed into each update. Scaling therefore requires investing more computing resources; otherwise, user experience will suffer.

Peer-to-peer (P2P) virtual worlds, first introduced in [3], explore the possibility of running a virtual world without a central server. In P2P virtual worlds, user devices run both the client program and the server program for event handling and state update. Thus, computing resources naturally scale with the user population. Mobile applications, however, differ from their desktop counterparts in important ways. One outstanding issue is client failure. Compared to desktop PCs, mobile devices are more prone to failure, due to, for example, battery depletion or application conflicts. Moreover, access to mobile networks, such as MANETs and VANETs, is also unstable. Client unreliability may cause content loss or state inconsistency if user content and object state are not properly saved or backed up before a failure. Yet, existing P2P virtual worlds do not address the peer device unreliability problem [4]. Thus, they cannot be directly applied in mobile settings.

In this paper, a Virtual Net model is proposed to address the client unreliability problem for mobile P2P virtual worlds. The model adopts the cloud-fog structure but is totally decentralized. To avoid content loss, the cloud layer stores user content for content persistency. The fog layer caches object states for client recovery and maintains state consistency. The separation of content storage and state caching improves responsiveness, since operations performed directly on P2P storage incur more communication overhead [5]. On top of the P2P content storage, a content addressing scheme is devised to facilitate content integrity checks.

To avoid reinventing the wheel, this paper mainly focuses on the state update problem to maintain object state consistency. At the fog layer, object states are replicated on several nodes for fault-tolerance. Thus, all replicas must maintain the same state in event handling so that interaction can be performed within a consistent shared environment. Yet, the requirement of high responsiveness in virtual world interaction makes the problem difficult. To attack this difficulty, an opportunistic approach, called fast event delivery, is proposed. Based on this approach, a virtual world interaction model is then designed. In short, the main contributions of the paper are as follows.
(1) A new P2P cloud-fog structure, called the Virtual Net model, is proposed to resolve the client unreliability problem, providing fault-tolerance in playing a mobile virtual world.
(2) A fast event delivery approach is proposed to maintain both replica state consistency and high responsiveness in the process of handling user events.
(3) A new virtual world interaction model is designed to achieve game state consistency and high responsiveness when interacting with different neighbors.

The remainder of the paper is organized as follows. The related works are introduced in Section 2. The overall Virtual Net model is described in Section 3. Section 4 studies the state update problem in detail. Based on the solution of the problem, the virtual world interaction model is provided in Section 5 with neighbor change management. The correctness of the solution is proved in Section 6. Sections 7 and 8 evaluate the performance through theoretical analysis and experiments. Section 9 concludes the paper.

2. Related Work

Mobile P2P virtual worlds combine the characteristics of the mobile virtual world and P2P virtual world problems. Due to the lack of study in this combined field, the related work in P2P virtual worlds and cloud-fog mobile applications is surveyed to delineate the distinct characteristics of the combined problem.

2.1. P2P Virtual Worlds

P2P MMORPGs and P2P virtual environments have been amply surveyed in [4, 6]. Previous works mainly focus on inter-player consistency management, including peer connectivity, interest management, event dissemination, and cheat prevention. Peer connectivity [7] studies the connection of all user devices within an overlay network such that any peer can be reached from another peer. Interest management [8] restricts the range of message receipt to reduce communication overhead in state update. Event dissemination [9] reduces the number of communication channels on event senders to avoid overwhelming them in hotspot areas. Cheat prevention [10] is needed to achieve fairness without arbitration from a central server. All these works assume a desktop environment in which a client is always reliable in storage and connection. In contrast, Virtual Net targets mobile environments in which both devices and connections are unreliable, a new problem orthogonal to the above studies. Thus, a complete implementation of Virtual Net can employ existing P2P solutions for inter-player consistency management, such as peer connectivity and interest management, to avoid reinventing the wheel.

Early work on P2P state persistency is related to the content storage in this work. State persistency studies the reliable storage and efficient retrieval of user state [11]. Each time a state is updated, it has to be persisted in the overlay network, and the state has to be queried from the overlay network when the client recovers from a failure. As argued above, the work in [11] assumes a reliable client, which is not applicable in a mobile setting. The Virtual Net model not only solves the unreliable client problem but also reduces storage and retrieval overhead through content caching. Moreover, Virtual Net includes a content integrity check, which is not considered in previous works.

2.2. Fog Computing

First introduced in [12], cloud gaming moves game engine functions to the cloud to simplify development, distribution, access, and update [13]. However, the measurement study in [14] shows that the current cloud gaming infrastructure is unable to meet the latency requirements of end-users distant from data centers. To improve latency, fog computing [15] has been introduced to move time-critical functions to locations near clients. Fog computing has been widely discussed in both the Internet of Things (IoT) [16] and mobile computing [17] to offload server burden [18], enable location awareness, and provide real-time interaction. Among its many applications, mobile gaming [19] and mobile reality [20] are two important examples. Similar to the cloud-fog structure, the Virtual Net solution also employs the cloud layer for content storage and the fog layer for latency improvement. Virtual Net, however, explores a totally decentralized solution, with no central control at the cloud layer.

3. Virtual Net Model

The proposed Virtual Net structure is based on the commonly used three-layer structure shown in Figure 1. Similar to some existing cloud-fog structures, it is divided into three layers: the cloud layer (L1), the fog layer (L2), and the client layer (L3). The cloud layer provides the persistency service, which stores the files of user content and the states of virtual objects (avatars, accessories, achievements, etc.). The fog layer caches the object states in play and provides state recovery for clients in case of short-term failure. It also periodically checks object states and saves them to the cloud layer for state persistency, asynchronously to event handling. When a user leaves a game, the cached state of the user's objects is saved to the cloud layer. The client layer provides user interfaces for receiving user operations and displaying updated states for user interaction. Virtual worlds are latency-sensitive applications: on one hand, clients require fast state updates in user interaction [21]; on the other hand, the complexity of peer-to-peer routing slows down content storage and retrieval [22]. Thus, the fog layer is placed between L1 and L3 to improve responsiveness while providing fault-tolerance.

The three-layer architecture is resilient. First, L1 and L2 can be scaled individually without affecting each other, since they are built for different purposes. The cloud layer focuses on the long-term storage of user content, which is only accessed at user login, user logout, and the periodic state checkpoints performed by the fog nodes. The fog layer, on the other hand, maintains the latest state of user content and provides state recovery from intermittent client failures. Except for state initialization and checkpointing, L1 and L2 do not need to interact with each other. Besides, the model provides a degree of failure isolation: the failure of one layer can be recovered by another layer, since each layer has a separate copy of the content.

Different from the existing cloud-fog computing paradigm, the computing resources in the cloud and fog layers are P2P nodes, as in BitTorrent or eDonkey. Specifically, users contribute part of the computing resources of their devices, which can be smartphones, laptops, desktop PCs, or even servers. A device is divided into one or several virtual nodes [23] for fine-grained load balancing. All virtual nodes are managed in a node pool. For different computing purposes, there are two types of virtual nodes: storage nodes and cache nodes. The storage nodes construct the cloud layer, and the cache nodes construct the fog layer. Thus, Virtual Net is a decentralized computing paradigm. A client can reside on the same device as a virtual node, as in BitTorrent, or on a separate lightweight device.

3.1. P2P Cloud Layer

Object files are stored on the cloud layer through P2P file storage. Based on the file storage system, a content addressing scheme is devised, which not only provides flexibility in content identification and addressing but also supports integrity checking in object management.

3.1.1. File Storage

The TotalRecall [5] storage architecture is applied to manage the storage nodes for file storage. The details of the design and performance can be found in [5]. Here, only the overall mechanism is introduced. In TotalRecall, each node is assigned a unique hash code as the node ID. Also, each file has a file ID which is the hash checksum of the file. When a new file is created, the file is associated with a storage node, called the master node whose ID is closest to the file ID. Other nodes hosting the data of the file are called host nodes. Master nodes manage the location of host nodes and the version control for the associated files. Each storage node can be the master node for some files and the host node for other files. Thus, the entire storage node network forms a distributed hash table (DHT) for file lookup. To request a file, its master node is found first with the file ID. Then, based on the reply from the master node, the host nodes are located and the file can be retrieved (or reconstructed).
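As a rough illustration of this lookup scheme, the following Python sketch locates a master node by ID closeness. The hash function, the flat node list, and the absolute-distance metric are simplifying assumptions for illustration; the real system routes over a DHT overlay.

import hashlib

def hash_id(data: bytes) -> int:
    # A 160-bit identifier, used for both node IDs and file IDs.
    return int(hashlib.sha1(data).hexdigest(), 16)

def master_node(file_id: int, node_ids: list) -> int:
    # The master node is the node whose ID is closest to the file ID.
    # (A real DHT measures distance on the overlay's ID ring; absolute
    # distance over a flat node list is a simplification.)
    return min(node_ids, key=lambda nid: abs(nid - file_id))

nodes = [hash_id(("node-%d" % i).encode()) for i in range(8)]
file_id = hash_id(b"avatar-texture-v1")
print(hex(master_node(file_id, nodes)))  # the master then points to the host nodes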

3.1.2. Content Addressing

To retrieve the objects from the cloud layer, object content needs to be identified and addressed. A hierarchical content addressing scheme is devised, which can facilitate content integrity check. The devised content addressing scheme has four hierarchies: inventory, objects, components, and files, as illustrated in Figure 2.

Inventory-level: each user has an inventory file, identified by the inventory ID, which is the hash code of the user ID. An inventory contains all the object descriptions, consistently managing content identification and modification. Thus, to retrieve object contents, the inventory file needs to be retrieved first.
Object-level: an object is identified by the object hash code and composed of one or multiple components.
Component-level: each component is identified by the component hash code. Object components are the categories of object resource files, which are classified into animation, sound, texture, script, etc.
File-level: the actual files of objects are addressed by file IDs in object descriptions. Through file IDs, the actual file can be retrieved either from the local cache or from the DHT of the cloud storage.

Based on the structure of the content addressing scheme, a Merkle tree [24] (Figure 2) can be hierarchically constructed from the file hash codes, component hash codes, object hash codes, and inventory hash code. With the Merkle tree, the integrity of user content can be checked recursively, and the number of hash comparisons can be largely reduced [25]. Typically, a client caches more than 500,000 files of user-created content [26]. Thus, an exhaustive search for updated files would be inefficient. We conducted an experiment with 200 objects and more than 5,000 files. Compared with the file-level and object-level content integrity checks [26], Figure 3 shows that the proposed four-level content integrity verification requires fewer hash comparisons, especially when only a small number of files change.
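The following sketch illustrates the top-down pruning idea behind the hierarchical check. The dict-based tree layout and hash choices are illustrative assumptions, not the paper's data structures, and both trees are assumed to have identical key structure.

import hashlib

def node_hash(node) -> str:
    # Leaves are file hash strings; inner nodes hash their children's hashes,
    # mirroring the file/component/object/inventory hierarchy.
    if isinstance(node, str):
        return node
    children = "".join(node_hash(v) for _, v in sorted(node.items()))
    return hashlib.sha256(children.encode()).hexdigest()

def changed_files(local, remote, path=""):
    # Compare top-down and prune: an unchanged subtree costs one comparison,
    # so only the branches that actually changed are descended into.
    if node_hash(local) == node_hash(remote):
        return []
    if isinstance(local, str):
        return [path]
    return [p for k in local
              for p in changed_files(local[k], remote[k], path + "/" + k)]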

3.2. P2P Fog Layer

The fog layer is added between the cloud layer and the client layer to mask the latency of content storage and meanwhile provide fault-tolerance. From the user perspective, each user is allocated some cache nodes while he/she is playing in a virtual world. These cache nodes provide the user with computing resources, forming a logical computing unit. We call it a mesh computer, as illustrated in Figure 4. When a user logs into the system, his/her client first initializes the mesh computer by requesting cache nodes from the node pool. The cache nodes then retrieve the content from the cloud layer. The client also retrieves the saved content from the cloud layer for content rendering and state synchronization. When the mesh computer receives a quit instruction from the client or the client experiences a long-term failure, the mesh computer releases its cache nodes back to the node pool. Optimal resource allocation and cost minimization have been studied in [23] and are out of the scope of this paper.

Due to the unreliability of P2P nodes, they are subject to either temporary or permanent failure. Thus, for reliability purposes, a mesh computer maintains multiple cache nodes holding replicas of the same user content, called a replica group. Content is transferred from failed nodes to live nodes. Replica group management has been studied in our previous work [27]. This paper focuses on replica state management in the following sections.

4. Object State Update

At the fog layer, it is important that all replicas of the same group maintain the same state of user objects so that the failure of any replica will not invalidate a user's current state. The problem is challenging, since replicas may receive different sets of concurrent events from different senders, and events may be received in different orders. The state machine replication (SMR) [28] approach is adopted to manage object state. SMR is a fault-tolerance model that replicates a deterministic finite state machine on a set of distributed nodes, each of which has the same input, output, and state transitions. In an asynchronous cycle, clients first send requests to all the nodes. On receiving the requests, a consensus protocol is triggered to determine the sequence of requests. Then, all nodes process the requests in the decided sequence so that they reach the same new state. To reduce communication overhead, one replica is elected as the leader, coordinating membership reconfiguration and request ordering.
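A minimal illustration of the SMR principle is sketched below; the position-update state machine is an invented example, not the paper's event format. It shows that replicas applying the same totally ordered event sequence with a deterministic handler reach the same state.

class Replica:
    def __init__(self):
        self.state = {"x": 0, "y": 0}  # deterministic object state

    def apply(self, event):
        dx, dy = event                 # same input, same state transition
        self.state["x"] += dx
        self.state["y"] += dy

ordered_events = [(1, 0), (0, 2), (-1, 1)]  # order agreed on via consensus
replicas = [Replica() for _ in range(3)]
for r in replicas:
    for e in ordered_events:
        r.apply(e)
assert all(r.state == replicas[0].state for r in replicas)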

In virtual worlds, however, the consensus process adds a large delay to user interaction, because a requested event must be agreed on by all replicas, requiring at least two communication rounds (i.e., four communication steps) [29], before it can be handled and replied to clients. The interaction delay issue in event handling is addressed based on the following observation: due to users' limited perception range and motion speed, the set of event senders within a small period is fixed. Thus, the number of concurrent event senders within the period can be known a priori. Based on this observation, we propose a fast event delivery approach.

4.1. Fast Event Delivery

Fast event delivery allows a replica to directly deliver a received event through a cycle-event mapping if it can ensure that the same event will eventually be delivered by all replicas. Specifically, the timeline is divided into an infinite sequence of cycles of length ∆t. From cycle c0, an event sender s periodically broadcasts an event to the replicas in each cycle. Each event is identified by the sender ID and a sequence number. The sequence number of the first event, sent at cycle c0, is 0. If there is no operation, s just broadcasts a no-op event. At the receiving end, c0 is also known to all replicas. For cycle c, each replica delivers the event with sequence number c − c0 from s, which is called the event of cycle c. Events for cycle c from different senders are ordered by sender ID. If a replica does not receive the event for cycle c from s, it starts an instance of consensus for the cycle. In the consensus, if a replica has received the event for cycle c, that event will be decided by the leader and delivered by all replicas. Otherwise, they will decide and deliver an empty event for cycle c. Events are first delivered to a queue Qd and then sent from the queue to the application for handling in sequence.
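The receiving-side mapping can be sketched as follows. This is a minimal sketch; the data layout and function names are illustrative assumptions, and the consensus fallback is only marked, not implemented.

def expected_seq(c: int, c0: int) -> int:
    # The event of cycle c from a sender that started at cycle c0
    # carries sequence number c - c0.
    return c - c0

def deliver_cycle(c, c0, received, q_d):
    # received: {sender_id: {seq: event}}; same-cycle events ordered by sender ID.
    for sender in sorted(received):
        seq = expected_seq(c, c0[sender])
        if seq in received[sender]:
            q_d.append((c, sender, received[sender][seq]))
        else:
            return False  # missing cycle event: fall back to consensus for c
    return True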

The relations among cycle (ci), sender ID, event sequence number, delivery queue (Qd), and delivery sequence (λ) are illustrated in Figure 5. Specifically, in Qd, the subscript of an event e denotes the event sequence number, which is equal to the event sending cycle. For the same cycle, the events in Qd are sorted by sender ID and mapped to local index numbers in Qd (i.e., the second member in the tuples of Qd). λ represents the global index of events delivered to the application, which will be introduced in Section 4.1.4.

Figure 6 illustrates the fast event delivery process with one sender s and three replicas r1, r2, and r3 in the replica group g. s broadcast four events, e1, e2, e3, and e4, at cycles c1, c2, c3, and c4 to g. All replicas received e1 at cycle c2 and e4 at c5. Only replica r1 received e2 at c3. No replica received e3 at c4, but r1 and r2 received e3 at c5. r1, r2, and r3 deliver e1 for c1. They then collectively decide e2 for c3 and an empty event for c4 through consensus. The problem is then how to handle e3 and e4 at c5. According to the cycle-event mapping principle (i.e., one cycle, one event), the replicas should only deliver e4 for c5 and discard e3, leading to event loss. Before discussing the late event handling problem in detail, the settings and assumptions of the system are introduced first.

4.1.1. Settings and Assumptions

For fault-tolerance, a replica group contains at least n replicas. The minimal group size n is determined by the content availability requirement [27] and the replica failure rate. To reduce replication overhead, each group also has an extra number e of nodes for lazy repair [5]. Once e + 1 replicas fail, new replicas will be added to recover the group size to n + e.

In each replica group, there is one non-replica node monitoring the state of all replicas, called the Rendezvous [30]. A Rendezvous uses timeouts to determine the state of replicas and then broadcasts their states to all replicas. Replica state monitoring is implemented by exchanging heartbeat messages between a replica and the Rendezvous. If the Rendezvous does not receive a heartbeat message within a cycle, the replica is treated as failed and removed from the group. New replicas are also added by the Rendezvous once the group size is smaller than n. Rendezvouses are reliable nodes, also called super-peers [19], since the existence of a group is determined by its Rendezvous. Once a Rendezvous fails, a new Rendezvous must be assigned to the replica group, which then rebuilds the replica group and recovers the object states from the cloud storage. By exchanging heartbeat messages, each replica learns the current membership of its group g, denoted by G, which contains all live replicas of group g. When the Rendezvous tells a replica that a member has failed or a new member has been added, the replica removes the member from G or adds the member into G.

The system is assumed to be live. The SMR model contains three types of group-wide activities: leader election, group reconfiguration, and consensus. The liveness assumption ensures that when an activity is needed, it will eventually succeed after a finite number of failures.

Each replica group maintains a set of event senders. It is assumed that each event sender has a globally unique ID and that sender IDs are comparable. Moreover, a replica group appends the join timestamp to sender IDs to distinguish two different joins of the same sender in the sender set.

Let s be the ID of an event sender, c be a cycle number of a replica group g, and ri be the ith replica in group g. The important relations among events, event senders, event sequence numbers, and cycle numbers, together with the other notations used throughout the subsequent sections, are listed in Table 1. Besides, event names are capitalized.

4.1.2. Late Event Handling

An event is late if the event of cycle c is received after c on all replicas. Formally, a late event e satisfies Seq(e) = Seq(s, c) ∧ ReceiveCycle(e) > c on some replica ri ∈ G, and Seq(e) = Seq(s, c) ∧ (ReceiveCycle(e) > c ∨ ReceiveCycle(e) = ⊥) on every other replica rj ∈ G. For example, e3 in Figure 6 is a late event.

To ensure agreement on cycle event delivery among all replicas, a late event can simply be discarded, since any event can be re-sent by a client with a new sequence number if the client does not receive the reply for the event within a period. However, if a sender's clock is temporarily out-of-sync with the replicas' clocks or a large sending delay is experienced, a large number of events could be discarded and need to be re-sent, as shown in Figure 7.

To address the late event handling problem, a dynamic cycle event delivery approach is proposed, which includes two conditions for late event delivery. The purpose of the approach is to minimize the number of event discards while allowing each replica to decide the delivery of late events with only local information. Below are the conditions of late event delivery; in short, only late and out-of-order events are discarded.
(1) At cycle c, all undelivered events from sender s with sequence number no greater than Seq(s, c) are deliverable. Formally, ∀e(s, j), MinSeq(s, c) ≤ j ≤ Seq(s, c) ⇒ e(s, j) is deliverable at c.
(2) At cycle c, an event is nondeliverable if one of its subsequent events has been delivered before c. Such an event is a late and out-of-order event. Formally, ∀e(s, j), j < MinSeq(s, c) ⇒ e(s, j) is nondeliverable at c.

To implement the dynamic cycle event delivery approach, the lowest deliverable sequence number from any sender needs to be determined for each cycle. Specifically, at cycle c, let MinSeq(s, c) be the lowest sequence number of all undelivered events from sender s, with MinSeq(s, c) ≤ Seq(s, c). Also, let MaxSeq(s, c) be the sequence number of the last delivered nonempty event from s in cycle c. Then, MinSeq(s, c) = MaxSeq(s, c − 1) + 1, where MaxSeq(s, c − 1) is determined by the event delivery for cycle c − 1. Define the set of expected deliverable events from s at cycle c by Ω(s, c) = [MinSeq(s, c), Seq(s, c)] and the set of actually received events by Π(s, c). The actually deliverable events from s at cycle c can then be filtered by Ω(s, c) ∩ Π(s, c), which excludes the late and out-of-order events.
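The window-based filtering can be sketched as follows; the names mirror the notation above, and the dict layout is an assumption.

def deliverable(max_seq_prev: int, seq_c: int, received: dict) -> dict:
    # received maps sequence number -> event from sender s, i.e., Π(s, c).
    min_seq = max_seq_prev + 1           # MinSeq(s, c) = MaxSeq(s, c - 1) + 1
    window = range(min_seq, seq_c + 1)   # Ω(s, c) = [MinSeq(s, c), Seq(s, c)]
    # Keep Π ∩ Ω: late, out-of-order events (j < MinSeq) are dropped.
    return {j: e for j, e in received.items() if j in window}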

4.1.3. Total-Order Event Delivery

Total-order event delivery is the key mechanism in object state update: it ensures that all replicas in the same group reach the same state along the same path of state transitions, provided no more events are received. Applying the dynamic cycle event delivery approach, the event delivery for one cycle is described in Algorithm 1, where E(c) stores the events from the consensus for cycle c and γ is the event delivery index within cycle c. γ is incremented for each event delivered in cycle c, with events ordered first by sender index and then by event sequence number. The pair (c, γ) and the calculation of γ ensure that the events in Qd are sorted first by cycle number, then by sender index, and lastly by event sequence number, which sorts all events in the same order on all replicas.

1. For cycle c,
2. If E(c) ≠ ∅, then
3.  Qd ← Qd ∪ E(c)
4.  c ← c + 1
5. Else,
6.  For ∀s ∈ S,
7.   If Π(s, c) = Ω(s, c), then
8.    Qt ← Qt ∪ Π(s, c)
9.   Else,
10.    Qt ← ∅
11.    Consensus for (c, {Ω(s, c) | s ∈ S})
12.    End the loop
13.  If Qt ≠ ∅, then
14.   Qd ← Qd ∪ Qt
15.   c ← c + 1

In a run for cycle c, each replica first checks whether any events have been decided for the cycle by a consensus instance. If so, these events are directly moved to Qd for event handling by the application. Otherwise, Algorithm 1 checks the condition Π(s, c) = Ω(s, c) for each sender s to determine whether all expected deliverable events have been received. If an expected deliverable event has not been received in this cycle, the replica triggers a consensus instance to determine the event delivery for cycle c. Note that the cycle number c is only increased after the events of the cycle have been delivered.

The proposed consensus algorithm is described in Algorithm 2 and illustrated in Figure 8. The consensus request is composed of the cycle number c and the sequence numbers of all the expected deliverable events in c from all senders. On receiving the consensus request, each replica proposes its actually deliverable events to the leader. If a replica has not received an expected event, it proposes ⊥ for the event. On receiving all proposals, the leader decides the event for each sender and each expected sequence number. If at least one replica proposes a non-⊥ and nonempty event e(s, j) for j ∈ Ω(s, c), then e(s, j) is decided for sequence number j from s. Otherwise, an empty event is decided for the slot. After the events of all slots have been decided, the leader broadcasts the decision to all replicas, and they move the decided events to E(c) for event delivery after receiving the decision. It is assumed that reliable point-to-point and multicast communication channels [29] are used in the consensus protocol.

1. Given (c, {Ω(s, c) | s ∈ S}),
2. ri proposes:
3.   {e(s, j) if e(s, j) ∈ Π(s, c), else ⊥ | s ∈ S, j ∈ Ω(s, c)}
4. rL decides:
5.   For each s ∈ S and j ∈ Ω(s, c),
6.   // Proposals: the set of proposed events for c from all replicas;
7.     If ∄e(s, j) ≠ ⊥ ∧ e(s, j) ∈ Proposals, then
8.      e(s, j) ← Empty
9.     D(c) ← D(c) ∪ {e(s, j)}
10.  Broadcast D(c) to all replicas
11. rn applies the decision: E(c) ← D(c)

The complete algorithms of total-order event delivery, including event collection, event delivery, and consensus, can be found in Appendix A.

4.1.4. Garbage Collection

Events that have been handled by the application need to be removed from Qd to prevent its unlimited growth or even overflow. Due to asynchronous event handling, however, a replica ri cannot safely remove a delivered event using only local information, because another replica rj may later request the removed event for a given cycle if rj does not receive it. Thus, determining the removable events is a challenging problem.

As seen in Algorithm 1, since all events in Qd can be uniquely identified by (c, γ), there exists a relation that maps each (c, γ) to a unique integer number λ, called the delivery sequence. Thus, events in Qd can also be identified by (λ, e), and each (c, γ, e) and (λ, e) have a one-to-one mapping for the same e. Let Qd be a sequence ei ei+1 ... ej ... ek mapped to λi λi+1 ... λj ... λk. It can be observed that the events which can be safely pruned satisfy the following characteristics.
(1) If ej can be pruned from Qd, then all events before ej can also be pruned from Qd.
(2) Let L = ei ei+1 ... ej be a prefix subsequence of Qd. If L can be pruned from Qd on one replica, then L can be pruned from Qd on all the replicas in group g.
(3) If ej can be pruned from Qd, then λj ≤ λc, where λc is the sequence of the last event handled by the application, returned by the function LastApplied(Qd).
(4) Trailing rounds with empty events cannot be removed from Qd, since the events of these rounds may be used in consensus for deciding the value of late events.

Characteristics (1) and (2) imply that there is a common latest applied event ecle such that all the events delivered before ecle (including ecle) have been applied on all the replicas, whereas the events after ecle are undecidable. The delivery sequence of ecle is denoted by λcle. Characteristic (3) implies that λcle cannot exceed λc, and characteristic (4) restricts the range of garbage collection. Thus, all the events before λcle, except trailing empty events, can be safely removed from Qd.

Based on the above observations, a gossip protocol is devised to learn λcle by exchanging the λc of all replicas, safely estimating the earliest removable event; it is described in Algorithm 3. In the protocol, each replica periodically sends its λc to the other replicas. A replica also caches the received λc values. Based on the latest λc received from all replicas, λcle is determined as the minimal λc. Then, λcle is adjusted to exclude the trailing empty events (Lines 12-16). Lastly, all the events before λcle are removed from Qd. In Algorithm 3, Λ caches the received λc's from all replicas in G. For a λc from ri, if λc is greater than the cached value for ri in Λ, the new λc can safely replace the existing one, since the λc values from the same replica are monotonically nondecreasing.

1. On replica ri:
2. Upon Timer TIMEOUT
3.  λc ← LastApplied(Qd)
4.  Broadcast λc to all r ∈ G
5.  Reset Timer
6.
7. On replica rj:
8. Upon λc from ri ∈ G
9.  If λc > Λ[ri], then
10.   Λ[ri] ← λc
11.  λcle ← min(Λ)
12.  If e(λcle) = Qd.last // last event in Qd
13.   Map λcle to (c, γ)
14.   While all events delivered for cycle c are empty
15.    λcle ← λcle − |{events of cycle c}|
16.    c ← c − 1
17.  Qd ← Qd ∖ {e | λ(e) < λcle}

4.1.5. Time Synchronization

In the fast event delivery approach, another key component is the synchronization of the start and end times of a cycle on event senders and recipients (all the replicas in group g) to minimize the chance of handling late events through consensus. Specifically, let ∆tn be the network latency. Assume that a lower bound and an upper bound of ∆tn, denoted by ∆TnL and ∆TnH, can be estimated such that most ∆tn falls within the range [∆TnL, ∆TnH]. Then, the cycle length ∆t can be determined by ∆t = ∆TnH − ∆TnL.

Let tstart,s be the start time of the first cycle for event sender s designated by the replicas. Also, let tsend,s(n) be the send time of the nth event from s to the replicas. s first calculates tsend,s(1) = tstart,s − ∆TnL. Then, in the nth cycle, it calculates the sending time of the nth event by tsend,s(n) = tsend,s(1) + (n − 1)·∆t. At the receiving end, all replicas are timed to receive the nth event from s at trecv,s(n) = tstart,s + n·∆t. To collect received events in a timely manner, event collection and event delivery can be run by different threads with different buffers (see Appendix A for details).
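The timing rules can be expressed directly; the sketch below uses the formulas above with illustrative values (the 200ms cycle length follows the experiment setup in Section 8, and the 50ms latency bound is an assumption).

DT = 200.0     # cycle length ∆t in ms
DT_NL = 50.0   # assumed lower latency bound ∆TnL in ms

def t_send(t_start: float, n: int) -> float:
    # tsend,s(n) = tsend,s(1) + (n - 1)·∆t, with tsend,s(1) = tstart,s - ∆TnL
    return (t_start - DT_NL) + (n - 1) * DT

def t_recv(t_start: float, n: int) -> float:
    # trecv,s(n) = tstart,s + n·∆t
    return t_start + n * DT

print(t_send(1000.0, 1), t_recv(1000.0, 1))  # 950.0 1200.0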

A time server, such as an NTP server [31], can be deployed in the system to synchronize the clocks of event senders and recipients, which improves the performance of the system.

4.2. Leader Election and Group Reconfiguration

Once the leader fails, a new leader is elected through leader election. Since both group reconfiguration and event handling rely on the group leader, leader election has the highest priority among the three routines. It interrupts any ongoing group reconfiguration or event delivery process. The leader election criterion is replica age, which increases by one after each group reconfiguration. Based on the assumption that node failure rate increases with time, the new leader is the youngest replica in the group.

Group reconfiguration adds new members for fault-tolerance. A group reconfiguration is triggered once the group size drops below n, recovering the group size to n + e. Group reconfiguration has higher priority than event delivery, so that new members can be quickly added to a group. After group reconfiguration, the leader also notifies all senders of the new configuration.

In both leader election and group reconfiguration, the leader will decide the current states, namely, Qd, E, and G, and synchronize them to all replicas so that all replicas will load the same state after a leader election or a group reconfiguration, which is called state synchrony. For new replicas, the application state, the sequence of the last applied event λc, the time of the first cycle t0, the start time tstart,s for each sender s, and the sender set S are also synchronized from the leader for initialization. The detailed algorithms of leader election and group reconfiguration are in Appendix B.

5. Virtual World Interaction

Virtual world interaction describes how users manipulate the state of virtual objects and perceive the state change in a shared simulated environment. Following the definition, a virtual world interaction includes two steps. First, a user modifies the state of an object through operations. Second, the new object state is synchronized to other interested users. This section extends the proposed event delivery approach for supporting interactions in Virtual Net.

5.1. Flow of Events and Updates

An object is replicated on multiple hosts (i.e., clients and mesh computers) if multiple users operate the same object. To facilitate object state consistency management in interaction, the copies of objects are classified into authoritative copies and nonauthoritative copies. Each object can only have one authoritative copy but multiple nonauthoritative copies. An authoritative copy is maintained by one mesh computer, e.g., the object owner’s. The nonauthoritative copies are maintained by the clients and the mesh computer of other interested users for fault-tolerance. Interest management has been intensively studied in [8] and thus is not discussed in this paper. It is only assumed that a user’s interest scope is determined by his/her perception range in a virtual world, as illustrated in Figure 10.

By distinguishing authoritative copies from nonauthoritative copies, object state management is simplified to managing the state of an authoritative copy and synchronizing the updated state from the authoritative copy to nonauthoritative copies. For managing the authoritative copy, since the data is replicated to multiple nodes in a mesh computer, the fast event delivery approach is applied for maintaining the same state among these replicas.

From the perspective of an authoritative copy, an interaction includes receiving the event from one client, handling the event after it is delivered to the application, and multicasting the updated state to all interested hosts. Figure 9 illustrates the flow of events and updated states. The events of an object are only sent from the clients to the mesh computer which maintains the authoritative copy of the object, while updated states are broadcast by the mesh computers to all the nonauthoritative copies. To support interaction, each mesh computer maintains two sets: the event sender set S and the update recipient set U.

5.2. Neighbors

To reduce overhead, each user only communicates with a limited number of peer users, called neighbors. Due to user mobility, a user's neighbors may change frequently. A neighbor join happens when another user enters the perception range of a user. Likewise, a neighbor leave happens when a neighbor moves out of a user's perception range. For neighbor change, the key problem is to determine the same cycle of a neighbor change on all replicas so that they agree on the cycle events. The join/leave cycle can simply be synchronized through consensus. However, high neighbor dynamics would increase the number of consensus instances, resulting in high communication overhead and high interaction latency.

To apply the fast event delivery approach to neighbor change, the connectivity maintenance approach of mutual notification [6] is employed. Specifically, two types of neighbors are introduced:
(1) Perception neighbor set (Np): the set of users and their virtual objects appearing in the perception range of a user.
(2) Connectivity neighbor set (Nc): the set of users logically connected to the user.

Assume each user maintains a set of connectivity neighbors Nc; how to achieve this in a P2P virtual world can be found in [6]. A user (called User i) periodically exchanges its perception neighbor set Np with its connectivity neighbors. Once a connectivity neighbor finds that another user should or should not be in Np, it notifies User i. To facilitate the description, some abstract functions are introduced:
(i) Multicast(e, y, g): event e with sequence number y is sent to all replicas of group g.
(ii) Handle(e): event e, which has been delivered to Qd, is handled by the application.
(iii) Time(t): set the timer to t, which will trigger a timeout event at t.
(iv) EVENT ← c: assign content c to event EVENT.

5.3. Neighbor Join

Suppose User j is one of the connectivity neighbors of User i. When User j discovers that another user, User k, is in the perception range of User i but not in User i's Np, it notifies User i to add the new neighbor with the following procedure. To distinguish clients from mesh computers, let pi be the client of User i and ri be a replica of group gi (i.e., the mesh computer of User i); pj and pk are defined likewise.

Step 1. pj Multicast(ADD_NEIGHBOR (pk, Gk), y, gi).

Step 2. ri Handle(ADD_NEIGHBOR) for cycle c, ∀ri ∈ Gi.
(a) ri modifies S ← S ∪ {pk}, U ← U ∪ {pk} ∪ Gk.
(b) ri calculates tstart,k = trecv,j(y) + n·∆t and trecv,k(1) = tstart,k + ∆t.
(c) ri calculates cycle ck = ⌈(trecv,k(1) − tnow) / ∆t⌉ + c.
(d) ri Time(ck) for receiving the first event from pk.
(e) ri sends (HANDSHAKE (tstart,k, Gi)) to pk.

Step 3. pk adds Gi to the recipient list.

Step 4. pk Calculate tsend,k(1) ← tstart,k  − ∆TnL.

Step 5. pk Time(tsend,k(1)).

Step 6. pk Multicast(e, 0, gi) at tsend,k(1).

Step 7. pk Time(tstart,k + ∆t) for the next event.

Step 8. ri receives EVENT at ck.

Step 9. ri delivers EVENT for ck.

The neighbor join process is illustrated in Figure 10(a). First, the connectivity neighbor pj sends the ADD_NEIGHBOR event to all replicas in gi for adding the new neighbor pk. Then, each replica ri modifies the event sender set and the update recipient set, calculates the first event start time for event sending and receiving, and notifies the new neighbor pk. The client pk is timed to send the first event at trecv,j(y) + n·∆t − ∆TnL (where n, ∆t, and ∆TnL are preconfigured). Meanwhile, the replicas of gi wait for the event of the first cycle ck at trecv,k(1). Since the start time tstart,k and the first event sequence number j0 = 0 are known to both communication ends, they can individually calculate the times of subsequent event sending and receiving. At last, client pi learns the new neighbor through the update from gi and renders it to User i.
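The timing computation of Steps 2(b), 2(c), and 4 can be sketched as follows; the parameter names are illustrative.

import math

def join_schedule(t_recv_j_y, t_now, c, n, dt, dT_nL):
    # Step 2(b): first cycle start and first receive time for the new sender pk.
    t_start_k = t_recv_j_y + n * dt
    t_recv_k1 = t_start_k + dt
    # Step 2(c): the cycle ck at which the first event from pk is expected.
    c_k = math.ceil((t_recv_k1 - t_now) / dt) + c
    # Step 4: pk's first send time, compensating for the latency lower bound.
    t_send_k1 = t_start_k - dT_nL
    return t_start_k, t_recv_k1, c_k, t_send_k1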

5.4. Neighbor Leave

The procedure of neighbor leave is similar to but simpler than the neighbor join procedure. Suppose User j is one of the connectivity neighbors of User i. When User j discovers that another user, User k, is out of the perception range of User i but still in User i's Np, it notifies User i with the following procedure.

Step 1. pj Multicast(RM_NEIGHBOR (k), y, gi).

Step 2. ri Handle(RM_NEIGHBOR) for cycle c, ∀ri ∈ Gi.
(a) ri modifies S ← S ∖ {pk}, U ← U ∖ ({pk} ∪ Gk) for cycle c + 1.

The neighbor leave process is illustrated in Figure 10(b). The connectivity neighbor pj sends the RM_NEIGHBOR event to all replicas in gi for removing the neighbor pk, which then remove pk from the event sender set and pk and Gk from the update recipient set for cycle c + 1. Through the update from gi, client pi then learns the leave of neighbor pk and removes pk from display.

6. Theoretical Verification

The correctness of the state update design is determined by the state of all replicas in a group, as well as the clients. First, the correctness of leader election and group reconfiguration is verified, since these support the other propositions. The proofs of all lemmas and theorems can be found in Appendix C.

Lemma 1 (leader election synchrony). All the live replicas in G maintain the same Qd, E, and G after leader election.

Lemma 2 (group reconfiguration synchrony). All the live replicas in G maintain the same Qd, E, and G after a group reconfiguration.

Next, without loss of generality, the correctness of the consensus protocol is verified for an arbitrary cycle c. The validity and integrity properties [29] are not verified here, since they are not related to the main result and are easy to verify; interested readers can prove them. Here, only the agreement property is verified.

Lemma 3 (consensus agreement). If a live replica ri∈ G delivers an event e to E(c) from a consensus instance for cycle c, then e is eventually delivered to E(c) by all the live replicas.

With the above lemmas, the main result can be obtained. Before presenting it, an important property of the late event handling approach needs to be verified.

Lemma 4 (Ω(s, c) agreement). All the live replicas in G expect to deliver the same set of events Ω(s, c) for sender s ∈ S and cycle c.

Now, the main result of theoretical verification can be presented with the following theorem and corollary.

Theorem 5 (total-order event delivery). If a live replica ri ∈ G delivers two different events e1 and e2 into Qd with delivery sequences λ1 and λ2, then e1 and e2 will eventually be delivered into Qd on all the live replicas, with λ1 and λ2 being two non-negative integers and λ1 ≠ λ2.

Corollary 6 (replica synchronization). All the live replicas in G maintain the same state of their virtual objects.

Another important result is the correctness of garbage collection, which is verified in Theorem 7.

Theorem 7 (garbage collection safety). If event e is removed from Qd on ri ∈ G, then e has been handled by the application on all the live replicas in G.

Based on Theorem 5, the correctness of the neighbor change procedures is shown with the following corollaries.

Corollary 8 (total-order event delivery with sender join). All the live replicas in G deliver the same first event e0 from a neighbor s with the same delivery sequence λ0.

Corollary 9 (total-order event delivery with sender leave). All the live replicas in G deliver the same last event from a neighbor s with the same delivery sequence.

7. Performance Analysis and Comparison

The performance of the proposed fast event delivery approach is studied in terms of synchronization delay and update loss rate. Three alternative approaches are introduced and compared with the proposed approach: the primary-backup approach, the reliable primary-backup approach, and the consensus-based total-order approach.

In the primary-backup approach [11], one replica is the primary and the rest are backups. The primary receives and handles all events and then broadcasts updates to the recipients. Meanwhile, the primary sends the received events to the backups for fault-tolerance. In the reliable primary-backup approach, the primary broadcasts the update only after the events have been reliably synchronized to all backups. Note that the unreliable primary-backup approach does not ensure state consistency in case of primary failure. The consensus-based total-order approach [29] is similar to the proposed design, except that all events are delivered through consensus. Specifically, in each cycle, all replicas propose the events received within the cycle, and the leader decides the event delivery order for the cycle.

Synchronization delay describes the time consumed in synchronizing events over all live replicas. The primary-backup approach has no synchronization delay. In the reliable primary-backup approach, only two communication steps are involved in event synchronization: the primary broadcasts the events to all backups and collects the responses from the backups. The consensus-based total-order approach needs one more communication step, as shown in Figure 8. In the proposed approach, the synchronization delay is weighted by the probability psync of triggering the consensus protocol.

Update loss rate describes the probability that a client does not receive the corresponding update after it sends an event to a mesh computer, due to event loss or update loss. In the primary-backup approach, an update loss occurs whenever the channel between an event sender/update recipient and the primary replica fails. In the consensus-based approach and the proposed approach, an update loss occurs only when no replica receives the event or all replicas fail to send the update to a recipient. Moreover, it is assumed that late and out-of-order events are discarded in all the approaches.

The performance comparison of different approaches is shown in Table 2, in which dc denotes the delay in collecting a message from all replicas, dm denotes the reliable multicast delay, and ploss denotes the probability of message loss on a link.

The comparison result shows that if psync is small, i.e., transmission latency and clock offset are low, the synchronization delay of the proposed approach is small and may even be close to that of the unreliable primary-backup approach. Thus, the proposed approach can opportunistically provide higher responsiveness than the consensus-based total-order approach and the reliable primary-backup approach.

For the update loss rate, ploss^n < ploss for n ≥ 2 and 0 < ploss < 1, which can be proved as follows.

First, let n = 2. Then ploss^2 = ploss · ploss < ploss, since 0 < ploss < 1. Thus, ploss^n < ploss for n = 2.

Second, let f(n) = x^n with x ∈ (0, 1). Then f(n + 1)/f(n) = x < 1. Thus, x^n monotonically decreases with n.

Therefore, ploss^n < ploss for n ≥ 2. This shows that the consensus-based total-order approach and the proposed approach have a lower update loss rate than the primary-backup approaches.
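The inequality can also be checked numerically, as in the following small sketch:

# ploss^n < ploss holds for any 0 < ploss < 1 and n >= 2.
for p in (0.3, 0.5, 0.7):
    for n in (2, 3, 5):
        assert p ** n < p, (p, n)
print("ploss^n < ploss verified for the sampled cases")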

8. Experiments and Results

8.1. Simulation Setup

The proposed model is evaluated by simulating distributed computing. Experiments are run in OMNeT++ to simulate message transmission in a network and event-based programming (simulation code at https://github.com/sunniel/VirtualNetEventHandling). The simulation is run by sending events from 10 clients to a replica group, representing 10 neighbors. The replica group size is configured to 5. The cycle length is set to 200ms. In the experiments, each client sends more than 9,000 events to the replica group, which simulates a half-hour game session with a 200ms user operation interarrival time, applicable to most game genres [32]. After events are sorted and handled, updates are transmitted to clients for collecting the statistical results. If the group size drops below the availability threshold, new replicas are generated by the Rendezvous.

The network traffic model includes packet latency and packet loss. The packet loss rate is varied to simulate different message loss rates ploss. The packet delay is calculated from the one-trip communication delay and network jitter. To facilitate the simulation, network traffic is generated by a generating function fitted to the analytical results of real data. Reference [33] suggests that the one-way delay between two hosts H1 and H2 can be modelled by delay(H1, H2) = Dmin + jitter, where Dmin is the minimum single-trip delay and jitter is the network jitter caused by network congestion. In the experiments, Dmin is configured to 50ms, and jitter is modelled by an exponential distribution and varied to simulate the variation of network latency.
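A sketch of this delay model is given below; it mirrors delay(H1, H2) = Dmin + jitter with exponentially distributed jitter, using the values of the setup above.

import random

D_MIN = 50.0  # minimum single-trip delay Dmin in ms, as configured above

def one_way_delay(mean_jitter_ms: float) -> float:
    # delay(H1, H2) = Dmin + jitter, jitter ~ Exponential(mean = mean_jitter_ms)
    return D_MIN + random.expovariate(1.0 / mean_jitter_ms)

samples = [one_way_delay(50.0) for _ in range(10000)]
print(sum(samples) / len(samples))  # ≈ 100 ms on average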

To simulate replica failure, replica dynamics is characterized by session length, which measures the length of time that a peer is continuously connected to a given P2P network, from its arrival to its departure [34]. The session lengths of P2P applications can be described by different stochastic models. Reference [34] shows that a Weibull distribution or a log-normal distribution fits the observations best. In this study, replica session length is modelled by a Weibull distribution with a mean value of half an hour.
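Session-length sampling under this model can be sketched as follows; the Weibull shape parameter is an assumption, with the scale solved to match the half-hour mean.

import math, random

SHAPE = 1.5                                  # assumed Weibull shape parameter k
MEAN_S = 30 * 60                             # target mean session length: 30 min
SCALE = MEAN_S / math.gamma(1 + 1 / SHAPE)   # solve E[X] = scale * Γ(1 + 1/k)

def session_length_s() -> float:
    return random.weibullvariate(SCALE, SHAPE)

mean = sum(session_length_s() for _ in range(10000)) / 10000
print(mean)  # ≈ 1800 seconds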

8.2. Experiment Results for Overall Performance

To evaluate the overall performance of event handling, three event delivery approaches are implemented and compared: primary-backup, consensus-based total-order, and fast event delivery. Reliable primary-backup is not included in the comparison, since its performance lies between that of the unreliable primary-backup approach and the consensus-based approach.

First, responsiveness is evaluated by comparing the interaction latencies of the different approaches. Interaction latency includes both the round-trip end-to-end delay between an event sender and a group of replicas and the synchronization delay. The mean value of network jitter is fixed to 50ms, and the standard deviation is varied from 50ms to 250ms to simulate the scenario in which events occasionally arrive late and out-of-order. The experiment result in Figure 11 shows that the fast event delivery approach provides much lower latency than the consensus-based total-order approach. Especially when the network latency is small, the responsiveness of the proposed approach is close to that of the primary-backup model. This is because the rate of triggering the consensus protocol decreases when most events arrive before the end of their cycles.

Second, the end-to-end update delivery rate is evaluated by varying the message drop rate to simulate the change of ploss from 0.3 to 0.7. In an asynchronous network, message loss cannot be distinguished from long message delay. Thus, an update delivery timeout, configured to 5 seconds, is used to cover both situations. The mean value of network jitter is fixed to 50ms to eliminate the interference of late events.

Figure 12 shows the update delivery rates of the three approaches. The update delivery rate of the primary-backup approach is much lower than that of the other two approaches. Moreover, it drops quickly from around 0.7 to below 0.3 as ploss increases, showing a rapid increase of the update loss rate. In contrast, the update delivery rates of the consensus-based total-order approach and the proposed approach (overlapped in the figure) remain high. While ploss is below 0.5, the update delivery rate of these two approaches is close to 1. Once ploss exceeds 0.5, the drop in their update delivery rates becomes evident. This is because, as the message drop rate increases, more messages are received by none of the replicas.

In the same experiment, interaction latency is also studied as the message drop rate changes. Figure 13 shows that, with the increase of message drops, a replica has a higher chance of missing cycle events, such that Π(s, c) ≠ Ω(s, c) and more consensus instances are triggered by replicas for event synchronization. This means that increasing the message drop rate has an effect on the interaction latency of the fast event delivery approach similar to that of increasing network jitter.

8.3. Experiment Results for Individual Improvements
8.3.1. Performance of Late Event Handling

The experiment in this section verifies the late event handling approach. The proposed approach is compared with the simple discard approach, which simply discards any late events. In the experiment, the timing of event sending is perturbed with a clock error modelled by a normal distribution (μ = 0). The standard deviation of the clock error is varied from 0 to 400ms to increase the rate of late events. Moreover, the network jitter is fixed to 50ms and the message drop rate ploss is fixed to 0 to eliminate their interference with the result.

Figure 14 shows that the proposed approach has a much higher update delivery rate than the simple discard approach. Especially when the clock error is higher than 300ms, simply discarding late events results in almost no messages being delivered, since most events arrive late. In contrast, with the proposed approach, the update delivery rate can be maintained close to 1. The rate remains slightly below 1 because a few late events do not meet the deliverability condition.

8.3.2. Performance of Garbage Collection

Garbage collection is tested to verify its effectiveness. The main purpose of the experiment is to show that the proposed mechanism can effectively keep the length of Qd from overgrowing. Thus, the experiment is conducted with two settings: one with garbage collection and the other without. The growth of the delivery queue length (Qd) is observed and compared for the two settings. The network jitter is fixed to 50ms and the message drop rate ploss is configured to 0 to remove the interference of event loss in both settings, so that the length of Qd is determined only by the number of events and garbage collection. In the setting with garbage collection, the length of the garbage collection cycle is fixed to 5 seconds. The experiment result is shown in Figure 15. When garbage collection is not applied, the length of Qd quickly increases from several thousand events to tens of thousands of events within 500 seconds. When garbage collection is applied, the length of Qd stays within around 300 events. The comparison shows that the proposed garbage collection protocol can effectively prevent Qd from unlimited growth or even overflow.

Moreover, the cycle length of the gossip protocol is varied to show the protocol's control over the length of Qd. The experiment result is shown in Figure 16. When the cycle length of the gossip protocol increases from 1 second to 10 seconds, the length of Qd grows from around 50 events to around 500 events. This result shows that the length of Qd is approximately linear in the cycle length of the gossip protocol, implying that the length of Qd can be effectively controlled by changing the cycle length. This control is useful because different virtual world applications may have different event sizes. If an application has to cache large events, the length of Qd can be reduced to fit the same event cache space.

8.3.3. Performance of Time Synchronization

The performance improvement through time synchronization is shown in Figures 17 and 18. The main purpose of the experiment is to show that the time synchronization mechanism has a positive effect on interaction latency reduction and garbage collection. Thus, the experiment compares the performance of event handling in two settings: one with time synchronization and the other without. A clock error is added to the timing of event sending to simulate the scale of clock synchronization loss between the event senders and recipients. The network jitter is fixed to 50ms and the message drop rate ploss is fixed to 0 to eliminate their interference with the result.

Figure 17 shows that if the clocks of event senders and event recipients are out-of-sync, interaction latency increases with the clock offset, because clock synchronization loss increases the chance of triggering consensus for event delivery. Note that the interaction latency increase with clock offset is almost linear, clearly showing the impact of synchronization loss. In contrast, if time synchronization is applied before event transmission, the interaction latency stays below 1 second. This is because the number of consensus instances is reduced, and thus the communication steps for event delivery are minimized accordingly.

Figure 18 shows that the length of the delivery queue (Qd) also increases with clock offset. Due to clock synchronization loss, more cycles are delivered with empty events, and thus more empty events accumulate at the tail of Qd. According to the garbage collection rule, trailing rounds with empty events cannot be removed from Qd. Thus, clock synchronization loss weakens the effectiveness of the garbage collection mechanism. Note that when the clock offset exceeds 300ms, the growth of Qd becomes large. In contrast, if time synchronization is applied, the length of Qd does not increase with clock offset.

8.4. Discussion

The experiment results for the overall performance are consistent with the theoretical analysis. Specifically, the proposed fast event delivery approach is reliable and provides opportunistically high responsiveness compared with the consensus-based total-order approach and the primary-backup approach: it achieves the highest update delivery rate and almost the same interaction latency as the primary-backup approach when network latency is low. Thus, the overall performance of the proposed approach is better than that of the two alternatives. The results also imply that, in practice, cache nodes should be selected close to clients, so that most cycle events arrive on time at all replicas and can be delivered without consensus; interaction latency is then minimized.

The experiment results for the individual improvements show their effectiveness. Specifically, the late event handling experiment shows that the proposed dynamic cycle event delivery approach can largely increase the update delivery rate. A high update delivery rate reduces the chance of event resending, which would lower interaction responsiveness and harm user experience. The garbage collection experiment shows its effectiveness in limiting the length of the event delivery queue. This is important because it not only avoids buffer overflow but also bounds the time spent traversing Qd when a specific event must be searched for; traversing a large buffer is slow and reduces system responsiveness. Lastly, the time synchronization evaluation shows its importance: without time synchronization between event senders and recipients, both the advantage of the fast event handling approach and the effect of the garbage collection mechanism diminish.

9. Conclusions

With the popularity of mobile virtual worlds, scalability has become an outstanding challenge in infrastructure development. This paper discusses the possibility of using P2P technology to address the scalability problem. Different from existing P2P virtual worlds, client unreliability raises a new problem in mobile settings. This paper addresses the problem with a new hierarchical P2P computing model. Rather than introducing every detail of the computing model, we focus on object state update to avoid reinventing the wheel. The core problem of state update is maintaining replica state consistency without compromising system responsiveness. To address this problem, a fast event delivery approach is proposed, and, based on this approach, a new virtual world interaction model is introduced to enable interaction between multiple users.

Our work is important in providing a scalable infrastructure for mobile P2P virtual worlds. Based on the proposed Virtual Net architecture, several new research problems arise in building virtual world applications. First, our current approach is still limited in responsiveness, since it belongs to the opportunistic category. To further improve responsiveness without compromising state consistency, we plan to employ conflict-free replicated data types (CRDTs) [35] to replace the consensus approach in event handling. With CRDTs, events can be delivered in any sequence. However, delivering events in different sequences may confuse users with respect to continuous events, such as avatar movement. Thus, the problem can be expected to combine human-computer interaction (HCI) and distributed computing. Moreover, future work includes the application and adaptation of cloud-fog computing techniques for contributed resource management, including cache node allocation, and of P2P virtual world techniques, to provide a complete and practical mobile P2P virtual world solution.

Appendix

A. Fast Event Delivery Protocols

The full set of fast event delivery protocols is described in this appendix, including event collection, event delivery, and consensus. The payload of an event can be an operation of the event sender, Empty, or ⊥ (called an empty event). Note that if a reliable communication channel is required in a function, the keyword Reliably is added before the send or broadcast operation; the implementation of a reliable channel can be found in [29]. Message names are capitalized, and messages may carry parameters. The notations used in the pseudocode are listed in Table 3. In particular, G ∩ R denotes the set of live replicas.
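For readability, the Python sketches in this appendix use the following minimal event representation. The field names are ours; only the three payload kinds (an operation, Empty, or ⊥) follow the protocol description above.

    # Illustrative event representation. BOTTOM stands for the undefined
    # payload (written as ⊥ in the pseudocode) and EMPTY for the explicit
    # Empty payload decided by consensus.
    from dataclasses import dataclass
    from typing import Any

    BOTTOM = object()   # placeholder for a missing (not yet received) event
    EMPTY = object()    # payload decided when no operation was sent for a slot

    @dataclass(frozen=True)
    class Event:
        sender: str     # event sender s
        seq: int        # per-sender sequence number Seq(e)
        payload: Any    # an operation, EMPTY, or BOTTOM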

In each cycle ∆t, the event collection protocol (Algorithm 4) periodically collects the events received from all senders in S, moving them from the receiving buffer Qr to a temporary buffer Qp. As described in the Late Event Handling section, not only the cycle events (i.e., Seq(e) = Seq(s, c), Lines 8-11) but also late and still deliverable events are collected (Lines 12-15). If an expected event has not been received, an empty event (⊥) is assigned to the corresponding event sequence number (Line 10). Likewise, a late event replaces the empty event with the same sequence number in Qp (Lines 14-15). All undeliverable events are discarded on event receipt (Line 3).

1. On replica ri:
2. Upon EVENT e
3.  If Seq(e) > MaxSeq(Sender(e), c - 1), then
4.   Qr ← Qr ∪ {e}
5.
6. Upon cycle TIMEOUT
7.  c ← c + 1
8.  For each s ∈ S,
9.   If e(s, Seq(s, c)) ∉ Qr, then
10.    e(s, Seq(s, c)) ← (s, Seq(s, c), ⊥)
11.   Qp ← Qp ∪ {e(s, Seq(s, c))}
12.  For each c' < c ∧ s ∈ S: e(s, c') ∉ Qd,        // Collect late and deliverable events
13.   If e(s, Seq(s, c')) ∈ Qr, then
14.    Qp ← Qp \ {(s, Seq(s, c'), ⊥)}
15.    Qp ← Qp ∪ {e(s, Seq(s, c'))}
16.    Qr ← Qr \ {e(s, Seq(s, c'))}
17.  Reset Timer cycle ∆t

The event delivery protocol (Algorithm 5) is executed upon detecting that a cycle satisfies the delivery condition. If the events of cycle c - 1 have been delivered, and the events of cycle c have either (1) been collected or (2) been decided by a consensus instance (Line 2), then cycle c satisfies the condition for triggering event delivery. Thus, event delivery executes asynchronously to event collection. For distributed agreement, the protocol first checks the second condition to ensure that a consensus result is applied on all replicas: if the cycle has been decided by a consensus instance, the decided events are delivered regardless of whether any nonempty event has newly arrived for the cycle (Lines 3-4). Otherwise, events are delivered from Qp. If all expected events have been collected (Lines 6-8, 13-14), they are delivered to Qd in the sequence of γ for cycle c. The range [MinSeq(s, c), Seq(s, c)] specifies the deliverable sequence numbers for each sender s and cycle c. The calculation of γ (Line 16) ensures that all replicas deliver concurrent events from different senders in the same sequence. If any collected event is still empty (⊥), a query message is sent to the group leader. The sets {MinSeq(s, c)} and {Seq(s, c)} carried in the query specify, for all senders, the lowest deliverable sequence number and the sequence number of the cycle, respectively.

1. On replica ri:
2. Upon Delivered(c - 1) ∧ (Collected(c) ∨ E(c) ≠ ∅)
   ∧ ¬Consensus(c)
3.  If E(c) ≠ ∅, then
4.   D ← E(c)
5.  Else
6.   For each s ∈ S, j ∈ [MinSeq(s, c), Seq(s, c)],
7.    c' ← c - (Seq(s, c) - j)
8.    T ← T ∪ {(c', e(s, j)) ∣ (c', e(s, j)) ∈ Qp}
9.   If ∃(c', e) ∈ T: Payload(e) = ⊥, then
10.    QUERY ← (c, {MinSeq(s, c) ∣ s ∈ S}, {Seq(s, c) ∣ s ∈ S})
11.    Reliably send QUERY to rL
12.    End the procedure
13.   Else
14.    D ← D ∪ T
15.  For each (c', e) in D,
16.   γ ← γ(Sender(e), Seq(e))          // deterministic order from sender and sequence number
17.   Qd ← Qd ∪ {(c', γ, e)}
18.   c ← c + 1
19.
20. On leader rL:
21. Upon QUERY(c, {MinSeq(s, c)}, {Seq(s, c)}) from ri
22.  For each s ∈ S, j ∈ [MinSeq(s, c), Seq(s, c)],
23.   c' ← c - (Seq(s, c) - j)
24.   R ← R ∪ {(c', γ, e(s, j)) ∣ (c', e(s, j)) ∈ Qp ∪ Qd}
25.  If ∄(c', γ, e) ∈ R: Payload(e) = ⊥, then
26.   QUERY_REPLY ← (c, R)
27.   Reliably send QUERY_REPLY to ri
28.  Else
29.   P ← P ∪ {c}

The leader checks locally, in Qp and Qd, the receipt of the events for the requested cycle. If all expected events (for each sender s and each sequence number in [MinSeq(s, c), Seq(s, c)]) of the requested cycle have been received, the leader replies with them to the requesting replica (Lines 22-27). Otherwise, the leader initializes a new consensus instance for the cycle (Lines 28-29).
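A possible Python reading of the delivery step: events of the same cycle are delivered in a sequence γ derived deterministically from the sender identity and sequence number alone, so every replica orders concurrent events identically. The concrete rule below (sort by (sender, seq)) is our assumption; the protocol only requires that γ depend solely on s and j.

    # Sketch of deterministic intra-cycle ordering. Any rule that depends only
    # on (sender, seq) works, because all replicas then agree on the sequence.
    BOTTOM = object()

    def deliver_cycle(qd, cycle_events):
        """cycle_events: list of (sender, seq, payload) collected for cycle c.
           Returns True if delivered, False if a consensus query is needed."""
        if any(p is BOTTOM for (_, _, p) in cycle_events):
            return False                    # some event missing: query leader
        for gamma, (s, j, p) in enumerate(sorted(cycle_events,
                                                 key=lambda e: (e[0], e[1]))):
            qd.append((gamma, s, j, p))     # same gamma on every replica
        return True

    qd = []
    print(deliver_cycle(qd, [("s2", 7, "opB"), ("s1", 3, "opA")]))   # True
    print(qd)   # [(0, 's1', 3, 'opA'), (1, 's2', 7, 'opB')] on every replica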

The consensus protocol (Algorithm 6) is instantiated and run for each requested cycle c. Note that the consensus protocol is executed only when there is no ongoing leader election (LE) or group reconfiguration (GR). Also, a message from a previous leader or a previous group configuration is not processed, for consistency. These preconditions are therefore added to all message handling procedures in the consensus protocol (Lines 8, 11, 19, and 26). First, the flags LE and GR are checked to ensure that any previous leader election or group reconfiguration has finished. Then, the sender's epoch and configuration ID, attached to each message, are compared with the local epoch and configuration ID.
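As a minimal illustration (our own helper, not part of the paper's protocol), the precondition can be factored into a single guard evaluated at the top of each message handler before the pseudocode below runs:

    # Sketch of the message-handling precondition: drop messages from a
    # previous leader (stale epoch) or a previous group configuration (stale
    # cid), and defer handling while a leader election (LE) or group
    # reconfiguration (GR) is in progress.
    def may_handle(msg_epoch, msg_cid, epoch, cid, le_running, gr_running):
        return (msg_epoch == epoch and msg_cid == cid
                and not le_running and not gr_running)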

1. On leader rL:
2. Upon P ≠ ∅ ∧ GR = false ∧ LE = false
3.  For each c ∈ P: c ∉ Z,                 // Z: cycles with an ongoing consensus instance
4.   Z ← Z ∪ {c}
5.   QUERY ← (epoch, cid, c)
6.   Reliably send QUERY to G ∩ R
7.
8. Upon QUERY_RESULT(epoch', cid', c, Wi) from ri ∧ epoch' = epoch
   ∧ cid' = cid ∧ GR = false ∧ LE = false
9.  Q ← Q ∪ {Wi}
10.
11. Upon G ∩ R ⊆ {ri ∣ Wi ∈ Q} ∧ GR = false ∧ LE = false
12.  W' ← Decide(c, Q)
13.  E(c) ← W'
14.  P ← P \ {c}
15.  DECISION ← (epoch, cid, c, W')
16.  Reliably broadcast DECISION to G ∩ R
17.
18. On any replica ri:
19. Upon QUERY(epoch', cid', c) from rL ∧ epoch' = epoch
    ∧ cid' = cid ∧ GR = false ∧ LE = false
20.  For each s ∈ S, j ∈ [MinSeq(s, c), Seq(s, c)],
21.   c' ← c - (Seq(s, c) - j)
22.   W ← W ∪ {(c', e(s, j)) ∣ (c', e(s, j)) ∈ Qp ∪ Qd}
23.  QUERY_RESULT ← (epoch, cid, c, W)
24.  Reliably send QUERY_RESULT to rL
25.
26. Upon DECISION(epoch', cid', c, W') from rL ∧ epoch' = epoch
    ∧ cid' = cid ∧ GR = false ∧ LE = false
27.  E(c) ← E(c) ∪ W'
28.
29. Decide(c, Q)
30.  For each s ∈ S and j ∈ [MinSeq(s, c), Seq(s, c)],
31.   If ∃e(s, j) ∈ Q: e(s, j) ≠ ⊥, then
32.    W' ← W' ∪ {e(s, j)}
33.   Else
34.    e ← (s, j, Empty)
35.    W' ← W' ∪ {e}
36.  Return W'

The consensus protocol is described in detail in the Total-Order Event Delivery section. A replica replies to the leader's query for the events of cycle c only when it has passed the event collection of cycle c, which requires that (1) the decided events for cycle c - 1, if any, have been delivered, and (2) the event collection for cycle c has been completed. The leader decides the events for c only after proposals are received from all live replicas (Line 11). The Decide function determines the events of a given cycle for each event sender (Lines 29-36) from the sets of events received for c from all replicas. For each sender s and event sequence number j, if all replicas propose ⊥, then the payload of the event e(s, j) is decided as Empty (Lines 33-35). Otherwise, the event payload is decided with the value of a nonempty proposal from any replica (Lines 31-32).
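The Decide rule can be phrased compactly in Python. This sketch assumes each replica's proposal is a mapping from (sender, seq) to a payload, with BOTTOM standing for an event the replica never received; the data shapes are ours.

    # Sketch of Decide(c, Q): for each expected slot, keep any nonempty
    # proposal; if every replica proposed BOTTOM, decide the slot as EMPTY.
    BOTTOM, EMPTY = object(), object()

    def decide(slots, proposals):
        """slots: iterable of (sender, seq) expected for the cycle
           proposals: list of dicts, one per replica, (sender, seq) -> payload"""
        decided = {}
        for slot in slots:
            values = [p.get(slot, BOTTOM) for p in proposals]
            nonempty = [v for v in values if v is not BOTTOM]
            decided[slot] = nonempty[0] if nonempty else EMPTY
        return decided

    slots = [("s1", 3), ("s2", 7)]
    props = [{("s1", 3): BOTTOM, ("s2", 7): "opB"},
             {("s1", 3): BOTTOM, ("s2", 7): BOTTOM}]
    d = decide(slots, props)
    print(d[("s1", 3)] is EMPTY, d[("s2", 7)])   # True opB

Taking any nonempty proposal is safe here because a nonempty event carries the sender's operation, which is identical in every replica's copy.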

B. Leader Election and Group Reconfiguration Protocols

The notations used in the leader election protocol and the group reconfiguration protocol follow the same conventions listed in Table 3.

The leader election protocol (Algorithm 7) is triggered once the leader is no longer in the set of live replicas (Line 2). Each triggered replica checks whether it satisfies the condition to be the candidate by calling the SelectLeader function (Lines 8-13). As described in the Leader Election and Group Reconfiguration section, the candidate is the replica with the smallest age; if multiple candidates have the same age, the one with the smallest ID is selected.

1. On any replica:
2. Upon rL ∉ R ∧ rc ≠ self
3.  LE ← true
4.  rc ← SelectLeader()
5.  If rc = self, then
6.   Reliably broadcast LE_QUERY to G ∩ R
7.
8. SelectLeader()
9.  Rc := {ri ∣ ri ∈ G ∩ R ∧ ri.age = min{rj.age ∣ rj ∈ G ∩ R}}
10.  If |Rc| = 1, then
11.   Return ri: ri ∈ Rc
12.  Else
13.   Return ri: ri ∈ Rc ∧ ri.ID = min{rj.ID ∣ rj ∈ Rc}
14.
15.
16. Upon LE_QUERY from rc
17.  If rc ∈ R ∧ rc = SelectLeader(), then
18.   LE ← true
19.   LE_STATE ← (Qd, E, cid, G, epoch)
20.   Reliably send LE_STATE to rc
21.  Else
22.   Reliably send NACK to rc
23.
24. Upon LOAD_LEADER(Qd', E', epoch', cid', G', Init)
    from rc ∧ rc ∈ R ∧ rc = SelectLeader()
25.  (Qd, E, cid, G, epoch) ← (Qd', E', cid', G', epoch' + 1)
26.  (rL, rc) ← (rc, ⊥)
27.  LE ← false
28.  If newReplica = true, then         // New replica initialization is needed,
29.   Initialize(Init)              // in case a group reconfiguration is
30.   newReplica ← false            // interrupted by a leader election
31.
32.
33. Upon NACK from ri
34.  If self = SelectLeader(), then
35.   Reliably send LE_QUERY to ri
36.  Else
37.   rc ← ⊥
38.
39. Upon LE_STATE(Qd,i, Ei, cidi, Gi, epochi) from ri
40.  Events ← Events ∪ {Qd,i}
41.  Decisions ← Decisions ∪ {Ei}
42.  Configs ← Configs ∪ {(cidi, Gi)}
43.  Epochs ← Epochs ∪ {epochi}
44.  Senders ← Senders ∪ {ri}
45.
46. Upon G ∩ R ⊆ Senders
47.  Qd ← Longest(Events)
48.  E ← Merge(Decisions)
49.  (cid, G) ← Latest(Configs)
50.  epoch ← Latest(Epochs)
51.  Init ← (t0, {trecv,s(1) ∣ s ∈ S}, (λc, state),   // state: current application
    {r.age ∣ r ∈ G}, S)                  // state
52.  LOAD_LEADER ← (Qd, E, epoch, cid, G, Init)
53.  Reliably broadcast LOAD_LEADER to G ∩ R
54.  Broadcast G to S

To achieve state synchrony, the candidate rc sends the state query message (LE_QUERY) to all live replicas. If a replica has not learned of the candidate, or has learned of a newer candidate, it rejects the request from rc by replying with a NACK message (Lines 17, 21-22). Otherwise, the replica replies to the state query with its state Qd, E, epoch, and configuration (cid and G). On receiving the states from all live replicas, rc decides the latest consistent state (Lines 47-53) with the following functions:
(i) Longest(): selects the longest Qd among all replicas.
(ii) Merge(): returns the union of the decision sets from all replicas for all cycles.
(iii) Latest(Configs): returns the largest configuration ID cid and the corresponding replica set G, which represent the latest configuration seen by the group.
(iv) Latest(Epochs): returns the largest epoch, which represents the latest leader election seen by the group.

Moreover, in case of any unfinished group reconfiguration, additional state (including the time of the first cycle t0, the start times of the first events from all senders, the current application state with the corresponding delivered event sequence λc, the ages of all replicas, and the sender set S) is synchronized from rc to new replicas for state initialization. After receiving the LOAD_LEADER message from rc, the replicas update their state to the decided values. Finally, all replicas adopt rc as the new leader and increase the epoch by one.
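The candidate's reconciliation step admits a direct translation. The following sketch implements Longest, Merge, Latest, and SelectLeader as described above; the data shapes (lists, dicts, (cid, G) tuples) are our own.

    # Sketch of the candidate's state reconciliation after collecting LE_STATE
    # replies from all live replicas. Data shapes are illustrative.

    def longest(qds):                 # Longest(): pick the longest Qd
        return max(qds, key=len)

    def merge(decisions):             # Merge(): union of decision sets per cycle
        merged = {}
        for d in decisions:
            for cycle, events in d.items():
                merged.setdefault(cycle, set()).update(events)
        return merged

    def latest(configs):              # Latest(): config with the largest cid
        return max(configs, key=lambda cfg: cfg[0])   # cfg = (cid, G)

    def select_leader(replicas):      # smallest age, ties broken by smallest ID
        return min(replicas, key=lambda r: (r["age"], r["id"]))

    print(select_leader([{"id": 3, "age": 2}, {"id": 1, "age": 2},
                         {"id": 2, "age": 5}]))   # -> replica with id 1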

The group reconfiguration protocol (Algorithm 8) is similar to the leader election protocol, except that it has lower priority, which is reflected by the precondition checking the flag LE in all message handling procedures (Lines 11, 19, and 30). Group reconfiguration is triggered when new replicas are added to the survival set (i.e., R \ G ≠ ∅). GT caches the latest triggered reconfiguration to preclude unnecessary retriggering (Lines 2-3). At the end of the reconfiguration, each replica increases the age of all live replicas by one (Lines 26-27).

1. On any replica:
2. Upon R \ G ≠ ∅ ∧ R ≠ GT ∧ LE = false
3.  GT ← R
4.  GR ← true
5.  If rL = self, then
6.   cid ← cid + 1
7.   GR_QUERY ← (epoch, cid)
8.   Reliably broadcast GR_QUERY to GT ∩ R
9.
10.
11. Upon GR_QUERY(epoch', cid') from rL ∧ epoch' ≥ epoch
    ∧ cid' > cid ∧ LE = false
                        // Use cid to discard messages from a previous unfinished GR;
                        // epoch' ≥ epoch: for new members;
                        // cid' > cid: because the new cid has not been received
12.  GR ← true
13.  cid ← cid'
14.  If epoch = 0, then
15.   epoch ← epoch'
16.  GR_STATE ← (Qd, E, cid, epoch)
17.  Reliably send GR_STATE to rL
18.
19. Upon LOAD_CONFIG(Qd', E', epoch', cid', GT, Init)
    from rL ∧ epoch' = epoch ∧ cid' = cid ∧ LE = false
20.  (Qd, E) ← (Qd', E')
21.  G ← GT
22.  GR ← false
23.  If newReplica = true, then
24.   Initialize(Init)
25.   newReplica ← false
26.  For each r ∈ G ∩ R,
27.   r.age ← r.age + 1
28.
29.
30. Upon GR_STATE(Qd,i, Ei, cidi, epochi) from ri ∧ epochi
    = epoch ∧ cidi = cid ∧ LE = false
31.  Events ← Events ∪ {Qd,i}
32.  Decisions ← Decisions ∪ {Ei}
33.  Senders ← Senders ∪ {ri}
34.
35. Upon GT ∩ R ⊆ Senders
36.  Qd ← Longest(Events)
37.  E ← Merge(Decisions)
38.  Init ← (t0, {trecv,s(1) ∣ s ∈ S}, (λc, state),   // state: current application state
    {r.age ∣ r ∈ G}, S)
39.  LOAD_CONFIG ← (Qd, E, epoch, cid, GT, Init)
40.  Reliably broadcast LOAD_CONFIG to GT ∩ R
41.  Broadcast GT to S
42.  GT ← ∅

C. Proposition Proofs

See Lemma 1.

Proof. First, only one leader will eventually be elected by all the live replicas. This can be shown in two cases. In the first case, the group is not partitioned: all the live replicas know each other, and the SelectLeader function ensures that only one leader is elected by all live replicas. In the second case, the group is partitioned. Without loss of generality, suppose there are two different leaders, denoted rL,1 and rL,2, where rL,1 is elected by replica set P and rL,2 by replica set Q, with rL,1 ∉ Q, rL,2 ∉ P, and P = G \ Q. Following the partial synchrony assumption, if the replicas in P never learn of Q and vice versa, then either P or Q is removed by the Rendezvous of the group. Since, by assumption, there is only one Rendezvous for each replica group, only one partition, either P or Q, will eventually survive. Thus, eventually there is only one leader, either rL,1 or rL,2.
When a new leader is elected by all replicas, it determines Qd, E, and G and broadcasts them to all the live replicas. Through the reliable underlying channel, all replicas will eventually load the same Qd, E, and G after leader election. Moreover, a monotonically increasing epoch number prevents a replica from loading state from an old leader. Thus, all live replicas will eventually load the same Qd, E, and G after the leader election with the largest epoch.

See Lemma 2.

Proof. The proof of group reconfiguration synchrony is the same as that of leader election synchrony and is therefore not repeated here.

See Lemma 3.

Proof. If there is a leader election or a group reconfiguration before the consensus instance terminates, then Lemmas 1 and 2 ensure that all replicas will have e in E(c). If there is no leader election or group reconfiguration before the consensus instance terminates, the reliable underlying communication channel ensures that all the live replicas will eventually receive the same decision from the leader. Since ri has delivered e into E(c), e is in the decision for cycle c. Therefore, e will be eventually received and delivered by all live replicas.

With the above lemmas, the main result can be obtained. Before that, an important property of the late event handling approach needs to be verified.

See Lemma 4.

Proof. The lemma can be proved by induction.
Basis Step. When c = c0, i.e., the cycle in which the first event from s is received, as determined by trecv,s(1), Ω(s, c0) is computed only from trecv,s(1) and is therefore the same on all the live replicas.
Induction Step. Assume all the live replicas in G expect to deliver the same set of events Ω(s, ck) for sender s ∈ S and cycle ck (ck ≥ c0). For cycle ck + 1, there are two cases. (1) If there is no consensus instance for cycle ck, then MaxSeq(s, ck) = Seq(s, ck), and Ω(s, ck + 1) is the same on all replicas. (2) If there is a consensus instance for cycle ck, then, following Lemma 3, all live replicas will eventually deliver the same events to E(ck). Let Seq(s, j) be the maximal sequence number of the nonempty events in E(ck). Then MaxSeq(s, ck) = Seq(s, j), and Ω(s, ck + 1) is the same on all the live replicas. By the principle of mathematical induction, the lemma holds for all cycles after c0.

See Theorem 5.

Proof. Since all replicas share the same sender set S, Lemmas 3 and 4 ensure that all the live replicas will eventually deliver the same set of events for any cycle, either directly from received events (Lines 5-14 of Algorithm 1) or from the consensus result (Lines 2-3 of Algorithm 1).
Let e1(s1, j1) and e2(s2, j2) be delivered on r for cycles c1 and c2, respectively. If c1 = c2 = c, then (c, γ1, e1) and (c, γ2, e2) will eventually be delivered into Qd of all replicas. If c1 ≠ c2, then (c1, γ1, e1) and (c2, γ2, e2) will eventually be delivered into Qd of all replicas. Moreover, since γ1 and γ2 are determined only by s1, s2, j1, and j2, γ1 ≠ γ2 for different e1 and e2. Since Qd is linearly ordered first by c and then by γ, there exists a mapping from each unique (c, γ) to a unique nonnegative integer λ; let φ(c, γ) = λ be such a mapping. Let φ(c1, γ1) = λ1 and φ(c2, γ2) = λ2. Then, all replicas will eventually deliver (λ1, e1) and (λ2, e2) with λ1 ≠ λ2.
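Concretely, φ is just the rank of (c, γ) in the lexicographic order of Qd, which the following small Python fragment (ours, with assumed data shapes) illustrates:

    # Sketch: lambda is the position of (c, gamma) in the lexicographically
    # ordered delivery queue, so distinct (c, gamma) pairs get distinct lambdas.
    entries = [(1, 0), (1, 1), (2, 0), (3, 0)]        # (cycle, gamma) pairs
    phi = {cg: lam for lam, cg in enumerate(sorted(entries))}
    print(phi[(2, 0)])   # -> 2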

See Corollary 6.

Proof. Corollary 6 follows directly from Theorem 5.

See Theorem 7.

Proof. Theorem 5 ensures that if e is in Qd of ri, then e is or was in Qd of all the live replicas in G with the same λ. In Algorithm 3, if e can be removed from ri, then ri must have received, from all the live replicas, λc values at least equal to λ. Since events are delivered to the application in sequence, e must have been delivered to the application on all replicas.

See Corollary 8.

Proof. Theorem 5 ensures that the ADD_NEIGHBOR event is delivered to Qd of all replicas with the same λ. Since (λ, e) and (c, γ, e) have a one-to-one mapping for the same event, all replicas deliver ADD_NEIGHBOR in the same cycle. Moreover, since n and ∆t are fixed, all replicas are timed to deliver the first event e0 from s in the same future cycle ck. Theorem 5 ensures that e0 is delivered with the same delivery sequence λ0 on all live replicas.

See Corollary 9.

Proof. Theorem 5 ensures that the RM_NEIGHBOR event is delivered to Qd of all replicas with the same λ. Since (λ, e) and (c, γ, e) have a one-to-one mapping for the same event, all replicas handle RM_NEIGHBOR in the same cycle c. From cycle c + 1, s is removed from S. Thus, all replicas deliver the last event of s at c, and Theorem 5 ensures that this last event is delivered with the same delivery sequence on all live replicas.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was partially supported by University of Macau Research Grants Nos. MYRG2017-00091-FST and MYRG2015-00043-FST.