Abstract

Existing Internet protocols assume persistent end-to-end connectivity, which cannot be guaranteed in disruptive and high-latency space environments. To operate over these challenging networks, a store-carry-and-forward communication architecture called Delay/Disruption Tolerant Networking (DTN) has been proposed. This work provides the first examination of the performance and robustness of the Contact Graph Routing (CGR) algorithm, the state-of-the-art routing scheme for space-based DTNs. To this end, after a thorough description of CGR, two appealing satellite constellations are proposed and evaluated by means of simulations carried out in DtnSim, a simulator introduced as another relevant contribution of this work. The results enabled the authors to identify existing CGR weaknesses and enhancement opportunities.

1. Introduction

The autonomous transmission of information resources and services through the Internet has changed the lifestyle on Earth. Moreover, the potential benefits of extending the Internet into space have been analyzed by the community [1–4]. Nonetheless, the consideration of Internet protocols for space missions has been limited due to fundamental environmental differences. In particular, in a space flight mission, the highly varying communication ranges, the effect of planet rotation, and on-board power restrictions compel communication systems to face several disruptive situations nonexistent in Internet systems. Furthermore, the propagation delay of signals in Deep Space environments is generally in the order of minutes or even hours. These delay and disruption conditions contraindicate traditional Internet protocol operations, as they are largely based on an instantaneous flow of information between sender and receiver nodes.

As a result, Delay/Disruption Tolerant Networks (DTNs) have recently been considered as an alternative for extending Internet boundaries into space [5]. In particular, recent studies have considered their applicability in Low-Earth Orbit (LEO) satellite constellations [6–11]. To overcome link disruptions, DTN nodes temporarily store and carry in-transit data until a suitable next-hop link becomes available [12]. To overcome delays, end-to-end feedback messages are no longer assumed to be continuous or instantaneous. This distinctive characteristic allows DTN to operate in environments where communications can be challenged by latency, bandwidth, data integrity, and stability issues [13].

During the last decade, the DTN Bundle Protocol, along with different adaptation layers, has been proposed [14–19], several routing strategies have been studied [20–30], and diverse software stacks have been publicly released [31–35]. Furthermore, some of the latter approaches were successfully validated both on LEO [36] and Deep Space missions [37], driven by the UK Space Agency and NASA, respectively. Also, DTN has been in pilot studies on the International Space Station (ISS) since 2009 [38] and has been operational on the ISS since May 2016. Presently, the Internet research community [39], along with several space organizations [40, 41], has joined the industry in the standardization of DTN protocols [42].

In spite of the recent advances in the area, the analysis of the fault tolerance of existing DTN solutions remains an open research topic. Studying the reliability of DTN is mandatory before seriously considering its applicability in the harsh space environment, where radiation effects, vibrations, collisions, and outgassing, among others, pose significant challenges for man-made spacecraft. In [43], the reliability of opportunistic DTNs was studied, but those results do not apply to the space domain, where communications are deterministic. More recently, the authors presented preliminary results on a reliability assessment of DTN for space applications [44]. However, the provided simulation analysis was based on simplistic satellite networks without a comprehensive analysis of the behavior of the underlying routing algorithm.

In this work, we tackle the weaknesses of [44] by providing an extensive fault injection analysis based on two appealing and realistic case studies of delay-tolerant satellite constellations previously presented in [10]. An initial performance comparison of these topologies is one of the contributions of this work. For the reliability analysis, the state-of-the-art version of the Contact Graph Routing (CGR) algorithm was considered [30]. However, to the best of the authors’ knowledge, there is no in-depth description of the latest CGR algorithm, which is only available as part of an open-source DTN implementation [32]. This results in a fuzzy interpretation of the algorithm, which is particularly problematic because several further approaches claim to leverage CGR as a basis (such as [45, 46]). Consequently, providing an accurate and thorough description of how CGR is currently implemented is another contribution of this paper. The CGR algorithm was implemented in DtnSim, a new simulator specifically designed to evaluate space DTNs. DtnSim is also introduced in this article for the first time and is expected to be released under an open-source license. Results obtained from DtnSim are the final contribution of this article: an assessment of the fault-tolerance capability of the CGR algorithm in DTN constellations. This work can thus be considered an improved, extended, and archival-quality version of [44].

The present paper is structured as follows. In Section 2 an overview of DTN, a detailed description of the CGR algorithm, and an appealing failure model are provided. Two realistic satellite constellation scenarios are described and analyzed by means of a new simulation framework in Section 3. In Section 4 open research aspects are summarized and discussed before concluding in Section 5.

2. DTN Overview and System Model

A simple disrupted network of 4 nodes is illustrated in Figure 1, where node A sends data to node D. Similar examples can be found in [47, 48] describing the behavior of satellite constellations with sporadic connectivity as well as Deep Space networks. The time-evolving topology is represented by a timeline of 1600 seconds, where different communication opportunities (also known as contacts) exist among nodes A, B, C, and D. Formally, a contact is defined in [47] as an interval during which it is expected that data will be transmitted by one DTN node to another. For example, node A has two direct contacts with B (A-B) from 1000 s to 1150 s and from 1300 s to 1400 s. However, the effective utilization of these communication episodes depends on the implemented protocols.

Figure 1(a) shows the expected performance of Internet protocols, which require persistent connectivity with the final destination (also known as an end-to-end path). Due to the disruptive nature of the network, node A is only able to directly reach node D through B from 1100 s to 1150 s (cyan-colored contacts), which allows for an effective throughput of 50 s multiplied by the node’s data-rate. Other contacts will remain unutilized by Internet protocols (yellow-colored contacts). By exploiting local storage within each node, DTN is able to make better utilization of the communication resources, as shown in Figure 1(b). For example, in order to better use the B to D contact at 1100 s, node A can transmit the data to node B in advance, starting from 1000 s. Furthermore, two other delay-tolerant paths can be considered to node D, one via node C at 1100 s and another via node B at 1300 s. Thus, the effective throughput of DTN in this example is 300 s of transmission time (six times that achievable with traditional Internet protocols). For a more general overview, Table 1 compares different aspects of Internet and DTN protocols.

In spite of the data delivery improvement, different challenges and optimization opportunities exist in DTN. For example, the first contact of nodes A to B in Figure 1(b) is not fully utilized. This is because node A was able to make a good “guess” about node B’s connectivity and its residual capacity towards final destination D. However, it can be quite difficult to accurately make such estimations without a stable and permanent connection with neighbor nodes. Had node A lacked this local knowledge, it could have transmitted more data than node B could forward, provoking congestion at node B. Congestion is, indeed, a popular and open research topic in DTN [49–51]. In general, and in contrast with Internet protocols, relying on a realistic local understanding of the current connectivity in the network is not always feasible and depends on the type of DTN the nodes run on.

As a result, existing routing and forwarding schemes for DTN have sought to acquire the most complete and precise network state information. In opportunistic DTNs, no assumptions can be made as encounters between nodes occur unexpectedly [28]. For these networks, epidemic strategies based on message replication driven by different criteria have been applied [27, 29]. Nevertheless, in realistic situations, contacts are rarely totally random but obey nodes’ movement with greater probability of meeting certain neighbors than others. Protocols such as [20, 25, 26] are popular solutions that propose to infer the encounter probability to improve data delivery metrics. On the other hand, connectivity in certain DTNs such as space networks can be precisely anticipated based on accurate mathematical models describing object trajectories in space. In the literature, these networks are known as scheduled or deterministic DTNs [12] and are the appropriate model to study satellite constellations.

In general, spacecraft trajectories and orientation can be accurately predicted by means of appropriate mathematical models [52]. Also, mission operations generally account for precise models of the communication systems both on-board and on-ground. As a result, the forthcoming spacecraft to spacecraft or spacecraft to ground contacts can be determined or even controlled in advance. This unique characteristic has made routing in scheduled DTN a distinct research area of increasing interest during the last decade.

The first analyses of routing in scheduled DTNs date back to 2003, when Xuan et al. proposed time-evolving graphs to represent changes in network topologies and then studied shortest, foremost, and fastest journey metrics [21]. Later, a specific routing framework was introduced in [22] to derive a space-time routing table comprising next hops for each time interval. Similar schemes were reported in [23]. However, these static route calculation approaches relied on a complete precalculation of routes on ground and a timely distribution to the network nodes. Due to the precalculation, these approaches lacked responsiveness to varying traffic conditions and to topology changes resulting from dynamically added or removed contacts. An alternative approach addressing this shortcoming was later introduced under the name of Contact Graph Routing (CGR).

2.1. Contact Graph Routing

Instead of a centralized route calculation, CGR, proposed by S. Burleigh (NASA JPL), follows a distributed approach: the next hop is determined by each DTN node on the path by recomputing the best route to the destination as soon as a bundle (i.e., a bundle protocol data unit) is received. This routing procedure assumes that a global contact plan, comprising all forthcoming contacts, is distributed in advance in a timely fashion in order to enable each node to have an accurate understanding of the network [47]. Table 2 shows the contact plan for the sample network from Figure 1. Routes can thus be calculated by each node on demand, based on extensive topological knowledge of the network combined with the assumed traffic status. This workflow is illustrated in Figure 2, where a contact plan is initially determined by means of orbital propagators and communication models, then distributed to the network, and finally used by the DTN nodes to calculate efficient routes to the required destinations. Indeed, by combining the contact plan with local information such as outbound queue backlog and excluded neighbors (e.g., unresponsive neighbors), CGR is able to dynamically respond to changes in network topology and traffic demands. An early version of CGR was flight-validated in Deep Space by NASA in 2008 [37], and CGR has been one of the most studied routing solutions for space networking since then.

CGR was initially documented in 2009 and later updated in 2010 as an experimental IETF Internet Draft [53]. By the end of the same year, Segui et al. [54] proposed the earliest arrival time as a convenient monotonically increasing optimization metric that avoids routing loops and enables the use of standard Dijkstra’s algorithm for path selection [55]. This enhancement was then introduced in the official version of CGR included in the Interplanetary Overlay Network (ION) DTN stack developed by the Jet Propulsion Laboratory [32]. In 2012, Birrane et al. proposed source routing to reduce CGR computations in intermediate nodes at the expense of packet header overhead [56]. Later, in 2014, the authors of [57] studied the implementation of temporal route lists as a means to minimize CGR executions. In the same year, Bezirgiannidis et al. [58] suggested monitoring transmission queues within CGR, since queued traffic delays the earliest transmission opportunity (CGR-ETO). In the same paper, a complementary overbooking management innovation enables proactive reforwarding of bundles whose place in the outbound queue was taken by subsequent higher-priority traffic. Both ETO and overbooking management are now included in the official CGR version. Regarding congestion management, further extensions were proposed as well [49–51]. Most recently, in 2016, Burleigh et al. introduced an opportunistic extension as a means of enlarging CGR applicability from deterministic space networks to opportunistic terrestrial networks [48]. The latter is not a trivial contribution since it could, if successful, pave the way towards implementing space DTN advances on ground-based networks. At the time of this writing, the CGR procedure is being formally standardized as part of the Schedule-Aware Bundle Routing (SABR) procedure in a CCSDS Blue Book [47]. All these modifications have been implemented in the ION software. As a result, this software has become an important point of reference for the latest routing and forwarding mechanisms for space DTNs. The current version 3.5.0 of ION was released in September 2016 and is available as free software [33].

Even though the latest version of the CGR algorithm is implemented as part of the ION 3.5.0 open-source code (in fact, the CGR algorithm implemented in ION 3.5.0 includes a few parameters and procedures related to opportunistic CGR (O-CGR) [48]; O-CGR is an experimental CGR extension that also considers discovered contacts in addition to those in the contact plan; since in this work all contacts are scheduled, the probabilistic calculations are not discussed nor described in this section), there is no detailed and formal description of the algorithm available yet (the CCSDS documentation is still under development [47]). As a result, in this section, an in-depth explanation of the CGR scheme is provided. For the sake of clarity, the discussed algorithms are structured following their implementation in ION; however, they can be translated to more compact and elegant expressions without continue, break, and return statements.

The CGR Forward routine depicted in Algorithm 1 is called at any network node every time a new bundle B is to be forwarded. Initially, the algorithm checks if the local view of the topology expressed in the contact plan CP was modified or updated since the last call (Algorithm 1, lines (2) and (3)). If modified, a route list structure RL, holding all valid routes to each known destination, is cleared in order to force an update of the route table. Indeed, the route list RL is derived from the local contact plan CP. It is interesting to note that, in contrast with Internet routes, DTN routes are expressed as a function of time. Therefore, they are only valid for a given period of time and need to be revisited by CGR for every new bundle. Next, the procedure populates a proximate nodes list PN comprising all possible neighbor nodes that, according to the route list RL, have a valid path towards the destination (Algorithm 1, line (6)). This step is executed by the identifyProxNodes routine, which is detailed in Algorithm 2 and discussed below. An excluded nodes list EN is used in this step to avoid the consideration of administratively forbidden neighbors (e.g., unresponsive nodes) or the previous bundle sender, to minimize routing loops (Algorithm 1, lines (4) and (5)).

   input: bundle to forward B, contact plan CP, route list RL, excluded nodes EN, proximate nodes PN
   output: bundle B enqueued in the corresponding queue
(1)  EN ← empty;
(2)  if CP changed since last RL calculation then
(3)     RL ← empty;
(4)  if B forbids return to sender then
(5)     EN ← EN ∪ B sender node;
(6)  PN ← identifyProxNodes (B, CP, RL, EN);
(7)  if B is critical then
(8)    enqueue a copy of B to each node in PN;
(9)    return
(10) set bestNode to empty;
(11) for each node N in PN do
(12)   if bestNode is empty then
(13)        bestNode ← N
(14)   else if N.arrivalTime < bestNode.arrivalTime then
(15)        bestNode ← N
(16)   else if N.arrivalTime > bestNode.arrivalTime then
(17)        continue
(18)   else if N.hopCount < bestNode.hopCount then
(19)        bestNode ← N
(20)   else if N.hopCount > bestNode.hopCount then
(21)        continue
(22)   else if N.nodeId < bestNode.nodeId then
(23)        bestNode ← N
(24) if bestNode is not empty then
(25)   enqueue B to bestNode outbound queue;
(26)   manageOverbook (bestNode)
(27) else
(28)   enqueue B to limbo
(29) return
   input: bundle to forward B, contact plan CP, route list RL, excluded nodes EN
   output: proximate nodes list PN
(1)  if RL is empty then
(2)        RL ← loadRouteList (B, CP);
(3)  PN ← empty;
(4)  for each route R in RL do
(5)     if R latest transmission time is in the past then
(6)         continue  (ignore past route)
(7)     if R.arrivalTime > B deadline then
(8)         continue  (route arrives late)
(9)     if R residual capacity < B size then
(10)        continue  (not enough capacity)
(11)   if R.nextHop ∈ EN then
(12)        continue  (next hop is excluded)
(13)   if local outbound queue towards R.nextHop is depleted then
(14)        continue  (outbound queue depleted)
(15)   for each node P in PN do
(16)        if P.nodeId = R.nextHop then
(17)         if R.arrivalTime < P.arrivalTime then
(18)          replace P route metrics with those of R
(19)         else if R.arrivalTime > P.arrivalTime then
(20)          continue  (previous route was better)
(21)         else if R.hopCount < P.hopCount then
(22)          replace P route metrics with those of R
(23)         else if R.hopCount > P.hopCount then
(24)          continue  (previous route was better)
(25)         break
(26)   if no node in PN matches R.nextHop then
(27)        PN ← PN ∪ R.nextHop;
(28)        record R arrival time and hop count in the new PN entry;
(29) return PN

Once populated, the PN list can be used to forward the bundle to the appropriate neighbors. If the bundle is critical (a special type of bundle), the bundle is cloned and enqueued to all possible neighbors in the PN list (Algorithm 1, lines (7) to (9)). If the bundle is of normal type, a single candidate node is chosen from the proximate nodes list PN. Neighbors with the best arrival time to the destination have the top priority, then those with the least hop count (in CGR terminology, one contact is one hop), and finally those with a smaller node id (Algorithm 1, lines (10) to (23)). Then, if a suitable proximate node is found, the bundle is inserted in the corresponding outbound queue before executing the overbooking management procedure (the overbooking management procedure defined in [58] aims at reordering the local outbound queues when bundles with higher priorities replace less urgent bundles; since in this work we assume all bundles have the same priority, this procedure is not described) (Algorithm 1, lines (25) to (26)). However, if CGR Forward fails to find a suitable neighbor, the bundle is stored in a special memory space called the limbo, waiting for a higher-level process to either erase it or retry a new forwarding later (e.g., after a contact plan update). As a result, after the CGR routine is completed, one or more bundles might be stored in the local memory waiting for the contact with the corresponding proximate node. As discussed later, if, for one reason or another (e.g., congestion or failure), the bundle is not transmitted during the expected contact, it will be removed from the queue and rerouted by this CGR routine.
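For illustration, the following Python sketch (not part of ION or DtnSim; the ProxNode fields and the function name are assumptions of this example) captures the tie-breaking rule used to select a single candidate neighbor: earliest arrival time first, then fewest hops, then smallest node id.

from dataclasses import dataclass

@dataclass
class ProxNode:
    node_id: int        # neighbor (proximate node) identifier
    arrival_time: float # best-case arrival time at the destination via this neighbor
    hop_count: int      # hop count (contacts) of the best route via this neighbor

def select_best_proximate_node(prox_nodes):
    """Earliest arrival time first, then fewest hops, then smallest node id."""
    best = None
    for node in prox_nodes:
        if best is None:
            best = node
        elif (node.arrival_time, node.hop_count, node.node_id) < \
             (best.arrival_time, best.hop_count, best.node_id):
            best = node
    return best  # None means no suitable neighbor: the bundle goes to the limbo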

The identifyProxNodes routine, depicted in Algorithm 2, explores existing routes in order to derive a proximate nodes list PN. The list is used by the main CGR routine and is formed by a set of nonrepeating neighbor nodes that are capable of reaching the bundle destination. If the route list RL is empty (i.e., the first time the routing routine is executed after a contact plan update), the load route list function is called in order to find all routes towards the destination of B (Algorithm 2, lines (1) and (2)). At this stage, RL accounts for all possible routes to the destination. Each of them needs to be evaluated in order to populate the proximate nodes list PN. Initially, those routes that do not satisfy specific selection criteria are discarded (Algorithm 2, lines (3) to (14)). In particular, routes whose latest transmission time is in the past, whose arrival time is later than the bundle deadline, whose residual capacity is smaller than the bundle size, whose proximate node is among the excluded nodes, or whose local outbound queue is depleted are ignored. The remaining routes are then considered suitable, and the corresponding proximate node in the list is either updated with a better route (Algorithm 2, lines (15) to (24)) or directly added to the list (Algorithm 2, lines (26) and (27)). The replacement criteria are consistent with Algorithm 1: the best arrival time is considered first and then the route hop count. During this process, necessary route metrics such as the arrival time and hop count of the best route are also stored in each proximate node data structure contained in PN.
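The route selection criteria can be summarized with the following Python sketch (an illustrative reading of the description above, not the ION code; the Route fields and the queue_has_room callback are assumptions of this example):

import time
from dataclasses import dataclass

@dataclass
class Route:
    next_hop: int          # first neighbor (proximate node) of the route
    to_time: float         # latest transmission time (end of the route's first contact)
    arrival_time: float    # best-case delivery time at the destination
    residual_capacity: int # bytes still available along the route
    hop_count: int

def route_is_usable(route, bundle_size, bundle_deadline, excluded_nodes,
                    queue_has_room, now=None):
    """Mirror of the discarding criteria of Algorithm 2 (illustrative only)."""
    now = time.time() if now is None else now
    if route.to_time <= now:                    # route already in the past
        return False
    if route.arrival_time > bundle_deadline:    # route arrives too late
        return False
    if route.residual_capacity < bundle_size:   # not enough capacity
        return False
    if route.next_hop in excluded_nodes:        # next hop is excluded
        return False
    if not queue_has_room(route.next_hop):      # local outbound queue depleted
        return False
    return True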

The routines in Algorithms 1 and 2 are executed on a per-bundle basis. The reason behind this is that the parameters of each bundle (destination, deadline, and size), the local outbound queue status, and the excluded nodes list need to be revised on every new forwarding in order to base the decision on an up-to-date version of the proximate nodes list PN. In general, these routines are considered part of the forwarding process of CGR. On the other hand, the route list RL needs to be updated whenever the local contact plan is modified. The determination of the routes is considered part of the routing process of CGR and is described below.

In general, to find all possible routes from a source to a destination, CGR uses a contact graph expression of the contact plan. A contact graph is a conceptual directed acyclic graph whose vertices correspond to contacts while the edges represent episodes of data retention (i.e., storage) at a node [47]. Also, two notional vertices are added: the root vertex, which is a contact from the sender to itself, and a terminal vertex, which is a contact from the destination to itself. Even though the resulting contact graph structure may seem counterintuitive, it is a convenient static representation of a time-evolving topology that can be used to run traditional graph algorithms such as Dijkstra’s searches. For example, Figure 3 illustrates the contact graph corresponding to the network shown in Figure 1. The three discussed routes with their corresponding metrics are also included in the illustration (note that another feasible path from A to D exists through contacts 1 and 9).
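As a concrete illustration of this structure, the following Python sketch (an assumption-laden simplification, not DtnSim code) builds the successor relation of a contact graph: contact a is connected to contact b when data received over a can be retained at a's receiving node and later transmitted over b.

from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Contact:
    frm: str      # transmitting node
    to: str       # receiving node
    start: float  # contact start time (s)
    end: float    # contact end time (s)

def build_contact_graph(contacts):
    """Vertices are contacts; an edge a -> b models data retention at node a.to
    followed by transmission over contact b."""
    successors = defaultdict(list)
    for a in contacts:
        for b in contacts:
            # b must depart from the node where a delivers, and b must still be
            # open after a has started delivering data to that node.
            if a is not b and a.to == b.frm and b.end > a.start:
                successors[a].append(b)
    return successors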

In order to find all possible routes in the contact plan, the load route list routine performs a series of Dijkstra searches over a contact graph derived from the contact plan. The algorithm is listed in Algorithm 3 and is described as follows. A work area is reserved for each contact in the contact plan in order to run the Dijkstra searches. Initially, the algorithm clears all the required parameters in each contact working area (Algorithm 3, lines (2) to (7)). Then, the routine loops to find different routes using Dijkstra’s algorithm (Algorithm 3, line (10)). The metric driving the shortest path search is the arrival time, which must be calculated from the starting time and expected delay of each contact in the explored path. In order to guarantee that each Dijkstra execution provides a distinct route, the load route list process removes the limiting contact of the last route found from the following search. The limiting contact is defined as the earliest-ending contact in the route path [47]. In general, the limiting contact happens to be the first contact of the route (e.g., see contacts 1, 7, and 5 in Figure 3). Therefore, removing the limiting contact forces the shortest path search to provide the next best route towards the destination (Algorithm 3, lines (26) to (34)). However, in some special cases, the transmitter node is behind a very long contact (e.g., an Internet contact). For these cases, an anchoring mechanism allows the algorithm to find several routes through such long contacts. The anchored search begins as soon as the algorithm detects that the first contact is not the limiting contact (Algorithm 3, lines (26) to (28)). In this stage, the anchor contact is stored, and the limiting contact in the path is found and suppressed for further calculations (Algorithm 3, lines (29) to (34)). After clearing the working area (Algorithm 3, lines (35) to (38)), a new search is executed, now with the limiting contact detached from the contact graph. As soon as the first route without the anchor contact as its first contact is found, the anchored search ends, the anchor contact is suppressed, and the normal search continues (Algorithm 3, lines (15) to (24)). The search ends when no more routes can be found in the contact graph.

   input: bundle to forward B, contact plan CP
   output: route list RL
(1)  rootContact ← notional contact from the local node to itself;
(2)  for each contact C in CP do
(3)         C.arrivalTime ← ∞;
(4)         C.predecessor ← none;
(5)         C.visited ← false;
(6)         C.suppressed ← false;
(7)         C.hopCount ← 0;
(8)  set anchorContact to empty;
(9)  while  1  do
(10)      route ← Dijkstra (rootContact, B destination, CP);
(11)      if route is empty then
(12)        break  (no more routes in contact graph)
(13)      firstContact ← first contact of route;
(14)      if anchorContact not empty then
(15)        if firstContact ≠ anchorContact then
(16)         for each contact C in CP do
(17)              C.arrivalTime ← ∞;
(18)              C.predecessor ← none;
(19)              C.visited ← false;
(20)             if C was suppressed during the anchored search then
(21)               C.suppressed ← false;
(22)         anchorContact.suppressed ← true;
(23)         set anchorContact to empty;
(24)         continue  (go to next Dijkstra’s search)
(25)      RL ← RL ∪ route;
(26)      if firstContact is the limiting contact of route then
(27)         firstContact.suppressed ← true;
(28)      else
(29)        anchorContact ← firstContact;
(30)        for each contact C in route do
(31)         if C is the limiting contact of route then
(32)             limitContact ← C;
(33)             break  (limit contact found)
(34)      limitContact.suppressed ← true;
(35)      for each contact C in CP do
(36)          C.arrivalTime ← ∞;
(37)          C.predecessor ← none;
(38)          C.visited ← false;
(39) return RL
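To make the routing process concrete, the following Python sketch re-states the core of the route search described above under strong simplifying assumptions (transmission and propagation delays ignored, unlimited contact volumes, no anchoring); it is an illustration of the mechanism, not the ION implementation, and all names are chosen for this example.

import heapq
from collections import namedtuple

Contact = namedtuple("Contact", "cid frm to start end")

def dijkstra_earliest_arrival(contacts, source, dest, t0=0.0, suppressed=()):
    """Return (arrival_time, [contacts]) of the earliest-arrival route, or None."""
    best = {}                              # contact id -> best start-of-use time
    heap = [(t0, 0, source, [])]           # (arrival time, tie-breaker, node, path)
    tie = 1
    while heap:
        arrival, _, node, path = heapq.heappop(heap)
        if node == dest:
            return arrival, path
        for c in contacts:
            if c.cid in suppressed or c.frm != node or c.end <= arrival:
                continue
            t = max(arrival, c.start)      # wait (store) until the contact opens
            if t < best.get(c.cid, float("inf")):
                best[c.cid] = t
                heapq.heappush(heap, (t, tie, c.to, path + [c]))
                tie += 1
    return None

def load_route_list(contacts, source, dest, t0=0.0):
    """Enumerate routes by repeatedly suppressing the limiting (earliest-ending)
    contact of the route just found, as the load route list routine does."""
    routes, suppressed = [], set()
    while True:
        found = dijkstra_earliest_arrival(contacts, source, dest, t0, suppressed)
        if found is None or not found[1]:
            return routes
        arrival, path = found
        routes.append((arrival, path))
        limiting = min(path, key=lambda c: c.end)   # earliest-ending contact
        suppressed.add(limiting.cid)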

Although complex and extensive, CGR is considered to be among the most mature strategies for forwarding and routing in space DTNs [30]. To the best of the authors’ knowledge, this section constitutes one of the most detailed overviews of the algorithm in the literature to date. The correctness of this description is supported by the implementation of CGR in a simulation platform where the satellite network is subjected to faults, as described below.

2.2. Fault Model

Over recent years, the semiconductor industry has been particularly concerned with the effects of radiation on integrated circuits and embedded systems in general [59]. The rationale behind this concern lies not only in the use of these systems in radiation-prone environments but also in the increasing degree of integration of devices embedded in the same chip. Recent studies have shown that the smaller the feature sizes, the greater the sensitivity to radiation-induced errors [60]. As a consequence, modern embedded systems may be susceptible to low-energy particles, including those observed within the Earth’s atmosphere.

This effect is even more dramatic in space missions, which require systems that can operate reliably for long periods of time with little or no maintenance. This is the case for the satellite constellations under study in this work. Among the possible errors, transient errors affect the system temporarily and are usually caused by radiation, notably single event upsets (SEUs). In particular, any circuit comprising memory elements (registers, flip-flops, internal memory, etc.) can at any moment undergo the modification of one or several bits of information due to ionizing particles. Other transient outages might be provoked by overheating, radio-frequency interference, or a processor reboot following a software exception. Traditionally, missions are designed to tolerate these failures by detecting the erroneous behavior and then recovering the system, typically by means of a full restart [59].

The random occurrence in time and space of such failure phenomena, the probability of an error occurring, and the probability of effectively detecting the unwanted behavior can be modeled by means of an exponential (Poisson) distribution [61]. The exponential model emphasizes the number of failures, the time interval over which they occur, and the environmental factors that may have affected the outcomes. As a result, it provides two feasible outputs at every moment: normal operation or failure. The model does not assume wear-out or depletion (as in batteries), implying that it only accounts for failures occurring at random intervals but with a fixed long-term average frequency. Indeed, the outcome of this kind of failure model is also known as a memoryless distribution [62]. The exponential distribution model is the most commonly adopted fault model, mainly due to its simplicity and effectiveness. It takes a single known average failure rate parameter λ and can be conveniently described by means of

R(t) = e^{-λt}.    (1)

In (1), λ is known as the failure rate, t is the time, and R(t) is the probability that the component is still operating correctly at time t. The failure rate reduces to the constant λ for any time: this shows that the exponential model is indeed memoryless. One of the input parameters for obtaining the module outages is the Mean Time To Failure (MTTF), which is 1/λ. Moreover, while the time to failure is determined by the exponential model with the MTTF, the Mean Time To Repair (MTTR) defines the time to recover after the failure. Both the error detection and the system recovery mechanism are represented as part of the MTTR interval. Accurately determining both MTTF and MTTR for a real spacecraft component was investigated in [63].
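As an illustrative sketch of this fault model (not the DtnSim implementation; the function name and the choice of exponentially distributed repair times are assumptions of this example), the outage intervals of a node can be sampled as follows:

import random

def outage_schedule(mttf, mttr, horizon, rng=None):
    """Sample alternating up/down intervals for one node over [0, horizon) seconds.
    Times to failure are exponential with rate 1/MTTF (memoryless); repair times
    are drawn here with mean MTTR (a fixed MTTR, e.g. a 5 min reboot, could be
    used instead). Returns a list of (fail_start, fail_end) outage intervals."""
    rng = rng or random.Random()
    t, outages = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mttf)    # exponential time to next failure
        if t >= horizon:
            return outages
        down = rng.expovariate(1.0 / mttr)  # time needed to detect and recover
        outages.append((t, min(t + down, horizon)))
        t += down

# Example: MTTF of 700 s and MTTR of 5 minutes over a 24-hour scenario.
faults = outage_schedule(mttf=700.0, mttr=300.0, horizon=24 * 3600)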

By means of the MTTF, the MTTR, and the fault model, a fault injection system was designed to study the resulting traffic flow of a CGR-based DTN under failure conditions. In particular, the MTTF and MTTR are used as parameters to randomly select the fault position (i.e., the node to fail) and the start and end times of each outage in the DTN network. This phenomenon is illustrated by an example in Figure 4, where a failure in node B at 1150 s and a subsequent recovery at 1450 s force a partial usage of route 1 and prevent sending any data through route 2. In this particular example, 50% of the route 1 data is kept stored in node B’s memory when the failure occurs (i.e., reliable storage is assumed in each node). Once recovered, the node finds that it has bundles stored for a previous contact that need to be reforwarded. As per the description in Section 2.1, CGR will find that the ongoing direct contact with D (from 1400 s to 1500 s) is a valid path. However, all data that node A was supposed to forward to B via route 2 will need to be reforwarded to an alternative path. In CGR, this will happen after the contact ends with bundles still in the outbound queue. As a result, the overall throughput of the DTN system as well as the delivery time of the data might be degraded by transient errors that block the calculated route path.

Although this example may be intuitively obvious, the complexity of the analysis drastically increases as more nodes, more contacts, more traffic flows, and more failures are injected into the network. Furthermore, the system degradation is expected to depend on the network topology and the derived contact plan. In consequence, the CGR algorithm and the exponential failure model were integrated into a single simulation framework designed to provide accurate measurements of the degradation of DTN metrics in the presence of transient failures.

3. Simulation Analysis

3.1. Simulation Platform

There exist several tools to evaluate DTNs. Among them, the ONE simulator [64] has been extensively used for DTN studies; however, the platform is specifically designed to model opportunistic DTNs (social networks, vehicular networks, etc.). On the other hand, emulation environments such as CORE [65] are available to directly test existing DTN implementations; but CORE has to be executed in real time, which hinders its application in extensive analyses. To tackle these limitations, a new simulator called DtnSim (DtnSim is still under development; however, the code will be made publicly available under an open-source license at https://bitbucket.org/lcd-unc-ar/dtnsim) was implemented in OMNeT++ [66], a discrete event network simulation platform. Using an event-driven framework allows DtnSim to efficiently simulate scenarios at accelerated speeds. This is crucial for space environments, where analyses over orbital periods spanning several days or weeks are required. Indeed, the architecture of DtnSim is specifically designed to evaluate scheduled DTNs such as satellite and Deep Space systems.

An indefinite number of DTN nodes can be spawned and configured by a single file in DtnSim. Each of these nodes is based on the layered architecture illustrated in Figure 5. This architecture is a simplification of the original DTN architecture [12] that has been adapted for DTN-based satellite systems in [10] and is comprised of an application (APP), network (NET), and Medium Access Control (MAC) layer. Physical layer effects such as bit error rate can be modeled within the MAC layer if required.

The APP layer is the element that generates and consumes user data. In general, in the case of Earth observation missions, a large amount of information is produced by on-board remote sensing instruments and must be delivered to a centralized node on Earth [10]. On the other hand, in communication or data relay systems, the data is generated or demanded by end users on the ground and is generally sent via satellites to another node on the ground (i.e., an Internet gateway or mission control center). Typically, the traffic exchange in communication-oriented DTN systems is of the publish/subscribe type, instead of the client/server type traditionally seen on the Internet. A publish/subscribe system allows nodes to autonomously generate and send data to those nodes which can potentially be interested, without relying on instantaneous feedback messages, a key principle in DTN. As a result, in DtnSim, the APP layer allows generating either a single burst of traffic at a given time or periodic amounts of data.

The NET layer is the element in charge of providing delay-tolerant multihop transmission (i.e., routes) and is probably the most mature module of DtnSim at the moment. In this layer, each DTN node includes a local storage unit (nonvolatile memory model) in order to store in-transit bundles. Also, the CGR routing algorithm described in Section 2.1 was implemented as an exchangeable submodule of the network layer; thus, other algorithms can easily be supplied via a clearly defined interface. Independently from the routing submodule, the NET layer can take a contact plan as an input, which defines the forthcoming contact opportunities among the nodes. The contact plan format is based on the format used in the ION software stack: a text file comprising a list of contacts of the following form:

contact  <start>  <end>  <source>  <destination>  <rate>
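A minimal Python sketch of a parser for such a file is shown below (an assumption of this example: lines that do not match the contact form are simply ignored, and node identifiers are kept as plain strings).

def parse_contact_plan(path):
    """Parse contact lines of the form:
       contact <start> <end> <source> <destination> <rate>
    and return them as a list of dictionaries."""
    contacts = []
    with open(path) as plan:
        for line in plan:
            fields = line.split()
            if len(fields) == 6 and fields[0] == "contact":
                start, end, source, destination, rate = fields[1:]
                contacts.append({
                    "start": float(start),   # contact start time (s)
                    "end": float(end),       # contact end time (s)
                    "from": source,          # transmitting node
                    "to": destination,       # receiving node
                    "rate": float(rate),     # transmission data rate
                })
    return contacts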

The MAC layer of the DtnSim node is designed to provide a reliable wireless link and to multiplex the shared medium among nodes with wireless interfaces. Although each DtnSim node is based on a single APP and NET layer, several MAC modules can be attached to the NET module in order to mimic various transceivers on a single node. Thus, this layer sets the real bundle transmission rate, which will depend on the transmitter/receiver module bit-rate as well as any possible medium arbitration mechanism (e.g., contention). An example analysis based on this layer has been presented in [10], where dynamic and static channel negotiations are compared. For the sake of simplicity, the MAC layer used for the present analysis transparently sends and receives bundles without further intervention. Thus, the same module is used to connect nodes linked through the Internet on the ground.

Finally, DtnSim was extended with a fault-injector module based on the exponential model described in Section 2.2. In particular, the MTTF and MTTR are provisioned to each node so as to determine when the node is in a failure or normal operation state. When a node is in a failure state, it is able to neither send nor receive bundles through any of its MAC interfaces. However, if it has bundles stored in the local storage, they are not modified. Such a model mimics existing nonvolatile spacecraft data recorder systems typically found on board medium and high-end spacecraft. Nonetheless, the fault model can easily be extended to delete or corrupt stored data.

3.2. Simulation Scenarios

In order to assess CGR behavior under transient failures in realistic delay-tolerant satellite constellations, two appealing and realistic configurations are proposed and discussed below: a sun-synchronous along-track formation and a Walker-delta formation. Both constellations are based on 16 cross-linked LEO satellites (maximum link range of 1000 km at 500 km height), 25 ground target points, and 6 ground stations. The specific orbital parameters of the satellites are summarized in Table 3, while the ground station and target locations are in Table 4 (the resulting contact plans used for the simulations as well as the STK [67] scenario of the Walker formation can be found at https://upcn.eu/icc2017.html). The Systems Tool Kit (STK) [67] software was used to propagate these parameters over an analysis period of 24 h. An intuitive illustration of the nodes’ locations on a world map is provided in Figure 6. The left side of the picture plots the ground tracks of the Walker formation, while the along-track formation is on the right.

3.3. Simulation Results

In the along-track formation, all 16 satellites are equally spaced and follow a very similar orbital trajectory. In this analysis, each satellite is able to reach the next neighbor in the front and in the back along the trajectory. Among the many benefits of such a formation, satellites do not require complex transfer maneuvers if launched from the same launch vehicle. Also, since satellites perceive similar gravitational perturbations, significant savings in propellant for formation-keeping can be made [68]. From a communications perspective, the topological stability of this formation also favors simple fixed antennas over complex gimbal mounts or electronically steered antennas for intersatellite links (ISLs). Similar topologies have been used in previous satellite DTN studies [10].

On the other hand, a Walker constellation pattern is defined by 4 orbital planes, each with 4 satellites. In comparison with the along-track formation, this constellation provides wider coverage at the expense of reduced ISL communication time. Indeed, as shown in Figure 6, contacts among satellites are only feasible when orbital planes cross. This setup is known to provide efficient communication opportunities [69] and can also be used for DTN studies.

To the best of the authors’ knowledge, this is the first time these topologies have been compared within the DTN communication paradigm.

The chosen constellations are suitable for Earth observation missions [10], data-collection systems, or high-latency communication systems [11]. If used for Earth observation, the ground target locations would represent points of interest from which on-board instrumentation can acquire optical or radar images, or other remote sensing data. If used for data-collection or high-latency communication systems, ground targets would stand for ground-based equipment relaying either science or local user data (a.k.a. cold spots). In both cases, data sent from ground targets would be addressed via orbiting satellites to a centralized Mission Operations and Control (MOC) center reachable through the Internet (i.e., any of the 6 ground stations). Indeed, a MOC could act as an Internet gateway to deliver traffic to otherwise inaccessible ground targets. As a result, the publish/subscribe traffic pattern analyzed in the simulation is bidirectional: from all ground targets to the MOC and in reverse. In particular, each ground target generates a bundle of 125000 bytes per hour to be delivered to the MOC. In turn, the MOC sends one bundle of equal characteristics to each ground target, also every hour. In both cases, traffic generation only occurs during the first 10 h of the 24 h of simulation. Therefore, 250 bundles flow to the MOC and 250 to the ground targets on the return path. Also, the transmission data rates for both intersatellite and Earth-to-satellite links were set to 100 Kbps, which can be obtained from a state-of-the-art CubeSat transponder (a CubeSat is a type of miniaturized satellite that is made up of multiples of 10 × 10 × 11.35 cm cubic units [70]; CubeSats are gaining increased popularity as they leverage existing Commercial Off-The-Shelf (COTS) components, providing a cost-efficient alternative for building distributed satellite constellations) [71].
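A quick back-of-the-envelope check of this traffic load (using only values stated above) is sketched below; it confirms the 500-bundle total and shows that each bundle occupies a 100 Kbps link for 10 s per hop.

bundle_bytes = 125000            # one bundle per ground target per hour
link_rate_bps = 100000           # 100 Kbps ISL and ground links
targets, hours = 25, 10          # generation during the first 10 h only

tx_time_s = bundle_bytes * 8 / link_rate_bps   # 10.0 s of link time per bundle per hop
bundles_to_moc = targets * hours               # 250 bundles towards the MOC
bundles_to_targets = targets * hours           # 250 bundles on the return path
print(tx_time_s, bundles_to_moc + bundles_to_targets)   # 10.0 500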

Simulation results are plotted in Figure 7. The abscissa shows the variation in MTTF, including an infinite (INF) value standing for a single simulation execution without the occurrence of failures. Indeed, the measurements at this last point of the horizontal axis serve as a reference performance for each of the proposed constellations. For the remaining MTTF values (from 100 s up to 2200 s in steps of 300 s), the resulting metrics are averaged over 160 simulation runs and then averaged over bundles or nodes accordingly. On the other hand, the Mean Time To Repair (MTTR) was set to 5 minutes, mimicking a full system reboot. Failures were enabled only in orbiting satellites; ground stations and ground targets were not assumed to fail in this analysis.

The effective fail time curve on the left shows the accumulated amount of time that a given node in the network refrained from transmitting a bundle because of either a local or a remote (i.e., next-hop node) fault condition. In general, the along-track formation exhibits a higher effective fail time, which is consistent with the higher connectivity of a permanently connected constellation. In other words, the probability of encountering a failure during a contact is higher in the along-track than in the Walker system.

A less intuitive result is shown in the total bundles received curve. This metric measures the quantity of bundles that arrived at the final destination (application layer). The Walker constellation is able to deliver the full traffic load of 500 bundles for an MTTF of 700 s and higher. However, the along-track formation is unable to deliver such a load even without any failures injected in the nodes (453 bundles reach the destination in this case). After a thorough analysis of the simulation traces, it was found that a pathological routing behavior prevented CGR from finding all feasible routes, leaving certain bundles without valid routes (i.e., in the limbo). These miscalculations are magnified as the failure rate increases. The issue found in the CGR specification is related to the anchoring concept and is discussed in detail in Section 4.

The mean bundle delay curve is shown with the absolute mean delay value measured in minutes as well as relative to the metric observed without faults in the network (i.e., the delay for infinite MTTF). These results confirm that the Walker formation provides a better overall bundle delivery time. However, the along-track system proves significantly more stable under variation of the MTTF. In other words, the along-track formation is less sensitive (i.e., more robust) to faults than the Walker constellation. This effect is clearly observed in the relative expression of the mean bundle delay. It is interesting to note that, given the delay-tolerant nature of both systems, the fact that latency remains in the order of minutes is completely reasonable.

The rerouted bundles curves are also expressed in absolute and relative format. These results show the number of bundles that had to be rerouted while in transit because a contact ended with a nonempty outbound queue. As previously discussed, this happens when the capacity allocated by the first CGR execution does not match the real capacity in the system. Successive CGR executions are necessary either because of failures or because of congestion problems [49–51]. Indeed, the result without failures (infinite MTTF) shows the amount of rerouting required due to congestion in both constellations. The relative curve thus evidences that rerouting in the Walker constellation increases more dramatically than in the along-track formation in the presence of failures. Such an increase can also be explained by the higher delivery rate of this formation. This metric then confirms that the along-track formation is less sensitive to higher fault rates.

The mean bundle hop count curve shows that the Walker constellation makes more extensive use of multihop paths. Such a feature is only observed in the along-track system in the absence of failures. For all other cases, the along-track constellation uses 3-hop paths (MOC to ground station, ground station to satellite, and satellite to ground target), meaning that several ISL opportunities are underutilized in the presence of failures. Moreover, the along-track system exhibits a higher number of total transmitted bundles in most cases. This curve measures the total quantity of bundles transmitted by all nodes in the network (either for local or remote traffic). This metric can thus be directly correlated with the energy usage of the node transmitters. In this case, the along-track formation uses fewer hops and yet exhibits a higher energy usage due to bundle transmissions.

The last two plots on the right provide a metric on the storage used by the nodes. The upper curve depicts the overall spacecraft data recorder (SDR) memory used (outbound queue buffer occupancy), while the lower one shows the quantity of bundles that remain in the limbo at the end of the simulations because of the absence of feasible routes. In general, a significant number of bundles remain in the limbo for the along-track formation. Also, the Walker constellation exhibits a lower memory utilization. This is consistent with the previous analysis and confirms the inability of CGR to find all feasible routes in the along-track topology. Such a problem is not observed in the Walker system, which also features a significantly lower storage utilization, as shown in the bundles in SDR plot.

4. Discussion

In order to explain CGR’s inability to find all feasible routes in the along-track constellation, a simplified along-track topology and its corresponding contact graph are presented in Figures 8(a) and 8(b), respectively. In contrast to the Walker formation, the along-track formation exhibits a remarkable link redundancy, since a single point on Earth is able to simultaneously reach several orbiting nodes while each satellite has simultaneous access to its front and back neighbors. In this example, a ground station (GS A) can reach a satellite (SAT B), which in turn can reach the front and back satellites in the constellation. Eventually, the three satellites will be able to reach a ground target (GT) when the constellation passes over that ground area (notice that contacts 4, 5, and 6 might occur later in time, requiring temporal storage of data in the satellites).

Even though the existence of several paths towards the ground target is desirable from a reliability perspective, it requires a routing algorithm that is able to discover and manage several parallel paths towards a given destination. However, the CGR specification discussed in Section 2.1 was found to overlook several valid paths in this kind of scenario. In this example, the first Dijkstra search executed in the ground station node will find a valid route towards the ground target, say via contacts 1, 2, and 5 (cyan arrow). The limiting contact in this path is contact 1, which will end before the ISL contacts, which are permanent in an along-track formation. As previously discussed in Algorithm 3, this contact will then be removed from the contact graph in order to begin the next search. But removing contact 1 will hinder the discovery of the other two feasible paths, via contacts 1 and 4 and via contacts 1, 3, and 6. Therefore, the ground station node will refrain from sending more bundles than those that can fit in the discovered route (as discussed in Algorithm 2, the route residual capacity must be enough to accommodate the forwarded bundle). The capacity of the remaining routes will remain underutilized, leading to partial delivery of bundles, increased delay, congestion, and increased storage utilization, as evidenced in Figure 7. It is worth noticing that this issue is not an implementation matter but part of the core of the current CGR definition.
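This effect can be reproduced with a toy Python sketch of the Figure 8 topology (the contact identifiers follow the text; the node labels SAT C and SAT D for the front and back neighbors are hypothetical names introduced for this example only):

contacts = {                      # contact id: (transmitting node, receiving node)
    1: ("GS A", "SAT B"),
    2: ("SAT B", "SAT C"), 3: ("SAT B", "SAT D"),
    4: ("SAT B", "GT"),    5: ("SAT C", "GT"),   6: ("SAT D", "GT"),
}

def routes(src, dst, suppressed=frozenset(), path=()):
    """Enumerate all contact sequences from src to dst avoiding suppressed contacts."""
    if src == dst:
        yield path
        return
    for cid, (frm, to) in contacts.items():
        if cid not in suppressed and frm == src and cid not in path:
            yield from routes(to, dst, suppressed, path + (cid,))

print(sorted(routes("GS A", "GT")))                  # [(1, 2, 5), (1, 3, 6), (1, 4)]
# Contact 1 is the limiting contact of the first route found; once it is
# suppressed for the next search, the two remaining feasible routes vanish too:
print(sorted(routes("GS A", "GT", suppressed={1})))  # []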

At the moment, the authors are investigating alternative DTN routing algorithms leveraging a contact graph. As part of these efforts, an alternative definition of the CGR algorithm that can enhance the current route discovery capabilities is under development. A generalization of the anchoring concept might lead to a complete route discovery. Relying on K-Shortest Path (KSP) algorithms is also being considered [72]. Another line of future work involves the investigation of novel mechanisms that could exploit the MTTF and MTTR parameters (generally known or estimated in advance) to support proactive fault-avoidance forwarding measures. This research line might be based on approaches similar to the ones used in opportunistic CGR [48]. Finally, the presented analysis can be further extended by modeling high channel delays and more complex custody transfer protocols. In real networks, the lack of a custody acceptance is the means by which a DTN node can realize that a neighbor is unresponsive. Even though disregarded in the presented simulations, timeout configurations of custody messaging and significant channel delays can play a significant role in a correct and timely reaction to network faults.

5. Conclusion

In this work, an extensive analysis of the reliability of satellite-based Delay/Disruption Tolerant Networks (DTN) was presented. Two appealing and realistic Low-Earth Orbit constellations using state-of-the-art routing algorithms were considered and compared by means of simulations. To this end, a unique overview of Contact Graph Routing (CGR) was provided and implemented in DtnSim, a novel space DTN simulator provided to the DTN community.

Results include the first evidence of the performance of Walker and along-track formations under different failure rates. As expected, the higher the failure rate, the more significant the performance degradation. The Walker formation proved to provide better delivery and resource utilization metrics, while the along-track formation was found to be less sensitive (i.e., more robust) to faults. Even though the intrinsic redundancy present in the along-track topology favors an improved fault tolerance, the analysis showed that the current version of CGR was unable to make an optimal utilization of the communication resources. A proper identification of the algorithm’s weakness was presented and discussed based on the detailed overview of CGR.

The presented analysis constitutes a solid starting point towards an improved CGR specification, which is currently under study by the authors. Also, future work includes the exploration of further CGR enhancements that could improve its robustness when implemented in fault-prone DTN systems.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

Part of this research was carried out in the frame of the “Calcul Parallèle pour Applications Critiques en Temps et Sûreté” (CAPACITES) project. Part of this research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. Government sponsorship is acknowledged.