Special Issue: Solving Engineering and Science Problems Using Complex Bio-inspired Computation Approaches
Recovery Routing Based on Q-Learning for Satellite Network Faults
With intensive research on the space and terrestrial network, the satellite network, as its main component, has received increasing attention. Because of its special operating environment, the satellite network suffers both temporary link failures caused by interference and permanent port failures caused by equipment problems. In this paper, we propose a new satellite network routing technique for fault recovery that builds on fault detection. Using Bayesian decision theory, the technique estimates the probability of each fault from the a priori probabilities of the two fault types, so that it can effectively distinguish between the two types and locate faulty links and node ports. Then, corresponding to the two stages of fault detection, routes are updated at different stages and by different methods for the different fault types. We validate the study with satellite network data from satellite simulation software. The results show that the recovery strategy performs well and that the effective resource utilization rate is improved significantly.
In recent years, emerging network services and applications have driven the rapid development of the space and terrestrial network and of sixth-generation mobile communication technology. The satellite network, as an important part of both, has received more and more attention. In the space and terrestrial network, the satellite network and the terrestrial network form a complementary overall network. For the satellite network system, the ground part not only undertakes the coordination and management of the satellite network but also connects it with other terrestrial network systems. Therefore, the management of the satellite network cannot depend completely on the ground part, and it is better for the satellite network to have a certain degree of pretreatment or autonomy. In addition, the satellite network in the space and terrestrial network usually has a complex multilayer structure. The space environment also differs from the ground. The ground network is located in the atmosphere and is surrounded by the Earth's magnetic field, so its environment is relatively stable. The space environment is susceptible to interference from cosmic celestial activities, adjacent constellations, and ground communications equipment. As a result, satellite networks carrying complex service traffic fail frequently. This paper therefore mainly considers fault detection in the satellite network and the corresponding antidestructive routing after a fault. Given the characteristics of satellite networks, the main topics that need to be studied are the following:
(i) The distinction between temporary faults and permanent faults. Charged particles in space can surround the outside of the antenna and form a "shell" that temporarily blocks transmission on the channel. When the interference disappears, the link returns to normal. This situation can be regarded as a temporary link failure. In addition, under extreme cold and without atmospheric protection, satellite node equipment can itself fail. For example, high-energy charged particles produced by solar burst activity can penetrate the interior of satellite equipment and prevent it from working properly. Such failures are difficult to repair through manual maintenance, so they can be regarded as permanent.
(ii) Differentiated route recovery for different fault types. On the one hand, traditional satellite network route restoration technology does not consider the difference between the causes of the two types of failures, which may lead to a completely normal node being abandoned during route restoration. On the other hand, the nodes involved in a failure are treated as a whole rather than by ports. Route restoration often avoids the failed node, finds a new node nearby to replace it, or eliminates it by full-path rerouting. These two reasons lead to a low utilization rate of the nodes involved in the fault in general antidestructive routing, which affects the search for the optimal antidestructive path.
We first propose a new low-overhead information collection mechanism that collects and splits the relevant information of successful transmission paths within a given time and then calculates the reachable, unreachable, and unknown areas. On this basis, using Bayesian decision theory under limited known conditions, a posterior probability index, which represents the posterior probability of each type of fault on each link, is obtained from the a priori probabilities of the different fault types. In this way, the types of link failures can be distinguished. Unlike the maximum a posteriori probability decision in general Bayesian decision theory, setting a threshold on the posterior probability index allows multiple failed links to be output as detection results. This effectively addresses the degradation of detection quality in multiple-fault scenarios and is better suited to fault-prone satellite networks. Misjudgments made in this process can be checked in the second stage of further testing.
In addition, based on the Q-learning algorithm in reinforcement learning, we propose a route recovery technique that builds on the above fault detection. The collected information is used to update a Q-value table composed of a two-dimensional state space and a one-dimensional action space. For different types of faults, the Q-values of the local state space and action space of the different related nodes are updated, so that route recovery is differentiated by fault type. The reward function consists of queuing time, transmission time, and link lifetime, and the discount factor is also related to the link lifetime. This effectively reduces the impact of network dynamics on the stability of routing results and ensures the stability of path connections.
This article is organized into six sections. Section 2 discusses related work on fault detection for satellite networks and on routing for fault recovery in dynamic networks. Section 3 consists of several subsections, which, respectively, explain the mathematical model, steps, and advantages of the proposed satellite network fault detection mechanism. Section 4 is likewise composed of several subsections, which, respectively, explain the mathematical model, steps, and advantages of the proposed satellite network antidestructive routing technology. In Section 5, we verify the proposed mechanism and technology and analyze the simulation results. The conclusions are given in Section 6.
2. Related Works
Fault detection is indispensable in network management. A variety of fault detection algorithms are currently applied to various networks, including artificial intelligence algorithms [4–6]. In the ground network, communication after a link failure can resume within a short time through retransmission and other techniques, but this does not hold in the space environment. Some natural phenomena interrupt the satellite communication link for longer than a typical ground network link failure, and the link is difficult to recover in a short time. This kind of failure, which does not affect the device itself, is often temporary and geographically correlated. At present, fault detection in satellite communication systems is mainly applied to satellite communication equipment, and the main methods include neural networks and the adaptive observer-based method. Because the satellite network is the backbone of the space and terrestrial network, timely detection of its faults and restoration of its communication capability ensure the healthy and stable operation of the network. Given the characteristics of the satellite network, such as precious resources and strong dynamics, Sun et al. propose a fault detection algorithm combining FTA (Fault Tree Analysis) and a cluster method, which applies to LEO/MEO/GEO satellites and ground stations. First, the space network is clustered to form a number of management domains. Then, based on the exchange of test information between the ground station and the cluster head, FTA is used to obtain further fault detection methods. However, this algorithm only considers node failures and does not consider link failures caused by transmission medium interference. The same applies to some other studies [11–13]. Lin et al. propose combining an expert system with the C4.5 algorithm to increase the online diagnosis capability for unknown faults. Although some of these studies start from the link perspective, the final decision still concerns faulty nodes [12, 13].
Mobile ad-hoc networks (MANETs) are similar to satellite networks, but their topology and link quality change in a more complex and variable way. Therefore, results obtained for these networks on ensuring stable data transmission are a useful reference [14, 15]. For flying ad-hoc networks (FANETs), Rosati et al. extend optimized link-state routing (OLSR) by using GPS to predict changes in link quality, which enables the routing protocol to track changes in network topology. Malviya and Tiwari introduced a modification to the dynamic source routing (DSR) protocol and proposed load-balanced multipath dynamic source routing (LMP-DSR), which uses multiple paths instead of the single path used in the original DSR. LMP-DSR can effectively improve the data transmission quality of MANETs with complex topology changes. Pei et al. proposed Hierarchical State Routing (HSR), which has better scalability and shields network dynamics by adopting the concept of logical subnets. Compared with the above networks, the satellite network has its own characteristics: its topology changes periodically, its resources are precious, and its environment is harsh. Therefore, the routing strategies of the above networks are not directly applicable to the satellite network, and some modifications need to be made according to its characteristics. Most existing satellite network routing strategies treat faulty nodes as a whole rather than by ports. They often avoid the faulty nodes, find a new node nearby to replace them, or eliminate the fault by full-path rerouting around the nodes [19–21]. On the one hand, this greatly wastes node resources, because the other ports can still undertake communication tasks. On the other hand, because temporary link failures caused by natural factors are not considered, a completely normal node may be discarded during route restoration. These two reasons lead to a low utilization rate of the nodes involved in the failure during route recovery, which affects the search for the optimal path. Some researchers have also investigated robustness from the flow aspect, which opens a new direction [22, 23], and Bachmann et al. discussed the robustness of interdependent networks. A reliable control design for networked control systems, applied to satellite systems, has also been proposed. Fu et al. set up a cascading model for wireless sensor networks with different load-redistribution schemes, which reflects the influence of traffic on MANETs. Zhao et al. proposed an extended recursive Cramer-Rao lower bound method to analyze the performance of wireless indoor localization signals, which is useful for enhancing reliability in another wireless scenario.
3. Fault Detection Mechanism
In this section, we focus on the detection mechanism that differentiates the two types of faults based on Bayesian decision. First, we mathematically model the fault detection problem. Then, we explain the fault detection mechanism. The communication overhead of the mechanism is reduced by collecting the path information of successful routes, and the periodicity of the satellite network is used to carry out fault classification in two stages.
3.1. Mathematical Model
3.1.1. Input Variables and Parameters
V: network adjacency matrix, representing the network topology.
Link set: the set of all links in the network, assuming the total number of links in the network is l.
M: set of successful communication paths, assuming m successful communication paths and n nodes.
Port failure prior: the a priori probability of a failure caused by a node port, where x represents the number of simultaneously failed ports.
Temporary link failure prior: the a priori probability of a temporary link failure.
R: reachable area, the set of nodes that a data packet from some node can reach through a certain path.
Unreachable area: the set of nodes that cannot be reached through any path from a certain node.
U: unknown area, the set of nodes for which, starting from a certain node, it is not certain whether they can be reached because information is limited; generally, it is the set of nodes in the network that belong to neither the reachable area R nor the unreachable area.
3.1.2. Objective Function
Assume that the satellite node has ports. Suppose that there are n failed paths, the total number of links on the path is f, and the value of h-f is the number of overlapping links in n failed paths; i = 1, 2, 3, …, f, …, h.
The a posteriori probability of permanent link failure is as follows:
The a posteriori probability of temporary link failure is as follows:
Equations (1) and (2) are the objective functions derived from Bayesian decision. The first term in equation (1) is a priori probability of permanent link failure. The numerator represents the probability of presenting the domain division results below in the case of a link failure, and the denominator is the probability of presenting the domain division results.
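The structure described above can be written out as a generic Bayes-rule sketch. The symbols below are assumptions introduced only for illustration, since the notation of equations (1) and (2) is not available here: $F_i^{\mathrm{perm}}$ and $F_i^{\mathrm{temp}}$ denote a permanent and a temporary failure of link $i$, and $D$ denotes the observed division of the network into reachable, unreachable, and unknown areas.

```latex
% Assumed notation (not taken from the original equations):
%   F_i^{perm}, F_i^{temp}: permanent / temporary failure of link i
%   D: observed division into reachable, unreachable, and unknown areas
\begin{align}
P\left(F_i^{\mathrm{perm}} \mid D\right)
  &= P\left(F_i^{\mathrm{perm}}\right)\,
     \frac{P\left(D \mid F_i^{\mathrm{perm}}\right)}{P(D)}, \\
P\left(F_i^{\mathrm{temp}} \mid D\right)
  &= P\left(F_i^{\mathrm{temp}}\right)\,
     \frac{P\left(D \mid F_i^{\mathrm{temp}}\right)}{P(D)}.
\end{align}
```

In each line the first factor is the a priori probability, the numerator is the probability of observing the area division given the failure, and the denominator is the probability of observing the area division at all. Because $P(D)$ is the same for every link and both fault types, comparing candidates only requires the product of the prior and the numerator.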
3.1.3. Constraint Conditions
For the same link, the events of a permanent link failure and a temporary link failure are considered separately.
Equation (3) guarantees that the a priori probability of a permanent link failure is different from that of a temporary link failure. Equation (4) guarantees that the occurrence of a permanent link failure event and the occurrence of a temporary link failure event are independent events; that is, the probabilities of occurrence of the two events are not affected by each other. Equation (5) guarantees that at least one successful communication path can be obtained during the information acquisition phase.
3.2. Mechanism Process
This mechanism uses known flow information to detect faulty links in the network with low detection overhead. It distinguishes different types of faulty links by the a priori probabilities of the different faults. It also combines threshold judgment with new-neighbor-node detection to suit the multiple-fault environment of the satellite network.
The improved multitype fault detection mechanism of satellite networks based on Bayesian decision is shown in Figure 1. The steps of the fault detection process are described below:
(1) Determine the a priori probabilities of the two types of failures. The two types are temporary link failures caused by transmission medium interference and permanent link failures caused by problems with node port hardware. From existing satellite network failure information, we can obtain the a priori probability of the former and the a priori probability of satellite port failure. In this study, a satellite node is set to have four ports, and the failure probability for a certain satellite node is set accordingly. A failure caused by a port also manifests as a link failure, so each link has an a priori probability of a failure caused by a port.
(2) Collect the path information of the data transmitted successfully within the collection time. In detail, the path and link information of successful data transmission in the network is collected and stored for the collection time after this fault detection mechanism is triggered.
(3) Divide the network area according to the collected information. The specific division method is as follows:
(a) When some nodes report that expected data packets have not been received, add the nodes on the corresponding routing result to the node set. The links are added to the set C in the form of node pairs according to the transmission direction. Each reported failure path is recorded, which gives the reported path set P.
(b) Within the collection time, collect the links that successfully send and receive information through the set M of normally transmitting paths, and also add them to the set C in the form of node pairs according to the transmission direction.
(c) .
(d) The starting node at the sending end of the path is defined as the initial node, and the receiving node is defined as the end node; the reachable area R, the unreachable area, and the pending area U are set up from these two nodes.
(e) For the starting nodes of the other failed paths in the set P: if a starting node is the same as the initial node, add the receiving node of that path to the unreachable area; if it is different, go to (f).
(f) Compare the border nodes of the reachable area R with the links in C. If a link in C connects a border node of R to a node n, add n to R. Repeat this step until no new node n exists.
(g) Add the remaining nodes to the pending area U, which yields a fault detection model consisting of R, the unreachable area, and U.
This model is built with one failure path as an example, and the model building steps are repeated for each failure path. It should be noted that, to meet the special needs of satellite networks, this mechanism assumes that the receiving end node of the failed path belongs to the unreachable area. Compared with the original definition, this assumption increases the number of suspected faulty links, and the accuracy rate may decrease. However, since the mechanism consists of two detection stages, the detection in the second stage reduces the impact of this assumption on the overall performance of the mechanism.
(4) Use the Bayesian formula to calculate the probabilities of the two types of failure for each link. This algorithm is triggered by a reporting path, and there may be multiple reporting paths, so the calculation in this step is repeated for each reporting path. "Each link" here refers to each link on the reporting path corresponding to this calculation. The failure probability of a link is expressed for the event that the link fails, where the link belongs to the set of links on the reported failure path. To compare the probabilities of link failures, a simplified a posteriori probability can be obtained from the Bayesian formula. This simplified expression is the final calculation target; after the a priori probabilities of the two types of failure are substituted into it, we obtain the probability indexes of the two types of failure. Here X is some event and c is a cut. The number of cuts c differs with network conditions and is obtained from the three divided areas.
(5) Determine whether a certain faulty link exists. A certain faulty link is the result for a reported faulty path that has only one output; that is, if there is only one possible faulty link, it is a certain faulty link. If one exists, remove this link from the topology map and go to (3); if there is no certain faulty link, go to (6).
(6) Obtain the output faulty link result according to the threshold.
(7) Judge the failure rate. The judgment limit here is obtained from the simulation results presented later. If the failure rate is less than 20%, go to (8); if the failure rate is more than 20%, go to (9).
(8) Perform the first-stage classification of the fault type: make a preliminary judgment of the fault type of each faulty link using the fault probability index calculated in (4).
(9) Perform the second-stage judgment of the fault type. In this stage, we use the topology changes of the satellite network, and the new neighbor nodes or newly communicable nodes that appear after a topology change, to further detect the failure type of failed links whose type has not been identified. For the research object of this article, the polar orbit constellation, intersatellite links can be divided into non-seam-area links and seam-area links. For non-seam-area links, the newly communicable ground stations or newly connectable satellite nodes can be used for further fault type detection after the topology changes. For seam-area links, the new neighboring nodes after the topology change can be used directly for further detection. The results of further detection include the fault classification of unclassified faulty links and the identification of the specific faulty ports for faulty links caused by node port failures.
(10) Obtain the output of the mechanism, namely the location and type of each faulty link.
Two points need to be added. First, there is a special case in (3): when the network contains only reachable and unreachable areas and no unknown area, the calculation in (4) cannot be performed. In this case there must be at least one link with one end in the reachable area and the other end in the unreachable area. Such a link must be a failed link; otherwise, this situation could not arise. These links are not calculated in (4) and are instead detected directly by (9). Second, because the decision in (6) is made by a threshold, the output faulty link results include both the real faulty links and misjudged normal links. In the second-stage detection process in (9), all output faulty links are examined for fault classification and the faulty ports are determined. Misjudged links are effectively eliminated in this step.
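The area division in step (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `partition_regions` and `suspect_links` are hypothetical, links are treated as directed node pairs, and the receiving end of the failed path is placed in the unreachable area as the mechanism assumes. Treating every link on the failed path that is not confirmed by successful traffic as a suspect is also an assumption made here for illustration.

```python
from collections import deque


def partition_regions(nodes, successful_links, failed_path):
    """Split nodes into reachable, unreachable, and unknown sets for one failed path.

    successful_links: iterable of directed (sender, receiver) pairs seen on
    successfully transmitted paths (the set C in the text).
    failed_path: list of nodes from the sending end to the receiving end.
    """
    source, destination = failed_path[0], failed_path[-1]

    # Build a directed adjacency map from links confirmed by successful traffic.
    confirmed = {}
    for sender, receiver in successful_links:
        confirmed.setdefault(sender, set()).add(receiver)

    # Grow the reachable area outward from the source over confirmed links.
    reachable = {source}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        for nxt in confirmed.get(node, ()):
            # The receiving end is assumed unreachable, so it is never absorbed.
            if nxt not in reachable and nxt != destination:
                reachable.add(nxt)
                frontier.append(nxt)

    unreachable = {destination}
    unknown = set(nodes) - reachable - unreachable
    return reachable, unreachable, unknown


def suspect_links(successful_links, failed_path):
    """Links on the failed path that no successful traffic has confirmed."""
    confirmed = set(successful_links)
    return [
        (a, b)
        for a, b in zip(failed_path, failed_path[1:])
        if (a, b) not in confirmed
    ]


if __name__ == "__main__":
    nodes = [1, 2, 3, 4, 5, 6]
    links = {(1, 2), (2, 3), (5, 6)}
    path = [1, 2, 4, 6]
    print(partition_regions(nodes, links, path))
    print(suspect_links(links, path))
```

The sketch repeats once per reported failure path, as the text requires. When the unknown set comes back empty, the special case noted above applies and the boundary links go straight to second-stage detection.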
4. Route Recovery Technique
In this section, we will focus on the route recovery technique, which is based on the failure detection mechanism above. First, we mathematically model the research problem. Then we describe the process of the route recovery technique. Corresponding to the two stages of fault detection, this technique uses different route update methods to recover the routes of nodes related to different types of faulty links.
4.1. Mathematical Model
4.1.1. Input Variables and Parameters
: current time.
: the moment of the last on/off change of a link.
: queue delay.
: the length of time during which the link remains connected.
The state space is a two-dimensional state space . The current node and the destination node jointly describe the system state.
The action space is , which is composed of the neighboring node set of the current node .
The return value is , which is negatively related to the cost of using the current node as the sender and passing the neighbor node of to the destination node .
4.1.2. Objective Function
Assume that the starting node of the route is and the destination node is . There are n nodes in the network.
The route is as follows:where ; for the formula, m = 3, 4, …, n.
Assume that l is the number of iterations. It means that the return value has converged, where i, k = 1, 2, …, n, and the range of z is related to the number of neighbors.
4.1.3. Constraint Conditions
For any moment,
Equation (12) ensures that the time direction is positive; that is, the last on/off change is the latest link on/off time in the positive time direction. Equation (12) also ensures that the return value table of the related nodes is updated before the network topology changes. Equation (13) ensures that the routing principle is the maximum return value.
The above model can be expressed in a Q-learning-based routing algorithm  for each node in the network to maintain a Q-value table, as shown in Table 1. : Q-value table maintained by node . : Neighbors of node , including and ; the specific number depends on the number of neighbors of node . : Destination nodes, including and ; the specific number depends on the number of network nodes. : The return value of the node as the sending end, passing the neighboring node to the destination node as .
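A per-node Q-value table of this shape, together with the maximum-return routing principle, can be sketched as follows. This is an illustrative data structure only: the nested-dictionary layout, the function names `init_q_table` and `next_hop`, and the node labels are assumptions, not the layout of Table 1.

```python
def init_q_table(neighbors, destinations, initial=0.0):
    """Create one node's Q-table: rows are destinations, columns are neighbors."""
    return {dest: {nbr: initial for nbr in neighbors} for dest in destinations}


def next_hop(q_table, destination):
    """Pick the neighbor with the largest return value for this destination."""
    row = q_table[destination]
    return max(row, key=row.get)


if __name__ == "__main__":
    # Hypothetical node with two neighbors (B, C) and two destinations (D, E).
    q = init_q_table(["B", "C"], ["D", "E"])
    q["D"]["B"] = -5.0  # higher cost via B, so a lower return value
    q["D"]["C"] = -2.0
    print(next_hop(q, "D"))
```

Because the return value is negatively related to cost, the lowest-cost neighbor has the largest Q-value and is selected.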
4.2. Technical Steps
The process of the route recovery technique is shown in Figure 2.
(1) Initialize the Q-table of each node.
(2) The satellite and the ground station exchange management information and confirm the time information. The ground station sends each updated Q-value table, the discount factor γ, and the learning rate α to each satellite node. After receiving the latest Q-value table, each satellite node sends the queuing time of its data transmission to the ground station, which uses it to update the Q-value table in real time.
(3) Determine whether a routing failure path has triggered the fault detection mechanism. If so, go to (4); otherwise, go to (7).
(4) Update the Q-value table according to the detection result of the first stage of the fault detection mechanism. The Q-tables of the two nodes at the ends of the faulty link and of their neighboring nodes are updated. F and its peer denote the nodes at the two ends of the faulty link: F is the sending end of the data traffic and its peer is the receiving end. Y and its counterpart denote the remaining neighbors of the two end nodes, excluding each other.
(5) Update the Q-value table according to the detection results of the second stage of the fault detection mechanism.
(6) Organize the fault situation, rebuild the topology at each moment, and go to (1).
(7) When the network is running normally, update the Q-tables according to the update formula, which uses the current time, the time of the expected network link change as shown in (1), and the length of time during which a link remains connected.
(8) Determine whether the topology has changed. During normal operation of the satellite network, the topology changes because of intersatellite link switching. If intersatellite link switching causes the network topology to change, go to (9); otherwise, go to (2).
(9) Perform a Q-transfer of the relevant link to the relevant node where the link switch occurs.
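The normal-operation update in step (7) can be illustrated with the standard Q-learning update applied to routing. This sketch is not the paper's formula, which is not available here. It assumes a reward equal to the negative of queuing delay plus transmission delay, and it omits the link-lifetime terms that the paper includes in the reward and the discount factor. The names `q_update` and `delay_reward` and the default values of `alpha` and `gamma` are hypothetical.

```python
def delay_reward(queue_delay, transmission_delay):
    """Assumed reward: lower total delay gives a higher (less negative) reward."""
    return -(queue_delay + transmission_delay)


def q_update(q_tables, node, neighbor, destination, reward, alpha=0.5, gamma=0.9):
    """Standard Q-learning update of Q_node(destination, neighbor).

    q_tables maps node -> destination -> neighbor -> Q-value.
    """
    if neighbor == destination:
        # Delivered on the next hop, so there is no future return to add.
        best_next = 0.0
    else:
        best_next = max(q_tables[neighbor][destination].values())
    old = q_tables[node][destination][neighbor]
    new = (1 - alpha) * old + alpha * (reward + gamma * best_next)
    q_tables[node][destination][neighbor] = new
    return new


if __name__ == "__main__":
    # Hypothetical three-node chain A -> B -> D.
    tables = {
        "A": {"D": {"B": 0.0}},
        "B": {"D": {"D": -1.0}},
    }
    r = delay_reward(queue_delay=1.5, transmission_delay=0.5)
    print(q_update(tables, "A", "B", "D", r))
```

A fault-driven update in steps (4) and (5) would adjust the entries of the two end nodes of the faulty link and their neighbors. Those formulas are not reproduced here, so the sketch covers only the normal-operation case.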
5. Experiment Results and Discussion
5.1. Fault Detection Mechanism Performance
In this simulation, the a priori probability of link failure is 0.1, and the a priori probabilities of one to four ports of a satellite node failing simultaneously are 0.08, 0.005, 0.001, and 0.0005, respectively. The link delay is set to about 20 ms, and the information collection time is set to 3 s. A static 6 × 6 topology is used, and 30 flows are set randomly to simulate a real satellite network communication environment. According to the preset a priori probability of each failure, the threshold of the link failure probability index is set to 0.01, and the threshold of the port failure probability index is set to 0.002. When both probability indexes are below their thresholds, the following rules apply: when the port failure probability index is less than 0.0001, a link failure is determined; when the port failure probability index is greater than 0.0001 and the link failure probability index is less than 0.0015, a port failure is determined. When both failure probability indexes are greater than their thresholds, or in other cases where the type of failure cannot be determined at this stage, further type detection is performed in the second stage.
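The first-stage decision rules above can be written as a small function. This is a literal transcription of the stated thresholds under one assumption: any combination the text does not assign to a fault type, including the case where only one index is below its threshold, is deferred to the second stage. The function name and the returned labels are hypothetical.

```python
LINK_THRESHOLD = 0.01   # threshold of the link failure probability index
PORT_THRESHOLD = 0.002  # threshold of the port failure probability index


def classify_first_stage(link_index, port_index):
    """Return the first-stage fault type, or defer to the second stage."""
    if link_index < LINK_THRESHOLD and port_index < PORT_THRESHOLD:
        if port_index < 0.0001:
            return "temporary_link_failure"
        if port_index > 0.0001 and link_index < 0.0015:
            return "port_failure"
    # Both indexes above threshold, or any case the rules leave undecided.
    return "second_stage"


if __name__ == "__main__":
    for pair in [(0.005, 0.00005), (0.001, 0.0005), (0.02, 0.003), (0.005, 0.0005)]:
        print(pair, classify_first_stage(*pair))
```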
The mechanism proposed here (Improved Fault Detection Method Based on Bayes Decision Theory, IDBB) is evaluated against two comparison algorithms: Centralized System-Level Fault Diagnosis (SLD) and the Distributed Fault Diagnosis Algorithm for Satellite Network (DFDS). The SLD method uses a centralized diagnosis strategy. Neighboring nodes in the network send test tasks to each other and return the test results they obtain. The computing centre collects the results of these test tasks and locates the faulty node through a probability-based strategy. The DFDS algorithm is a distributed network fault diagnosis algorithm suitable for autonomous processing on the satellite. It proposes an M-probability distributed test model based on system-level fault diagnosis theory.
As shown in Figure 3, when the failure rate is less than 10%, that is, when the network has few failures, the first-stage accuracy of the IDBB mechanism proposed in this paper is slightly lower than that of the SLD method and the DFDS algorithm. When the failure rate is greater than 10%, that is, when many links have failed, the first-stage accuracy of the IDBB detection results is significantly higher than that of the SLD method and the DFDS algorithm. Judged by the final output of the mechanism, the IDBB mechanism performs significantly better than the other algorithms.
The reasons for the above results are as follows:
(i) In the case of a low failure rate, the accuracy of the first-stage detection result of the IDBB mechanism is slightly lower. The reason is that, when a small-scale failure occurs, it is more likely that the multiple failure paths reporting routing failures are caused by the same failed link. The detection mechanism examines the links on the failed paths one by one. When there is only one suspected faulty link on a path, it is determined to be a faulty link and removed from the topology, and detection is repeated until no new determined faulty link appears; the suspected faulty links of each failed path are then output. Therefore, when several failed paths are caused by the same failed link, even after that link has been detected on one failed path, the algorithm still examines the other failed paths, identifies possible faulty links on them, and outputs these as suspected faulty links. This is why the first-stage detection accuracy of the IDBB mechanism is slightly lower when the failure rate is low.
(ii) In the case of a high failure rate, the first-stage detection accuracy of the IDBB mechanism is significantly higher. The reason is that the IDBB mechanism is triggered by routing failures. At the beginning of the algorithm, the path information of successful routing in the network is collected and stored in the form of links. During detection, all the links on the failed paths are reported, and the paths are compared and calculated one by one, which makes the mechanism more targeted in fault detection.
As shown in Figure 4, the detection completeness rates of the IDBB mechanism in the first stage and the second stage are equal. When the failure rate is less than 10%, the detection completeness rate of the IDBB mechanism, DFDS algorithm, and SLD method is almost 100%. When the rate is greater than 40%, the detection completeness of the DFDS algorithm and the SLD algorithm has decreased significantly, but the IDBB mechanism can still maintain a completeness rate of more than 90% when the failure rate is 60%.
The reasons for these simulation results are as follows. The IDBB mechanism detects failures on each reported path, which makes it more targeted. In addition, after calculating the failure probability index, the IDBB mechanism applies a threshold decision rather than selecting only the link with the largest failure probability. This makes the IDBB mechanism better suited to multifault detection environments, so its fault detection completeness rate is also higher.
In the simulation, we use the number of detection packets to represent the overhead. As shown in Figure 5, across the four network scales and the four failure rates, the communication overhead of the IDBB mechanism is much smaller than those of the DFDS algorithm and the SLD method. At a given failure rate, the communication overhead of all three detection methods increases significantly as the network scale increases. The overhead of the IDBB mechanism is the smallest, and the overhead of the SLD method is smaller than that of the DFDS algorithm. At a given network size, the overhead of the IDBB mechanism increases significantly as the failure rate increases, whereas the overhead of the SLD method does not change much.
The reasons for these simulation results are as follows. The IDBB mechanism sends test information only during the second stage of the detection process. The purpose is to further classify and verify the failed links identified in the first stage, which includes determining the faulty port, determining whether the fault is temporary, confirming whether the fault exists, and eliminating misjudged links. As the failure rate increases, the number of links that require further detection also increases, so at a given network size the detection overhead increases with the failure rate. In contrast, both the DFDS algorithm and the SLD method broadcast test tasks when collecting network information and use the responses returned by neighboring nodes to detect failures. This is why the communication overhead of the IDBB mechanism is much smaller than those of the DFDS algorithm and the SLD method across the four network scales and four failure rates. Furthermore, the DFDS algorithm uses a distributed structure in which each node needs to hold the detection results, whereas the SLD method uses a centralized structure in which the control centre holds the detection results. Therefore, in the case shown in Figure 5, the overhead of the SLD method is less than that of the DFDS algorithm.
In the first stage, fault classification works better when faults are confined to a small range. As shown in Figure 6, when the failure rate is less than 15%, the accuracy of fault classification exceeds 80%. When the failure rate exceeds 20%, the accuracy of the first-stage fault classification drops sharply. In the second stage, the accuracy of fault classification is high.
These results can be explained as follows. In the first stage, a higher failure rate means more faulty links, so the area division built from the collected link information becomes inaccurate; the greater the failure rate, the coarser the area division. Consequently, the classification accuracy obtained from the link failure probability index also decreases. In the second stage, further detection relies on the new neighbor nodes after the topology change and on the newly established ground station or satellite node connections. As the failure rate increases, these new neighbor nodes and new connections are themselves more likely to be faulty, so the probability of failing to correctly detect the fault type increases and the detection accuracy of the second stage decreases accordingly.
5.2. Antidestructive Routing Technology Performance
In previous work, we designed a satellite network simulation platform. On this platform, the simulation runs for 6000 s, the link faults are injected at 2040 s (the 34th minute), and the sampling interval is 60 s. The ground station located in Beijing (117°13′E, 40°05′N) is used as the transmitting end, and the ground station located in Los Angeles (120°26′W, 34°05′N) serves as the receiving end. Two faulty links are set in the selected path: one is a permanent failure caused by a port failure, and the other is a temporary link failure caused by transmission medium interference. The proposed technology (Multitype Fault Detection Routing Strategy, MFDR) is compared with a link fault recovery method based on the Backup Path Routing Policy (BPRP) and a link failure recovery method based on the Local Reroute Strategy (LRRS).
As shown in Figure 7, when the failure occurs at the 34th minute, the path delays of all three technologies increase significantly. The delay of MFDR is larger than the path delay of BPRP and smaller than the path delay of LRRS. The delay of BPRP immediately resumes regular fluctuation within a stable range. The delay of the MFDR path resumes regular fluctuation within a stable range after the 37th minute, that is, after the topology change. The delay of the LRRS path settles more slowly and resumes regular fluctuation within a stable range at the 39th minute. After stabilization, the path delay of BPRP is the same as that of LRRS, and the delay of the MFDR path is the smallest.
These results can be explained as follows. After a fault occurs, BPRP reacts most quickly: it immediately selects the backup path, which minimizes the path delay at that moment and keeps it stable afterwards. MFDR and LRRS initially only redefine the weights of the related links, and the link weights of the locally related nodes are updated in the following period. At the topology change in the 37th minute, MFDR starts to identify temporary faulty links and, where appropriate, restores the weights of the related links in the new topology. Therefore, after the 37th minute, the delay of the path using MFDR also becomes stable and, because temporary faulty links are identified, lies in a smaller interval than the delays of BPRP and LRRS. LRRS starts local rerouting after a fault occurs and, after a period, selects the best path available after the fault, so the path it uses once stability is restored is the same as the backup path.
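The effect of restoring the weight of a temporarily faulty link can be sketched with a plain shortest-path computation. This is an illustrative example only, not the authors' routing implementation: the six-link topology, the link weights, the use of Dijkstra's algorithm, and the helper names `shortest_path` and `set_link_weight` are assumptions introduced here. A faulty link is modeled by an infinite weight.

```python
import heapq
import math


def shortest_path(graph, src, dst):
    """Dijkstra's algorithm on an adjacency dict; returns (cost, path)."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            break
        for v, w in graph[u].items():
            if math.isinf(w):  # link currently marked as faulty
                continue
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return math.inf, []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]


def set_link_weight(graph, u, v, weight):
    """Update an undirected link weight in both directions."""
    graph[u][v] = weight
    graph[v][u] = weight


# Hypothetical topology: S-A-B-D is the original path.
graph = {n: {} for n in "SABCD"}
for u, v, w in [("S", "A", 1), ("A", "B", 1), ("B", "D", 1),
                ("S", "C", 4), ("C", "D", 3), ("B", "C", 1)]:
    set_link_weight(graph, u, v, w)

before = shortest_path(graph, "S", "D")

# Phase 1: both faulty links are excluded (temporary S-A, permanent B-D).
set_link_weight(graph, "S", "A", math.inf)
set_link_weight(graph, "B", "D", math.inf)
phase1 = shortest_path(graph, "S", "D")

# Phase 2: S-A is identified as temporary and its weight is restored.
set_link_weight(graph, "S", "A", 1)
phase2 = shortest_path(graph, "S", "D")
```

In this toy topology, the path chosen once the temporary link is restored is cheaper than the path chosen while both links are excluded, which is the qualitative behavior attributed to MFDR above.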
The simulation of effective resource utilization involves all 20 ground station terminals: 10 are set as data sending ends and 10 as data receiving ends. The simulation covers the interval from 1200 s to 3000 s, for a total of 1800 s. The simulation is repeated 20 times at each failure rate, and the effective resource utilization is averaged.
As shown in Figure 8, the effective resource utilization rate of the network under MFDR is better than under BPRP and LRRS at all failure rates. When the failure rate is low, the advantage of MFDR over BPRP and LRRS is significant. As the failure rate increases, the effective resource utilization of the BPRP and LRRS networks increases slightly, while that of the MFDR network decreases slightly. In all cases, the resource utilization of LRRS is slightly higher than that of BPRP.
These results can be explained as follows:
(i) Since MFDR includes a detection and recovery mechanism for temporary faulty links and a permanent fault detection mechanism that uses ports as the detection unit, it can lock the true faulty link to the greatest extent, so MFDR maintains a high effective resource utilization rate at all failure rates.
(ii) The simulation accounts for cases in which MFDR cannot determine the type of a link failure or misjudges it. The corresponding probabilities are set mainly according to the simulation results of the failure detection mechanism above. Therefore, as the failure rate increases, the accuracy of fault type detection in the second stage decreases and some temporary faults cannot be identified, which reduces the effective resource utilization of MFDR route restoration.
(iii) Increasing the failure rate also increases the probability that a failed link is caused by both temporary and permanent failures. In this case, BPRP and LRRS select new routes during the route restoration process, and less is wasted, so the effective resource utilization of BPRP and LRRS increases slightly as the failure rate increases.
(iv) LRRS has some probability of benefiting from temporary link recovery during the rerouting process, so its effective resource utilization rate is higher than that of BPRP.
We have designed a satellite network fault detection mechanism that considers both transient and permanent faults and locates permanent faults at the level of node ports. Based on this mechanism, an antidestructive routing technology is proposed. It has the following advantages.
It addresses the problem that temporary and permanent failures currently cannot be distinguished. The mechanism introduces two a priori probabilities of link failure: permanent node port failure and temporary link failure due to transmission medium interference. A posteriori probability indexes for the two types of failure are obtained through Bayes decision theory. Combining a threshold on these indexes with the further detection in the second stage makes it possible to detect temporary and permanent faults separately.
It reduces the communication overhead of fault detection. Existing fault detection mechanisms send and receive test information and feedback results by broadcast and traversal. Our mechanism instead collects the paths on which a connection is successfully established within a short period after a failure and expresses them in terms of links, which maximizes the use of known information. This reduces the satellite network resources occupied by the detection mechanism and avoids placing additional burden on a network that has already failed, which could cause more serious consequences. It is therefore better suited to satellite networks, where communication resources are precious.
It addresses the problem that there is no route recovery design tailored to faults of different durations. The technology divides the route restoration process into two phases. According to the changes in the satellite network topology, the nodes affected by faults of different durations, together with their neighboring nodes, are distinguished and handled separately so that available resources are used in a timely manner.
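The Bayes decision step mentioned in the first advantage above can be illustrated with a minimal sketch. It assumes that, given a detected fault, exactly two hypotheses are considered (permanent port failure and temporary link failure), each with an a priori probability and a likelihood of the observed evidence. The numeric values, the threshold of 0.8, and the function names `fault_posteriors` and `classify_fault` are hypothetical and are not taken from the paper.

```python
def fault_posteriors(prior_perm, prior_temp, like_perm, like_temp):
    """Return (P(permanent | obs), P(temporary | obs)) using Bayes' rule.

    prior_*: a priori probability of each fault type.
    like_*:  probability of the observation under each fault type.
    """
    joint_perm = prior_perm * like_perm
    joint_temp = prior_temp * like_temp
    evidence = joint_perm + joint_temp
    if evidence == 0:
        raise ValueError("observation has zero probability under both hypotheses")
    return joint_perm / evidence, joint_temp / evidence


def classify_fault(prior_perm, prior_temp, like_perm, like_temp, threshold=0.8):
    """Label a fault if one posterior reaches the (assumed) threshold.

    Otherwise the fault is left undetermined, standing in for the links that
    are passed on to the second detection stage.
    """
    p_perm, p_temp = fault_posteriors(prior_perm, prior_temp, like_perm, like_temp)
    if p_perm >= threshold:
        return "permanent"
    if p_temp >= threshold:
        return "temporary"
    return "undetermined"


if __name__ == "__main__":
    print(fault_posteriors(0.3, 0.7, 0.9, 0.2))
    print(classify_fault(0.5, 0.5, 0.9, 0.1))
    print(classify_fault(0.3, 0.7, 0.9, 0.2))
```

Leaving an explicit "undetermined" outcome reflects the two-stage design: a link whose posterior does not clear the threshold is not forced into either class but deferred to further detection.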
The software codes used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This research was supported by the Beijing Natural Science Foundation (Grant no. 4182040) and the Open Research Fund of the State Key Laboratory of Space-Ground Integrated Information Technology under Grant no. 2015_SGIIT_KFJJ_TX_03.