Abstract

This paper focuses on routing in two-dimensional mesh networks. We propose a novel faulty block model, the cracky rectangular block, for fault-tolerant adaptive routing. All faulty nodes and faulty links are enclosed in blocks of this type, which are convex structures, in order to avoid routing livelock. In addition, the model constructs an interior spanning forest for each block in order to keep in touch with the nodes inside it. The block construction procedure is dynamic and fully distributed, and the construction algorithm is simple and easy to implement. The block is fully adaptive: it dynamically adjusts its scale according to the state of the network, on either fault emergence or fault recovery, without shutting down the system. Based on this model, we also develop a distributed fault-tolerant routing algorithm. We then give a formal proof that messages always reach their destinations if and only if the destination nodes remain connected to the mesh network. The new model and routing algorithm therefore maximize the availability of the nodes in the network, which is a noticeable overall improvement in the fault tolerance of the system.

1. Introduction

Over the last decades, many researchers have studied communication operations in networks with fixed topologies, including modeling the architectures and routing algorithms of parallel computers and of cluster or medium-area communication networks (such as metropolitan networks covering a town or a small region). The quality of such networks depends strongly on the correct and efficient execution of communication operations.

Direct networks [1] have become a popular architecture for communication networks, especially in massively parallel computer systems. In a direct network, each node (computer) is connected to only a few other nodes, its neighbours, according to the topology of the network, and nodes communicate with each other by exchanging messages. The mesh is one of the most important topologies of direct networks. Low-dimensional meshes in particular, owing to their low node degree, are more popular than high-dimensional ones. Currently, most parallel computer architectures are based on the two-dimensional mesh topology, for example, Seitz et al. 1988 [2], the Intel Touchstone DELTA [3, 4], and the Intel Paragon.

Several models based on direct networks have been studied [5-9], especially the two-dimensional mesh [10-16], for communication operations. These papers focus mainly on how to route messages in the two-dimensional mesh. Routing is the process of sending messages from source nodes to destination nodes through some intermediate nodes. A very important aspect of message routing is the ability to route from a source node to a destination node while avoiding all faulty nodes and links.

Basically, there are two types of message routing: (1) deterministic routing, in which the route between a given pair of nodes is determined in advance of transmission, and (2) adaptive routing, in which a message may take any path between its source and its final destination; that is, the path is constructed adaptively in the process of routing.

Deterministic routing algorithms have the advantage of being simple and easy to implement. However, adaptive routing can reduce network latency and increase network throughput, and, most attractively, it can tolerate more faults than deterministic routing [17]. Adaptive routing has thus emerged as an attractive field. Most papers in this field consider how to build a path between a source and destination node pair that avoids the faulty nodes, and most of this work uses the disconnected rectangular block fault model [11]. The disconnected rectangular blocks are composed of the faulty nodes and their neighboring nonfaulty nodes, chosen so that each block keeps a rectangular shape. As a result, adaptive routing can tolerate faulty nodes by bypassing these rectangles. However, in order to maintain its rectangular shape, a block has to include some nonfaulty nodes, called unsafe nodes in these papers. These unsafe nodes are never used until their corresponding blocks recover, and messages are never sent to them even when they should be (as illustrated in Figure 1).

Chien and Kim [18] present a partially adaptive algorithm for mesh networks. The basic idea is to route around any convex faulty region. If faulty regions are not naturally convex, good nodes and links are marked as faulty until the regions become convex. However, once faults are located on a boundary, all nodes on that boundary become faulty in order to tolerate the faults. Boppana and Chalasani [10] use the f-chain and f-ring, an extension of the disconnected rectangular block fault model, to route messages around faults, and the f-chain addresses the boundary problem in Chien and Kim’s paper. But f-chains and f-rings may connect with each other, which makes the routing algorithm more complex than that of [18]. In [11], Su and Shin assume a node to be the basic fault element. They construct the blocks based only on faulty nodes; thus they can tolerate faulty nodes but not faulty links. Overall, the construction of these faulty regions is static; that is, once the regions are constructed, none of the nodes in them, including the good ones, can take part in routing any more. The faulty regions are not self-adaptive either; that is, if some of the faulty nodes in a region are repaired, the region stays as it was, although it could release some good nodes and shrink while keeping a convex shape.

Adaptive fault-tolerant routing technologies are also used in WSNs (Wireless Sensor Networks), MEMS (Micro-Electro-Mechanical Systems), and SoCs (Systems on Chip) to increase usability and robustness as well as overall performance. Most network topologies adopted in those domains are 2D meshes. As a result, in recent years a number of studies have focused on fault-tolerant routing in WSNs and NoCs [19-22].

In this paper, we concentrate on fault-tolerant adaptive routing in the two-dimensional mesh. We consider not only faulty nodes but also faulty links incident with any node. Unlike the papers mentioned above, the novel cracky rectangular block strategy can route messages both around a cracky rectangular block and along the cracks in it (the term is a metaphor; messages are actually routed along the connected links inside the faulty block). So we can route messages to nodes both outside and inside the faulty blocks. This is a noticeable overall improvement in the fault tolerance of the system. At the same time, the cracky rectangular block is fully self-adaptive and can tolerate dynamic faults. For example, when some of the faulty nodes or faulty links in a block are repaired, the original block may become a smaller block or split into several smaller ones, each keeping a rectangular shape. Tolerating dynamic faults can extend the run-time life of a multicomputer, thus increasing reliability.

The rest of this paper is organized as follows. Section 2 describes the basic routing algorithm in the two-dimensional mesh. Section 3 introduces the cracky rectangular block strategy, including the cracky rectangular block model and the routing algorithm on it, and describes how the blocks adapt themselves to the state of the network. Section 4 proves that a message will be delivered to any destination in the mesh as long as the mesh remains connected. Section 5 concludes the paper and presents possible directions for future work.

2. Basic Routing Algorithm in Two-Dimensional Mesh

2.1. Two-Dimensional Mesh

It is convenient to represent a two-dimensional mesh with graph terminology. Let G = (V, E) be an undirected graph representing a network. The vertex set V of G represents the nodes of the network, and the edge set E of G represents the links between the nodes. Note that we keep using the terms node and link in this paper.

The two-dimensional mesh , where and are positive integers, is defined as follows.(i)A node of is represented by , , and .(ii)There is a link between two different nodes and if and only if and or and . We denote this link by .

For each and , with and , we call row and column the subgraphs of , respectively, induced by nodes and .

The boundary of is the subgraph of induced by the rows and and the columns and .

Given any node , let be the set of nodes adjacent to in (called as neighbours). Given a nonboundary node of the two-dimensional mesh, the four neighbours of are denoted by , , , and .

For each pair of nodes and , the distance between and , denoted by , is the length (number of links) of a shortest path between and . We define the 1-distance and 2-distance between and , respectively, by and . From the above definition, we know that .
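As an illustration of the definitions in this subsection, the following Python sketch models the nodes, neighbours, and distances of a two-dimensional mesh. The 0-based (row, column) coordinates and the names `neighbours`, `dist1`, `dist2`, and `distance` are assumptions made for this example; they are not the notation of the paper.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (row, column), 0-based: an assumption of this sketch


def neighbours(node: Node, rows: int, cols: int) -> List[Node]:
    """Return the nodes adjacent to `node` in a rows x cols mesh."""
    r, c = node
    # Candidate neighbours in the order north, east, south, west.
    candidates = [(r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1)]
    # Boundary nodes have fewer than four neighbours.
    return [(i, j) for i, j in candidates if 0 <= i < rows and 0 <= j < cols]


def dist1(x: Node, y: Node) -> int:
    """Distance between x and y along the first dimension."""
    return abs(x[0] - y[0])


def dist2(x: Node, y: Node) -> int:
    """Distance between x and y along the second dimension."""
    return abs(x[1] - y[1])


def distance(x: Node, y: Node) -> int:
    """Shortest-path length in a fault-free mesh: the sum of both distances."""
    return dist1(x, y) + dist2(x, y)
```

A nonboundary node has four neighbours, a corner node has two, and the distance between two nodes is the sum of their 1-distance and 2-distance.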

2.2. A Basic Routing Function in Two-Dimensional Mesh

Consider a network , in each node , for each message with final destination , arriving on a link ; we denote by the subset of ’s neighbours bringing closer to its destination if ; otherwise, the message is absorbed by . Actually it is a routing function, this kind of routing is said to be local because it is independent of what happened in the rest of the network and can be computed locally by each router.

The basic routing function is a classical greedy routing function in the two-dimensional mesh as follows. Let be two different nodes in . A message with destination received by the router of arriving from node is sent to a node of , that is, a set of at most two nodes (and at least one node) defined as follows. There are at most two nodes and at distance from . Moreover, when , if (resp., ), then the routing function will choose to send the message to (resp., ). If this is not possible (e.g., the link incident with the chosen node is faulty), then the routing function tries to send the message to (resp., ). Moreover, if both the links and are faulty, then the router of can route to any node of .
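The greedy behaviour described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the routing function of the paper: the names `link` and `next_hops`, the 0-based coordinates, and in particular the tie-break (try the dimension with the larger remaining offset first) are hypothetical choices made for the example.

```python
from typing import FrozenSet, List, Set, Tuple

Node = Tuple[int, int]   # (row, column), 0-based (assumed convention)
Link = FrozenSet[Node]   # an undirected link is an unordered pair of nodes


def link(a: Node, b: Node) -> Link:
    return frozenset((a, b))


def next_hops(cur: Node, dst: Node, rows: int, cols: int,
              faulty: Set[Link]) -> List[Node]:
    """Return candidate next hops in preference order ([] if cur == dst)."""
    if cur == dst:
        return []  # the message is absorbed by its destination
    r, c = cur
    adjacent = [(r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1)]
    # Keep only neighbours inside the mesh that are reachable over a good link.
    usable = [n for n in adjacent
              if 0 <= n[0] < rows and 0 <= n[1] < cols
              and link(cur, n) not in faulty]

    def dist(n: Node) -> int:
        return abs(n[0] - dst[0]) + abs(n[1] - dst[1])

    def offset(n: Node) -> int:
        # Remaining offset in the dimension along which n lies.
        return abs(r - dst[0]) if n[0] != r else abs(c - dst[1])

    # At most two neighbours bring the message closer to its destination.
    closer = [n for n in usable if dist(n) < dist(cur)]
    # Assumed tie-break: try the dimension with the larger offset first.
    closer.sort(key=offset, reverse=True)
    # If no usable link leads closer, any usable neighbour may be chosen.
    return closer if closer else usable
```

When both links toward the destination are faulty, the function falls back to the remaining usable neighbours, which is the case that can lead to the blocking situation discussed in the next subsection.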

2.3. Blocking Situation and Its Traditional Solution

Consider now that a unique message is transmitted to the two-dimensional mesh . As we will show below, in case of some link faults which do not disconnect the network, using the basic two-dimensional routing function does not guarantee that will reach its destination. It can be blocked in a part of the network.

As shown in Figure 2(a), the basic routing function can unfortunately lead to blocking situations due to some properties of the structure of the faulty links. Clearly, in this example the message will always remain in the subgraph induced by nodes , and and will never reach its destination node . This message is in a livelock situation, in which a message keeps moving indefinitely without reaching its destination.

It is well known that adaptive routing may cause livelock problems. Routing without livelock is therefore one of the most important design issues for communication operations in multicomputer systems (note that we only consider the livelock situation in this paper; the deadlock problem can be solved with more sophisticated methods [1, 23, 24]). At present, the livelock situation is well addressed by the traditional disconnected rectangular block fault model (rectangular block for short). However, in this model the usability and robustness of the mesh network gradually decrease as the number of faulty nodes increases. The experiments in [25] show that the distribution of faulty nodes tends to turn the whole mesh into one “big block.” With the rectangular model, only one faulty block is left when the fault rate of nodes is 15 percent and the size of the two-dimensional mesh is . Consequently, the whole mesh becomes useless because this big faulty block occupies the entire mesh region; we call this the “big block” problem. The novel cracky rectangular faulty block strategy, which we introduce in the next section, makes full use of the nonfaulty nodes and links in the mesh. All the nonfaulty nodes and links that would have been included in the original rectangular faulty blocks can now become candidate routing nodes and links.
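To make the cost of the traditional rectangular model concrete, the sketch below groups faulty nodes into disjoint rectangles and lists the nonfaulty nodes that end up enclosed. The merging rule used here (repeatedly merge rectangles that overlap or touch, including diagonally) is a simplification assumed for illustration; it is not the exact construction of [11], and the names `touches`, `rectangular_blocks`, and `unsafe_nodes` are hypothetical.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]            # (row, column)
Rect = Tuple[int, int, int, int]  # (top, bottom, left, right), inclusive


def touches(a: Rect, b: Rect) -> bool:
    """True if the two rectangles overlap or are adjacent (also diagonally)."""
    return (a[0] <= b[1] + 1 and b[0] <= a[1] + 1 and
            a[2] <= b[3] + 1 and b[2] <= a[3] + 1)


def rectangular_blocks(faulty: Set[Node]) -> List[Rect]:
    """Merge faulty nodes into disjoint, non-touching rectangles."""
    blocks = [(r, r, c, c) for r, c in sorted(faulty)]
    merged = True
    while merged:  # repeat until no pair of rectangles touches
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if touches(blocks[i], blocks[j]):
                    a, b = blocks[i], blocks[j]
                    # Replace the pair by its bounding rectangle.
                    blocks[i] = (min(a[0], b[0]), max(a[1], b[1]),
                                 min(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return blocks


def unsafe_nodes(faulty: Set[Node]) -> Set[Node]:
    """Nonfaulty nodes enclosed in some rectangle, hence lost for routing."""
    covered = {(r, c)
               for top, bottom, left, right in rectangular_blocks(faulty)
               for r in range(top, bottom + 1)
               for c in range(left, right + 1)}
    return covered - faulty
```

For example, two diagonally adjacent faulty nodes produce one 2 x 2 block that encloses two nonfaulty nodes. As the fault rate grows, such merges cascade, which is the mechanism behind the big block problem.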

3. Adaptive Fault-Tolerant Strategy with Cracky Rectangular Block

In order to solve the livelock situation and the big block problem, we propose a novel strategy for fault-tolerant routing. We use the cracky rectangular block to avoid livelock and to traverse every connected internal node of a block if needed. Therefore, we can transmit each message to any node not only outside a block but also inside it, as in Figure 2(b), where the message can reach the inside nodes , and which are forbidden in the original rectangular block.

Formally, a rectangular block is a submesh of the mesh induced by the nodes with , for each . Let be a node. By definition, if, for each , we have , then belongs to the inside part of the rectangular block . Else, if (or ) and and , then belongs to the border of the rectangular block.

A cracky rectangular block (cracky block for short) is a rectangular block with an internal spanning forest induced by all the connected nodes inside the block. All the roots of the forest belong to the border of the cracky block, and the spanning forest connects each internal node to its root if and only if that node remains connected.

Figure 3 presents two instances of the cracky block in a two-dimensional mesh, which are and , respectively. is a general cracky block, while is a cracky block which is induced by the faulty links on the boundary of the mesh, and it is an incomplete cracky block.

3.1. Construction of the Cracky Rectangular Block

Each node’s activities are message-driven. Two types of messages are routed in the mesh. The first is the entity message (message for short), which is routed between any pair of nodes. The second is the system message, which can only be sent between neighbours and mainly carries the sender’s status, such as the node’s faulty degree and the detailed state of its faulty links. The former is the entity for computing or communication, and the latter serves to maintain the usability and robustness of the network; in other words, it is used to construct the cracky block when faults occur in the mesh, in order to avoid the livelock situation mentioned above.

In the beginning, all the nodes work well; that is, there is no faulty node or faulty link. Any node can send messages to and receive messages from any of its neighbours, according to the basic routing strategy. When some nodes or links fail, the failed nodes and the nodes incident with failed links immediately determine their current status and then, as soon as possible, send system messages containing that status to their connected neighbours to tell them in detail what has happened. Once a neighbour receives a system message, it immediately determines its current status from its latest status and the received message. It notifies its connected neighbours of its current status if and only if the current status differs from the previous one. Finally, a stable cracky block is constructed through this exchange of system messages.

Before exposing our distributed algorithm to construct the cracky block, we will give some definitions first. For a two-dimensional mesh , the faulty degree of a node is the number of failed links incident with , and we denote it by . From the observation of a cracky block, there are three types of nodes in a mesh network: faulty node, good node, and border node. A faulty node belongs to the interior of a cracky block and , and oppositely, a good node allocates outside of any cracky blocks and . Of course a border node belongs to the border of a cracky block with . For example, in Figure 3, , and in the cracky blocks and are faulty nodes, and are border nodes, and any node outside of and is good node.
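The faulty degree defined above is simply a count of failed incident links, as the following sketch shows. The function name `faulty_degree`, the 0-based coordinates, and the representation of a link as an unordered pair of nodes are assumptions of this example. The thresholds that separate faulty, border, and good nodes are not reproduced here.

```python
from typing import FrozenSet, Set, Tuple

Node = Tuple[int, int]   # (row, column), 0-based (assumed convention)
Link = FrozenSet[Node]   # an undirected link is an unordered pair of nodes


def faulty_degree(node: Node, rows: int, cols: int, failed: Set[Link]) -> int:
    """Number of failed links incident with `node` in a rows x cols mesh."""
    r, c = node
    adjacent = [(r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1)]
    return sum(1 for n in adjacent
               if 0 <= n[0] < rows and 0 <= n[1] < cols
               and frozenset((node, n)) in failed)
```

Each node can compute this value locally, since it only needs to know the state of its own links.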

Given any node , let be the set of neighbors of such that the link is not faulty. Moreover, we set an order in as follows: . Let be the node who sends message to node , and let be the node who will receive the message sent by .

We denote by the status of , and is one of the elements of status set . The status of a node will indicate which type of node it is. In detail, there are two more status of , which are empty set and universal set . And shows that the node is a faulty node, identifies a good node, and if , then must be a border node. For example, if , then the node locates at the north border of a cracky block, like in Figure 3, or if , then is at the northeast corner, just like . Totally, the system message can be sent to four neighbours, and will be , and according to , and . A system message sent by node is which includes the destination neighbour and sender’s current status. For example, means that this message will send to and . We define an operation to implement the status judgment of a node who receives a novel system message. The corresponding algorithm to update the status for any node in mesh network is given by Algorithm 1.

Input:   : faulty degree of node .
Output:   : status of node .
 procedure INITIAL_STATUS
if   then
  
else if   then
   depending on Table 1
else if   then
  
end if
 end procedure
Input:   : the current status of node , : the system message received by node .
Output: : the updated status of node .
 procedure UPDATE_STATUS
  if     then
  end if
 end procedure
Input:   : the updated status of node .
Output: : the system message to be sending to .
 procedure NOTICE_STATUS
  if     then
send system message to
  else if     then
send system message to both and if exist
  else if      then
send system message to both and if exist
 end if
end procedure

At the beginning of the construction, every node should run the procedure initial_status respectively to make sure its status according to its faulty degree . After finishing the above procedure, node will run the procedure notice_status to send system messages to neighbours according to its status . Once a neighbour node receives this type of message, it will run the procedure update_status to refresh its latest status. Actually, this process will be repeated until every node’s status getting stable. Finally, there will emerge some cracky blocks in the mesh. For example, Figure 4 shows a distributed process to construct a cracky block. We just pick up four nodes, , and , to describe how the algorithm performs. During the first phase, node will initial its status because of ; meanwhile as a result of and separately. In the second phase, according to the algorithm only node will send system message to its connected neighbours which are nodes and . Finally node receives two system messages and will refresh its new status by , so it will be the northwest of a cracky block for this moment. For nodes and , and are a west border node and a faulty node. When the algorithm stops, there is a border of crack block as shown in Figure 4 by the bold line. The macroconstruction of a block depends on the microdistributed message exchange activities of relative nodes.
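The repeated exchange of system messages until every status is stable can be pictured as a worklist loop. The sketch below is generic and hypothetical: `propagate`, `adjacent`, and `update` are names introduced for this example, and the update operation is passed in as a parameter because it stands in for the status operation of Algorithm 1, which this sketch does not attempt to reproduce. The usage example uses plain set union purely as a stand-in.

```python
from collections import deque
from typing import Callable, Dict, FrozenSet, Hashable, Iterable

Status = FrozenSet[str]  # e.g. a subset of the directions {"N", "E", "S", "W"}


def propagate(initial: Dict[Hashable, Status],
              adjacent: Callable[[Hashable], Iterable[Hashable]],
              update: Callable[[Status, Status], Status]
              ) -> Dict[Hashable, Status]:
    """Deliver system messages until no node changes its status."""
    status = dict(initial)
    # Every node with a non-empty initial status notifies its neighbours.
    queue = deque((dst, s) for src, s in status.items() if s
                  for dst in adjacent(src))
    while queue:
        dst, payload = queue.popleft()
        new = update(status[dst], payload)
        # A node notifies its neighbours only if its status changed.
        if new != status[dst]:
            status[dst] = new
            queue.extend((nxt, new) for nxt in adjacent(dst))
    return status


# Usage example on a line of three nodes, with set union as a stand-in update.
line = {0: frozenset({"N"}), 1: frozenset(), 2: frozenset()}
result = propagate(line,
                   lambda n: [m for m in (n - 1, n + 1) if 0 <= m <= 2],
                   lambda old, msg: old | msg)
```

With a monotone update over a finite status set, the loop terminates because each node can change its status only a bounded number of times.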

Then we will construct the spanning forest for the faulty nodes. We say that the faulty node is hung if and only if it chooses exactly one neighbor as predecessor. Denote by the predecessor of , and is a faulty node and . We consider an order over the elements of . We denote by the th element of , with . A node is said to be final if . A node which is not hung is free. We denote by these two boolean states , which refers to the hung and free status for each node inside of the block. After running Algorithm 2, the spanning forest for the block will be accomplished; like and in Figure 3, every node inside of blocks and will find only one predecessor and be marked as hung.

Input:   : node is in a cracky rectangle block.
Output:   : , : predecessor of .
 procedure HUNGNODEONFOREST
if   is a or node then
   
   return
else if   is a faulty node then
    
end if
  
check for all until s.t.
 end procedure
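The outcome of the forest construction can be illustrated with a centralized breadth-first sketch. It is an assumption-laden stand-in for the distributed Algorithm 2, not a transcription of it: the name `spanning_forest`, the fixed north, east, south, west neighbour order, and the link representation are hypothetical. Border nodes act as roots, each interior node reachable over nonfaulty links gets exactly one predecessor (it is hung), and unreachable interior nodes stay free.

```python
from collections import deque
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple

Node = Tuple[int, int]  # (row, column)


def spanning_forest(border: Iterable[Node], interior: Set[Node],
                    failed: Set[FrozenSet[Node]]
                    ) -> Dict[Node, Optional[Node]]:
    """Map every hung node to its predecessor; roots map to None."""
    pred: Dict[Node, Optional[Node]] = {b: None for b in border}
    queue = deque(pred)  # start from the border nodes (the roots)
    while queue:
        cur = queue.popleft()
        r, c = cur
        # Assumed fixed neighbour order: north, east, south, west.
        for nxt in ((r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1)):
            if (nxt in interior and nxt not in pred
                    and frozenset((cur, nxt)) not in failed):
                pred[nxt] = cur  # nxt is now hung on cur
                queue.append(nxt)
    return pred
```

Interior nodes missing from the returned mapping are the ones disconnected from the border, so no message can reach them.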

3.2. The Cracky Rectangular Block Is Stable

A node is said to be stable if its status can never change to free during the running time of the algorithm. In particular, the nodes of the border of the cracky block are stable. A cracky block is said to be stable if all the nodes belonging to it are stable.

To prove that the cracky block is stable, we assume that there exists a set of nonstable nodes. If , then is free and can never become stable during the running time of Algorithm 2, so in this case there is no stable node in the neighborhood of because otherwise will choose this node as predecessor. Using the same argument for each node of , there is no stable node in the neighborhood of . But since the graph is connected, is necessarily joined by a path to a node of the border of the block. So we are in contradiction with the fact that the nodes of the border are stable. If , then since the graph is connected there exists at least one node in which is adjacent to a node . Clearly, is stable. From the second loop of the algorithm and since there is an order in the neighbors of for the choice of its predecessor, there exists a step in the algorithm which leads the node to choose as predecessor. After this step, let . Using the same arguments, after some steps of the algorithm, the set would be empty. So all the nodes are stable.

3.3. Adaptive Routing with the Cracky Rectangular Blocks

In this section, we give the global fault-tolerant routing strategy. Once a message encounters a cracky block, which encloses the faulty nodes and links, it bypasses the block along its border nodes in a clockwise (or counter-clockwise) manner. While bypassing the block, the message also traverses, by depth-first search, the interior spanning tree rooted at each border node it passes. Finally, either the message leaves the cracky block at the corner nearest to the destination and continues with the basic routing function, or it is absorbed by an interior node, which must be the destination node.
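The depth-first traversal of an interior tree can be sketched as a walk that enters each subtree and returns to the parent afterwards, so that the message ends up back at the border node it started from. The function name `dfs_walk`, the predecessor mapping used as input, and the sorted order in which children are visited are assumptions of this example.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[int, int]  # (row, column)


def dfs_walk(root: Node, pred: Dict[Node, Optional[Node]]) -> List[Node]:
    """Return the sequence of nodes a message visits below `root`."""
    # Invert the predecessor mapping into a children mapping.
    children: Dict[Node, List[Node]] = {}
    for node, parent in pred.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)

    walk = [root]

    def visit(node: Node) -> None:
        for child in sorted(children.get(node, [])):
            walk.append(child)  # go down one tree link
            visit(child)
            walk.append(node)   # come back up to the parent

    visit(root)
    return walk


# Usage example: a border root (1, 0) with three interior nodes hung below it.
tree = {(1, 0): None, (1, 1): (1, 0), (1, 2): (1, 1), (2, 1): (1, 1)}
path = dfs_walk((1, 0), tree)
```

Every consecutive pair of nodes in the walk is joined by a tree link, so the walk uses only nonfaulty links, and it visits each node of the tree at least once.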

We now give the complete local routing function we run in each node of , as shown in Algorithm 3. This algorithm is based on the basic routing function we have defined in Section 2.

Input:   : node routing messages, : the node who sends messages to node .
Output:   : the node who will receive the messages.
 procedure ROUTING
if   is a good node then
   basic routing function with node
else if   is a faulty node then
   if ( is final) or then
    
   else if   and then
    
   else if     then
    
  end if
else if   is a border node then
  routing according to Table 2.
else if   is a node belongs to the border of mesh then
  if   is final then
   
  else
   
   
  end if
end if
 end procedure

The cracky rectangular block and the adaptive fault-tolerant algorithm make up the fault-tolerant strategy, and we can use Algorithm 3 to send a message from any connected node to arbitrary connected node. For example, in Figure 3, the good node wants to send a message to the node , but node is a faulty node and locates interior the cracky block , the algorithm will send this message along the path shown in the figure, and the faulty node sending message to another faulty node also can be accomplished by the algorithm; if a good node wants to communicate with another good node , the routing path will like the situation depicted in the figure.

3.4. Self-Adaptive and Faulty Boundary Independency of Cracky Rectangular Block

For high performance and usability, the cracky blocks should be self-adaptive. As we know, the emergence of cracky blocks in a mesh is the result of nodes managing themselves in a distributed and independent way. The status of each individual node is closely related to that of its neighbours. Therefore the size and shape of a block change dynamically with the faulty nodes. In other words, if some of the faulty nodes are repaired, the original block may shrink or split into smaller ones. Conversely, if some good nodes or links fail, new cracky blocks appear or some of the existing cracky blocks grow.

Given a two-dimensional mesh , let be a faulty node in cracky block . Let with , and let with . There is a fact that when has been repaired such that , then we should make sure if (resp., ). If they are, then (resp., ) may be cancelled from the block and will become four smaller ones at most. In addition, these new cracky blocks still keep stable. The cancelled row or column may becoming the new border belonging to those new cracky blocks, alternatively becoming the good ones outside any blocks, so they will still keeping hung, certainly their successors will also keeping hung. To implement the above, when a node with its incident links is fixed well, we just send a recovery signal to its four neighbours to rerun the procedure initial_status in Algorithm 1. Recursively, the recovery signal will be sent to nodes which connected with the faulty nodes received the signal until it meets the good node outside the cracky block.

For example, Figure 5(a) shows a cracky block, and are two faulty nodes with , , and . When the two nodes and have been repaired, they all changed to good nodes with . The cracky block will become like Figure 5(b), and , and become the new border and the cracky still keeps stable (note that we do not give the detail because of the page limitation).

Our adaptive fault-tolerant routing strategy is faulty boundary independency; that is, if there exists a fault occurring on the boundary of the mesh, the strategy is still running. Lines 14 to 21 in Algorithm 3 give a solution to this situation. When some of the boundary nodes of mesh have failed, then the corresponding cracky block will be constructed like in Figure 3. As shown, it is an incomplete cracky block. If wants to send a message to , the message will first go to node according the basic routing function. Because is the border node of , the message will be sent to which is a boundary node of the mesh. When the message traverses all the successors of , it will be rebound to the node to continue routing and finally find its destination. To sum up, the message will continue routing when it encounters a mesh boundary because of the rebound function.

4. The Cracky Rectangular Model Is Creditable

The next two propositions show that, with the above algorithm, each message will reach its destination. If a message arrives at a node of a cracky block, then (i) if its destination is inside the block, the message will reach it; (ii) if its destination is outside the block, the message will leave the block closer to its destination than before.

Proposition 1. Consider a cracky block. Using Algorithm 3, if a message has its destination inside the block and arrives at a node of the block, then it will reach its destination.

Proof. Consider a subgraph of the mesh induced by the nodes of a cracky block . By definition of Algorithm 3, a message moving on follows a circuit crossing each node at least once.
Consider a message moving in the mesh, with destination , reaching a node . By definition of the routing function, since and , then will never want to take an arc out from the . Thus, it follows the circuit of induced by the algorithm. So, will arrive in node .

Proposition 2. Consider a cracky block. Using Algorithm 3, if a message has its destination outside the block and arrives at a node of the block, then it will leave the block and be closer to its destination than before.

Proof. Let be a message with destination . Suppose that moves from a node to another node by the dimension according to the routing algorithm; that is, and . We claim that, in the following routings, except the special case that moves along a quasi-Hamiltonian cycle of a cracky block, the routing of will never augment the -dimension distance.
To prove the claim, it is clear that by the routing algorithm, if do not meet a cracky block, it cannot move from a node to another one with . Suppose now that moves from to meet a node , with , of a cracky block by the dimension and suppose that it leaves by a node to a node by the dimension . According to the routing algorithm, without loss of generality, we may assume that the movement of from to does not augment the -dimension distance; that is, . By the routing function, we have
Assume that . We have all equalities in , in particular . For any , , it follows that , which implies that would leave from the node by the -dimension and hence . This gives a contradiction. Therefore we have and the claim holds.
Consequently, if leaves a cracky block , it will never be sent back to .

5. Concluding Remarks

In this paper, we propose a cracky rectangular fault block model for fault-tolerant adaptive routing in two-dimensional mesh interconnection networks. This model improves on the widely used rectangular model by taking faulty links, and not only faulty nodes, into consideration when constructing cracky blocks. We construct a spanning forest, rooted at the border nodes, for all connected nodes in the cracky blocks, so that a message can traverse all the nodes inside a block by a kind of depth-first search. As a result, in the cracky block model, all the connected nodes inside a block, which would otherwise have been useless, can be used for routing. Meanwhile, the cracky blocks manage their size and scale in a self-adaptive way; that is, the number or size of cracky blocks grows as faulty nodes and links increase and, conversely, shrinks as nodes and links are repaired. Based on the cracky block model, an algorithm is proposed to route messages in the two-dimensional mesh without livelock. The strategy is independent of faulty boundaries; that is, it also handles faults occurring on the mesh boundary. It thereby improves the robustness and performance of two-dimensional mesh interconnection networks.

In the future, we will extend this strategy to multidimensional mesh networks. We have already verified that the construction method is suitable for multidimensional mesh networks, and we will next attempt to extend the routing algorithm to find a circuit on multidimensional cracky blocks. In addition, we will add a routing table to the cracky blocks to minimize the total number of routing hops. These results will appear in our next paper.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank the Natural Science Foundation of P. R. China (61300230), the Key Science and Technology Foundation of Gansu Province (1102FKDA010), the Natural Science Foundation of Gansu Province (1107RJZA188), and the Fundamental Research Funds for the Central Universities for supporting this research.