#### Abstract

Establishing high-speed and reliable underwater acoustic networks among multiple unmanned underwater vehicles (UUVs) is fundamental to cooperative and intelligent control among UUVs. Unlike in terrestrial networks, however, the propagation speed in underwater acoustic networks is only about 1500 m/s, which makes the design of underwater acoustic MAC protocols a major challenge. In multichannel MAC protocols, data packets and control packets are transferred through different channels, which reduces the adverse effects of the acoustic channel and has made these protocols a popular topic in underwater acoustic MAC research. In this paper, we propose a control packet collision avoidance algorithm that uses time-frequency masking to deal with control packet collisions in the control channel. The algorithm is based on the sparsity of noncoherent underwater acoustic communication signals and treats collision avoidance as the separation of a mixture of communication signals from different nodes. We first measure the W-Disjoint Orthogonality of MFSK signals; the simulation results demonstrate that there exists a time-frequency mask that can separate the source signals from the mixture. We then present a pairwise-hydrophone separation system based on deep networks and the location information of the nodes, with which the time-frequency mask can be estimated.

#### 1. Introduction

Underwater acoustic networks are the key technology for realizing cooperative and intelligent control among multiple UUVs [1, 2]. Compared with terrestrial wireless networks, however, the propagation velocity in underwater acoustic networks is only about 1500 m/s and the available bandwidth is very limited. Moreover, time delay, Doppler spread, and noise interference cannot be avoided. These adverse factors make MAC protocol design, a key technology of underwater acoustic network communication, a major challenge and restrict the improvement of underwater acoustic networks. At present, most researchers classify underwater acoustic MAC protocols into three types according to their multiuser access mechanism: contention-free, contention-based, and hybrid. Figure 1 illustrates the existing underwater acoustic MAC protocols and their categories [3, 4].

Because of the long delay of the underwater acoustic channel, maintaining the real-time state between adjacent nodes is difficult. Simple contention-free protocols were therefore the first to be used in underwater acoustic networks, including frequency division multiple access (FDMA), time division multiple access (TDMA), and code division multiple access (CDMA). FDMA divides the available frequency band into subbands and allocates each subband to a specific node, which is simple and reliable. However, its low bandwidth availability ratio is a major disadvantage [3–5]. To solve this problem, orthogonal frequency division multiplexing (OFDM) has been introduced into FDMA. The principle is to choose a proper frequency band according to the communication distance, which improves the bandwidth availability ratio [6, 7]. To improve channel utilization fundamentally, researchers began to study MAC protocols based on TDMA. The ST-MAC protocol, which translates the time slot assignment problem into a vertex-coloring problem, was proposed in [8]. Reference [9] puts forward the DSSS protocol, which exploits the transmission delay of the underwater acoustic channel to arrange concurrent conflict-free transmissions. The STUMP protocol, which relaxes the requirement for time synchronization, was presented in [10]. Unlike TDMA, CDMA protocols distinguish users through pseudonoise codes, which gives high channel utilization with a simple algorithm but cannot avoid the inherent “near-far effect.”

Compared with the contention-free protocols above, which allocate channels beforehand, contention-based protocols, which allocate channels according to the needs of the nodes, achieve higher channel utilization. The essence of a contention-based protocol is channel reservation: nodes reserve channel resources by exchanging handshaking messages before initiating data communication [11, 12]. Multichannel MAC protocols transmit the handshaking messages via an independent channel, allow several node pairs to transmit information concurrently, make better use of the network bandwidth, and reduce overhead when the network load is heavy [13–15]; they have therefore attracted the attention of researchers recently. The open issue for multichannel MAC protocols is solving the new problems they introduce, especially the collision problem in the control channel. Zhou et al. from the University of Connecticut adopt joint detection by adjacent nodes to tackle the triple hidden terminal problems typical of multichannel MAC protocols [16].

In this paper, we propose a control packet collision avoidance algorithm that uses time-frequency masking to deal with control packet collisions in the control channel. The algorithm is based on the sparsity of noncoherent underwater acoustic communication signals and treats collision avoidance as the separation of a mixture of communication signals from different nodes. The remainder of the paper is organized as follows. Section 2 briefly discusses the W-Disjoint Orthogonality and the sparsity of the MFSK signal; the simulation results demonstrate that there exists a time-frequency mask that can separate the source signals from the mixture of communication signals. Section 3 outlines the proposed separation system, discusses the low-level features we used, and gives details of the deep networks, including their structure and training method. Section 4 shows the simulation results of the source separation system under different conditions, including different signal-to-noise ratios and different bandwidth ratios.

#### 2. W-Disjoint Orthogonality of the MFSK Signals

MFSK is a classic noncoherent modulation scheme that is regarded as robust to the complex underwater acoustic channel. Because its bandwidth efficiency is lower than that of coherent modulation such as PSK, MFSK is usually not considered a good choice for the physical layer of underwater acoustic networks. However, the lower bandwidth efficiency means that the MFSK signal is sparse in the time and frequency domains. Like the speech signals discussed in [17], MFSK mixtures can be separated into their sources by time-frequency masking. When control packets collide, the received signal can be seen as an MFSK mixture, and the sparsity of the MFSK signal in the time and frequency domains offers the potential for dealing with control packet collisions.

In this section, we focus on the W-Disjoint Orthogonality and the sparsity of MFSK signals, showing that there exists a time-frequency mask that can separate the source signals from a mixture of MFSK signals. In this paper, we consider only MFSK modulation as the physical layer of the underwater acoustic network.

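As with the model of a speech mixture, the model of the MFSK mixture can be written as follows:

$$x(t) = \sum_{j=1}^{N} s_j(t), \tag{1}$$

where $s_j(t)$ is the signal from node $j$ and $N$ is the number of nodes whose signals are mixed.

With the short-time Fourier transform, we obtain the model of the MFSK mixture in the time and frequency domain through

$$X(\tau, \omega) = \sum_{j=1}^{N} S_j(\tau, \omega), \tag{2}$$

where $\tau$ and $\omega$ are the indexes of time frame and frequency point, respectively. Assuming the $S_j(\tau, \omega)$ are W-Disjoint Orthogonal, at most one of the node signals will be nonzero for a given $(\tau, \omega)$. To separate the node signal $S_j$ from the mixture $X$, we create a time-frequency mask for each node and apply these masks to the mixture to obtain the original node signals. For instance, with the mask for node $j$ defined as

$$M_j(\tau, \omega) = \begin{cases} 1, & S_j(\tau, \omega) \neq 0, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

which is an indicator function, the MFSK signal from node $j$ is derived via

$$S_j(\tau, \omega) = M_j(\tau, \omega)\, X(\tau, \omega). \tag{4}$$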

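However, for the MFSK signal, the assumption of W-Disjoint Orthogonality is not strictly satisfied. When the sparsity of the MFSK signals in the time and frequency domain is taken into consideration, approximate W-Disjoint Orthogonality is satisfied. To measure the W-Disjoint Orthogonality achieved by a T-F mask $M$, we use the combined performance criteria PSR (preserved-signal ratio) and SIR (signal-to-interference ratio) proposed by Yilmaz and Rickard in [17]:

$$\mathrm{PSR}_M = \frac{\|M S_j\|^2}{\|S_j\|^2}, \qquad \mathrm{SIR}_M = \frac{\|M S_j\|^2}{\|M Y_j\|^2}, \qquad \mathrm{WDO}_M = \frac{\|M S_j\|^2 - \|M Y_j\|^2}{\|S_j\|^2} = \mathrm{PSR}_M - \frac{\mathrm{PSR}_M}{\mathrm{SIR}_M}, \tag{5}$$

where $S_j$ is the time-frequency representation of the target node signal and $Y_j$ is that of the sum of the interfering node signals.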

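Because the probability that more than two packets collide in the control channel is quite small, we generated a series of MFSK mixtures containing only two node signals and calculated the PSR, SIR, and WDO of the T-F mask by the Monte Carlo method. The T-F mask for source separation is derived as follows:

$$M_1(\tau, \omega) = \begin{cases} 1, & |S_1(\tau, \omega)| > |S_2(\tau, \omega)|, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

where $S_1$ and $S_2$ are the time-frequency representations of the two node signals; the mask for the second node is obtained by exchanging their roles.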

According to the definition of W-Disjoint Orthogonality, the corresponding T-F mask comes closer to satisfying W-Disjoint Orthogonality as the signal becomes sparser. The sparsity of the MFSK signal is reflected by its bandwidth ratio: the lower the bandwidth ratio of the signal, the higher its sparsity.

According to [17], a mixed signal can be demixed through a T-F mask when the value of WDO is close to 1. In accordance with the simulation results shown in Figures 2, 3, and 4, we believe that there exists a T-F mask that can separate the sources from the MFSK mixture with high quality when the bandwidth ratio is less than 0.5.
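The measurement described in this section can be illustrated with a minimal sketch. It assumes an idealized setting that is not taken from the paper: two noise-free, symbol-synchronous 8-FSK signals sharing one tone set, a non-overlapping rectangular-window STFT, and hypothetical values for the sampling rate, tone spacing, and symbol length. The metric definitions follow [17]; the helper names `mfsk_signal`, `stft`, and `wdo_metrics` are illustrative only.

```python
import numpy as np

def mfsk_signal(symbols, freqs, fs, sym_len):
    """Concatenate one rectangular-pulse tone per symbol (zero initial phase)."""
    t = np.arange(sym_len) / fs
    return np.concatenate([np.cos(2 * np.pi * freqs[s] * t) for s in symbols])

def stft(x, frame_len):
    """Non-overlapping rectangular-window STFT: rows are frames, columns are bins."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.fft.rfft(frames, axis=1)

def wdo_metrics(S_target, S_interf):
    """PSR, SIR and WDO of the binary mask that keeps units where the target dominates."""
    mask = np.abs(S_target) > np.abs(S_interf)
    e_target = np.sum(np.abs(S_target) ** 2)
    e_kept = np.sum(np.abs(S_target[mask]) ** 2)   # target energy preserved by the mask
    e_leak = np.sum(np.abs(S_interf[mask]) ** 2)   # interference energy passed by the mask
    psr = e_kept / e_target
    sir = e_kept / e_leak if e_leak > 0 else np.inf
    wdo = (e_kept - e_leak) / e_target
    return psr, sir, wdo

# Hypothetical parameters: tones fall exactly on STFT bins (8000 / 256 = 31.25 Hz per bin).
rng = np.random.default_rng(0)
fs, sym_len = 8000, 256
freqs = 1000 + 250 * np.arange(8)
s1 = mfsk_signal(rng.integers(0, 8, 200), freqs, fs, sym_len)
s2 = mfsk_signal(rng.integers(0, 8, 200), freqs, fs, sym_len)
psr, sir, wdo = wdo_metrics(stft(s1, sym_len), stft(s2, sym_len))
```

In this toy setting the mask loses target energy only in frames where both nodes happen to send the same tone, so WDO stays well above 0.5; a Monte Carlo study such as the one in the paper would repeat this over many random mixtures and bandwidth ratios.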

#### 3. The Source Separation System Using Deep Networks

In this section we will outline the proposed separation system, discuss the low-level features we used, and give details about the deep networks including its structure and training method.

##### 3.1. Observation in Time and Frequency Domain and Low-Level Features Used in Deep Networks

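We assume that the receiving node has $P$ hydrophones and that the mixture includes MFSK signals from $N$ nodes. The mixture received by hydrophone $i$ can be obtained from

$$x_i(t) = \sum_{j=1}^{N} h_{ij}(t) * s_j(t), \tag{7}$$

where $h_{ij}(t)$ is the channel impulse response between hydrophone $i$ and node $j$, and $*$ denotes convolution. Then, by using the short-time Fourier transform (STFT), we can map the mixture into the time-frequency domain:

$$X_i(\tau, \omega) = \sum_{j=1}^{N} H_{ij}(\tau, \omega)\, S_j(\tau, \omega), \tag{8}$$

where $\tau$ and $\omega$ are the time frame and frequency bin indices, respectively, and $X_i$, $H_{ij}$, and $S_j$ denote the short-time Fourier transforms of $x_i$, $h_{ij}$, and $s_j$. We use $X_{ij}(\tau, \omega) = H_{ij}(\tau, \omega)\, S_j(\tau, \omega)$ to represent the component that node $j$ contributes to the mixture received by hydrophone $i$. Thus,

$$X_i(\tau, \omega) = \sum_{j=1}^{N} X_{ij}(\tau, \omega). \tag{9}$$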

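As shown in Section 2, the MFSK signal is approximately W-Disjoint Orthogonal when the bandwidth ratio is sufficiently low. Then, as shown in (10), the mixture $X_i$ received by hydrophone $i$ can be demixed by using the T-F mask $M_j$ corresponding to node $j$:

$$X_{ij}(\tau, \omega) = M_j(\tau, \omega)\, X_i(\tau, \omega), \tag{10}$$

where $X_{ij}$ is the component of node $j$ in that mixture.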

A natural choice is to use the orientation of the MFSK signals, observed with the pairwise hydrophones, to estimate the T-F mask of each node. Thus, the time-frequency mask is related to the location information of the current input signals (the channel impulse response and the array manifold).

For the other hydrophone outputs, the time-frequency mask corresponding to the same node remains the same. Once we obtain the time-frequency mask corresponding to node $j$, we can separate the single-user communication signal of node $j$ through the inverse STFT (ISTFT). The mixture received by the array includes single-user signals from $N$ nodes at different locations. The probability that a single T-F point belongs to a certain node, such as node $j$, can be described by a Gaussian distribution with mean $\mu_j$ and variance $\sigma_j^2$. Here, $\mu_j$ and $\sigma_j^2$ can be interpreted as the mean and variance of the direction of arrival (DOA) of the signal coming from node $j$, respectively. According to the central limit theorem, for all T-F points, the probability distribution of a mixture containing $N$ node signals can be described by a mixture of Gaussian distributions, namely, a Gaussian mixture model (GMM). Therefore, we can describe the mixture pattern of the signals through a GMM based on spatial features. The parameters of the GMM can be obtained from existing observation datasets. Thus the probability of each T-F point belonging to node 1, node 2, ..., and node $N$ can be estimated. The source signals can then be recovered through the ISTFT.
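As a minimal sketch of the GMM idea above, the posterior probability that a T-F point with an observed DOA belongs to each node can be computed from per-node Gaussian components. The mixture weights, means, and variances below are hypothetical values chosen for illustration, and `node_posteriors` is an illustrative helper rather than part of the proposed system.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def node_posteriors(doa, weights, means, variances):
    """Posterior probability of each node for one T-F point with observed DOA (degrees)."""
    likes = [w * gaussian_pdf(doa, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(likes)
    return [like / total for like in likes]

# Two hypothetical nodes at -30 and +20 degrees, each with a 5 degree standard deviation.
post = node_posteriors(doa=-28.0, weights=[0.5, 0.5],
                       means=[-30.0, 20.0], variances=[25.0, 25.0])
```

A soft T-F mask for node $j$ would use the $j$th posterior at every T-F point; a binary mask would keep only the points where that posterior is the largest.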

##### 3.2. Outline of the Source Separation System

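As shown in Figure 5, the inputs to the system are the two-channel MFSK mixtures. We perform the short-time Fourier transform (STFT) on each channel and obtain the T-F representations of the input signals, $X_1(\tau, \omega)$ and $X_2(\tau, \omega)$, where $\tau$ and $\omega$ are the time frame and frequency bin indices, respectively. The low-level features, that is, the mixing vector (MV) and the interaural level and phase differences (ILD/IPD), which can be derived from (12)–(13), are then estimated at each T-F unit. Consider

$$\mathbf{z}(\tau, \omega) = \frac{\mathbf{W}\hat{\mathbf{x}}(\tau, \omega)}{\|\mathbf{W}\hat{\mathbf{x}}(\tau, \omega)\|_F}, \qquad \hat{\mathbf{x}}(\tau, \omega) = \frac{[X_1(\tau, \omega), X_2(\tau, \omega)]^T}{\|[X_1(\tau, \omega), X_2(\tau, \omega)]^T\|_F}, \tag{12}$$

where $\mathbf{W}$ is a whitening matrix, with each row being one eigenvector of $E[\hat{\mathbf{x}}\hat{\mathbf{x}}^H]$, the superscript $H$ is the Hermitian transpose, and $\|\cdot\|_F$ is the Frobenius norm:

$$\alpha(\tau, \omega) = 20 \log_{10} \left| \frac{X_1(\tau, \omega)}{X_2(\tau, \omega)} \right|, \qquad \phi(\tau, \omega) = \angle \frac{X_1(\tau, \omega)}{X_2(\tau, \omega)}, \tag{13}$$

where $|\cdot|$ takes the absolute value of its argument and $\angle$ finds the phase angle.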
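The level and phase difference features can be sketched for a single T-F unit as follows. This assumes the common log-magnitude-ratio and phase-of-ratio definitions; the function name `ild_ipd` and the small constant `eps`, added only to avoid division by zero, are illustrative assumptions.

```python
import cmath
import math

def ild_ipd(x_first, x_second, eps=1e-12):
    """Level difference (dB) and phase difference (rad) between two channels at one T-F unit."""
    ratio = x_first / (x_second + eps)
    ild_db = 20.0 * math.log10(abs(ratio) + eps)
    ipd_rad = cmath.phase(ratio)   # wrapped to (-pi, pi]
    return ild_db, ipd_rad

# Hypothetical STFT coefficients: channel 1 has twice the magnitude and leads by 0.2 rad.
ild, ipd = ild_ipd(2.0 * cmath.exp(1j * 0.3), 1.0 * cmath.exp(1j * 0.1))
```

Because the phase difference is wrapped, it becomes ambiguous at frequencies where the inter-hydrophone delay exceeds half a period, which is one reason to combine it with the other features.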

Next, we group the low-level features into blocks (only along the frequency bins ). The block includes frequency bins , where . We build deep networks with each corresponding to one block and use them to estimate the direction of arrivals (DOAs) of the sources. The low-level features as the input of the deep networks are composed by the IPD, ILD and MV; that is, . Through unsupervised learning and the sparse autoencoder [18] in deep networks, high-level features (coded positional information of the sources) are extracted and used as inputs for the output layer (i.e., the softmax regression) of the networks. The output of softmax regression is a source occupation probability (i.e., the time-frequency mask) of each block (through the ungroup operation, T-F units in the same block are assigned with the same source occupation probability) of the mixtures. Then the sources can be recovered applying the T-F mask to the mixtures followed by the inverse STFT (ISTFT). The deep networks are pretrained by using a greedy layer-wise training method.

##### 3.3. The Deep Networks

As described at the beginning of this section, we group the low-level features into blocks and build individual deep networks, all with the same architecture, to classify the DOA of the current input T-F point in each block. The deep network used to estimate the T-F mask is composed of a two-layer deep autoencoder and a one-layer softmax classifier.

More specifically, we split the whole space to ranges with respect to the hydrophones and separate the target and interferers based on different orientation ranges (DOAs with respect to the receiver node) where they are located. We apply the softmax classifier to perform the classification task, and the inputs to the classifier, that is, the high-level features , are produced by the deep autoencoder. Assuming that the position of the target in the current input T-F point remains unchanged, the deep network estimates the probability of the orientation of the current input sample belonging to the orientation index . With the estimated orientation (obtained by selecting the maximum probability index) of each input T-F point, we cluster the T-F points which have the same orientation index to get the probability mask and obtain the T-F mask from the probability mask through the ungroup operation. Note that each T-F point in the same block is assigned the same probability. The number of sources can also be estimated from the probability mask by using a predefined probability threshold, typically chosen as 0.2 in our experiments.
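The ungroup operation and the threshold-based source count can be sketched as below. This is one possible reading of the description above, not the authors' implementation: the count is taken over the average probability of each orientation index, and the helper names and example probabilities are hypothetical. Only the threshold 0.2 comes from the text.

```python
def ungroup(block_probs, bins_per_block):
    """Assign every frequency bin in a block the probability estimated for that block."""
    return [p for p in block_probs for _ in range(bins_per_block)]

def count_sources(orientation_probs, threshold=0.2):
    """Count orientation indices whose average probability exceeds the threshold."""
    n_units = len(orientation_probs)
    n_orient = len(orientation_probs[0])
    avg = [sum(row[k] for row in orientation_probs) / n_units for k in range(n_orient)]
    return sum(1 for a in avg if a > threshold)

# Hypothetical softmax outputs for four T-F units over three orientation indices.
probs = [[0.90, 0.05, 0.05],
         [0.10, 0.05, 0.85],
         [0.80, 0.10, 0.10],
         [0.05, 0.05, 0.90]]
n_sources = count_sources(probs)
bin_probs = ungroup([0.9, 0.1], bins_per_block=3)
```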

###### 3.3.1. Deep Autoencoder

An autoencoder is an unsupervised learning algorithm based on backpropagation. It aims to learn an approximation $\hat{\mathbf{x}}$ of the input $\mathbf{x}$. It appears to be learning a trivial identity function; but by imposing constraints on the learning process, such as limiting the number of neurons activated, it discloses interesting structure in the data. Figure 6 shows the architecture of a single-layer autoencoder. The difference between a classic neural network and an autoencoder is the training objective. The objective of a classic neural network is to minimize the difference between the labels of the input training data and the output of the network, whereas the objective of an autoencoder is to minimize the difference between the input training data and the output of the network. As shown in Figure 6, the output of the autoencoder can be defined as $\hat{\mathbf{x}} = f(\mathbf{W}^{(2)}\mathbf{a} + \mathbf{b}^{(2)})$ with $\mathbf{a} = f(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)})$, where the function $f(z) = 1/(1 + e^{-z})$ is the logistic function, $\mathbf{W}^{(1)} \in \mathbb{R}^{n_h \times n_x}$, $\mathbf{b}^{(1)} \in \mathbb{R}^{n_h}$, $\mathbf{W}^{(2)} \in \mathbb{R}^{n_x \times n_h}$, and $\mathbf{b}^{(2)} \in \mathbb{R}^{n_x}$; $n_h$ is the number of hidden layer neurons, and $n_x$ is the number of input layer neurons, which is the same as that of the output layer neurons. $\mathbf{W}^{(1)}$ is a matrix containing the weights of the connections between the input layer neurons and the hidden layer neurons. Similarly, $\mathbf{W}^{(2)}$ contains the weights of the connections between the hidden layer neurons and the output layer neurons. $\mathbf{b}^{(1)}$ is a vector of the bias values added to the hidden layer neurons, and $\mathbf{b}^{(2)}$ is the corresponding vector for the output layer neurons. $(\mathbf{W}, \mathbf{b})$ refers to the parameter set composed of the weights and biases. A neuron is “active” when its output is close to 1, which means that $f(\mathbf{w}_v^{(1)}\mathbf{x} + b_v^{(1)}) \approx 1$. For “inactive” neurons, the output is close to 0, which means that $f(\mathbf{w}_v^{(1)}\mathbf{x} + b_v^{(1)}) \approx 0$, where $\mathbf{w}_v^{(1)}$ denotes the weights of the connections between hidden layer neuron $v$ and the input layer neurons, which is the $v$th row of the matrix $\mathbf{W}^{(1)}$, and $b_v^{(1)}$ is the $v$th element of the vector $\mathbf{b}^{(1)}$, which is the bias value added to hidden layer neuron $v$. The superscript $(l)$ of $\mathbf{W}^{(l)}$, $\mathbf{w}_v^{(l)}$, and $\mathbf{b}^{(l)}$ denotes the $l$th layer of the deep network.
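A minimal sketch of the forward pass of a single-layer autoencoder with logistic activations is given below. The layer sizes, the random initialization, and the names `autoencoder_forward`, `W1`, `b1`, `W2`, and `b2` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W1, b1, W2, b2):
    """Return the hidden activations and the reconstruction of the input."""
    a = sigmoid(W1 @ x + b1)       # hidden layer activations
    x_hat = sigmoid(W2 @ a + b2)   # output layer: approximation of the input
    return a, x_hat

# Hypothetical sizes: 6 input/output neurons and 3 hidden neurons.
rng = np.random.default_rng(0)
n_in, n_hidden = 6, 3
W1 = 0.1 * rng.standard_normal((n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_in, n_hidden))
b2 = np.zeros(n_in)
x = rng.random(n_in)
a, x_hat = autoencoder_forward(x, W1, b1, W2, b2)
recon_error = 0.5 * np.sum((x_hat - x) ** 2)   # quantity the autoencoder training reduces
```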

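With the sparsity constraint, most of the neurons are assumed to be inactive. More specifically, $a_v(\mathbf{x})$ denotes the activation value of hidden layer unit $v$ in the autoencoder for the input $\mathbf{x}$. The average activation of unit $v$ over the input samples can be defined as follows:

$$\hat{\rho}_v = \frac{1}{m} \sum_{i=1}^{m} a_v\big(\mathbf{x}^{(i)}\big),$$

where $m$ is the number of training samples and $\mathbf{x}^{(i)}$ is the $i$th input training sample. Next, the sparsity constraint $\hat{\rho}_v = \rho$ is enforced, where $\rho$ is the sparsity parameter preset before training, typically a small value. To achieve the sparsity constraint, we use the following penalty term in the cost function of the sparse autoencoder:

$$\sum_{v=1}^{n_h} \mathrm{KL}(\rho \,\|\, \hat{\rho}_v) = \sum_{v=1}^{n_h} \left[ \rho \log \frac{\rho}{\hat{\rho}_v} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_v} \right],$$

where $n_h$ is the number of hidden layer neurons. The penalty term is essentially a Kullback-Leibler (KL) divergence. The cost function of the sparse autoencoder can now be written as follows:

$$J_{\text{sparse}}(\mathbf{W}, \mathbf{b}) = J(\mathbf{W}, \mathbf{b}) + \beta \sum_{v=1}^{n_h} \mathrm{KL}(\rho \,\|\, \hat{\rho}_v),$$

where $J(\mathbf{W}, \mathbf{b})$ is the reconstruction cost and $\beta$ controls the weight of the penalty term. In our proposed system, the cost function is minimized using the limited-memory BFGS (L-BFGS) optimization algorithm, with the gradients of the single-layer sparse autoencoder computed by backpropagation.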
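The KL-divergence sparsity penalty can be sketched in a few lines. The target value `rho = 0.05` and the example average activations are assumptions chosen for illustration; the text only states that the target is small.

```python
import math

def kl_sparsity_penalty(rho, rho_hat):
    """Sum over hidden units of KL(rho || rho_hat_v) between Bernoulli distributions."""
    return sum(
        rho * math.log(rho / r) + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - r))
        for r in rho_hat
    )

rho = 0.05                                               # assumed sparsity target
penalty_sparse = kl_sparsity_penalty(rho, [0.05, 0.05])  # units already at the target
penalty_dense = kl_sparsity_penalty(rho, [0.50, 0.40])   # units that are active too often
```

The penalty is zero when every average activation equals the target and grows as the activations move away from it, which is what pushes most hidden units toward the inactive state.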

After training the single-layer sparse autoencoder, we discard the output layer neurons together with their weights $\mathbf{W}^{(2)}$ and bias $\mathbf{b}^{(2)}$ and keep only the input layer neurons with the hidden layer weights $\mathbf{W}^{(1)}$ and bias $\mathbf{b}^{(1)}$. The output of the hidden layer is used as the input samples of the next single-layer sparse autoencoder. By repeating these steps, that is, by stacking the autoencoders, we can build a deep autoencoder from two or more single-layer sparse autoencoders. In our proposed system, we use two single-layer autoencoders to build the deep autoencoder. The stacking procedure is shown in the right part of Figure 7.

Many studies on deep autoencoders show that, with a deep architecture (more than one hidden layer), a deep autoencoder can build more complex representations from simple low-level features, capture the underlying regularities of the data, and improve recognition quality. That is why we use a deep autoencoder in our proposed system.

There are, however, several difficulties associated with obtaining the optimized weights of deep autoencoders. One challenge is the presence of local optima. In particular, training a neural network using supervised learning involves solving a highly nonconvex optimization problem, that is, finding a set of network parameters to minimize the training error . In the deep autoencoder, the optimization problem with bad local optima turns out to be rife, and training with gradient descent no longer works well. Another challenge is the “diffusion of gradients.” When using backpropagation to compute the derivatives, the gradients that are propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earlier layers is very small. Thus, when using gradient descent, the weights of the earlier layers change slowly. However, if the initializing parameter is already close to the optimized values, the gradient descent works well. That is the idea of “greedy layer-wise” training, where the layers of the networks are trained one by one, as shown in the left part of Figure 7.

First, we use the backpropagation algorithm to train the first sparse autoencoder (only including one hidden layer), with the data label being the inputs. In our proposed system, the input data is the -dimensional feature vector . As a result of the first-layer training, we get a set of network parameters (i.e., the parameters and ) of the first-layer sparse autoencoder, and a new dataset (i.e., the features I shown in Figure 9) which is the output of the hidden layer neurons (the activity state of the hidden layer neurons) by using this parameter set. Next, we use as the inputs to the second sparse autoencoder. After the second autoencoder is trained, we can get the network parameters of the second sparse autoencoder and the new dataset (i.e., the feature II in Figure 9) for the training of next single layer neural network. We then repeat the above steps until the last layer (i.e., the softmax regression in our proposed system). Finally, we obtain a pretrained deep autoencoder by stacking all the autoencoders and use and as the initialized parameters for this pretrained deep autoencoder. The feature II is the high-level feature and can be used as the training dataset for softmax regression discussed next.
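The greedy layer-wise procedure can be sketched as follows. To keep the example short it departs from the system described above in ways that should be read as assumptions: each layer is trained with plain gradient descent instead of L-BFGS, the sparsity penalty is omitted, the softmax layer is left out, and the data are random. The function names and layer sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.5, epochs=200, seed=0):
    """Train one single-layer autoencoder on X (features x samples); return encoder weights."""
    rng = np.random.default_rng(seed)
    n_in, m = X.shape
    W1 = 0.1 * rng.standard_normal((n_hidden, n_in))
    b1 = np.zeros((n_hidden, 1))
    W2 = 0.1 * rng.standard_normal((n_in, n_hidden))
    b2 = np.zeros((n_in, 1))
    for _ in range(epochs):
        A = sigmoid(W1 @ X + b1)               # hidden activations
        Y = sigmoid(W2 @ A + b2)               # reconstruction of the input
        d2 = (Y - X) * Y * (1.0 - Y)           # output layer error term
        d1 = (W2.T @ d2) * A * (1.0 - A)       # error backpropagated to the hidden layer
        W2 -= lr * (d2 @ A.T) / m
        b2 -= lr * d2.mean(axis=1, keepdims=True)
        W1 -= lr * (d1 @ X.T) / m
        b1 -= lr * d1.mean(axis=1, keepdims=True)
    return W1, b1                              # decoder (W2, b2) is discarded

def greedy_pretrain(X, layer_sizes):
    """Train the layers one by one, feeding each layer's hidden output to the next."""
    params, features = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(features, n_hidden)
        params.append((W, b))
        features = sigmoid(W @ features + b)   # becomes the next layer's training data
    return params, features

X = np.random.default_rng(1).random((8, 50))   # 8-dimensional features, 50 samples
params, high_level = greedy_pretrain(X, layer_sizes=[5, 3])
```

Here `high_level` plays the role of feature II: it would serve as the training input of the softmax layer, and `params` would initialize the deep network before fine-tuning.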

###### 3.3.2. Softmax Classifier

In our proposed system, the softmax classifier, based on softmax regression, estimates the probability that the current input T-F point belongs to each orientation index, taking the high-level features extracted by the deep autoencoder as inputs.

Softmax regression generalizes classical logistic regression (for binary classification) to multiclass classification problems. Unlike in logistic regression, the data label in softmax regression is an integer between $1$ and $K$, where $K$ is the number of data classes. More specifically, in our proposed system, for $K$ classes, a dataset of $m$ samples was used to train the deep network of each block:

$$\big\{ (\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(m)}, y^{(m)}) \big\},$$

where $y^{(i)}$ is the label of the $i$th sample $\mathbf{x}^{(i)}$ and is set to $k$ if $\mathbf{x}^{(i)}$ belongs to class $k$. The architecture of the softmax classifier is shown in Figure 8.

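Given an input $\mathbf{x}$, the output of the $k$th output layer neuron gives the probability of the input belonging to class $k$. Similar to logistic regression, the output of softmax regression can be written as follows:

$$p(y = k \mid \mathbf{x}; \boldsymbol{\theta}) = \frac{\exp(\boldsymbol{\theta}_k^T \mathbf{x})}{\sum_{l=1}^{K} \exp(\boldsymbol{\theta}_l^T \mathbf{x})}, \qquad k = 1, \ldots, K.$$

Here $\boldsymbol{\theta} \in \mathbb{R}^{K \times D}$ is the network parameter of the softmax ($D$ is the dimension of the input $\mathbf{x}$, and $K$ is the number of classes within the input data). $\boldsymbol{\theta}_k$ represents the transpose of the $k$th row of $\boldsymbol{\theta}$ and contains the parameters of the connections between neuron $k$ of the output layer and the input sample $\mathbf{x}$.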
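The softmax output can be sketched as below. Subtracting the largest score before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged; the parameter matrix and input are hypothetical values.

```python
import math

def softmax_probs(theta, x):
    """Class probabilities for input x; each row of theta holds one class's parameters."""
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    peak = max(scores)                          # subtract the maximum to avoid overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical three-class problem with two-dimensional inputs.
theta = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = softmax_probs(theta, [2.0, 0.5])
best_class = max(range(len(probs)), key=probs.__getitem__)
```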

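Softmax regression and logistic regression have the same minimization objective. The cost function of the softmax classifier can be generalized from that of logistic regression and written as follows:

$$J(\boldsymbol{\theta}) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log \frac{\exp(\boldsymbol{\theta}_k^T \mathbf{x}^{(i)})}{\sum_{l=1}^{K} \exp(\boldsymbol{\theta}_l^T \mathbf{x}^{(i)})} + \frac{\lambda}{2} \sum_{k=1}^{K} \sum_{d=1}^{D} \theta_{kd}^2,$$

where $m$ is the number of training samples, $K$ is the number of classes, $D$ is the input dimension, and $1\{\cdot\}$ is the indicator function, so that $1\{\text{a true statement}\} = 1$ and $1\{\text{a false statement}\} = 0$. The element $\theta_{kd}$ of $\boldsymbol{\theta}$ contains the parameter of the connection between neuron $k$ of the output layer and neuron $d$ of the input layer. $\frac{\lambda}{2} \sum_{k} \sum_{d} \theta_{kd}^2$ is a regularization term that makes the cost function strictly convex, where $\lambda$ is a weight decay parameter predefined by the user.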

In our proposed system, we extend the label of the training dataset to a vector and set the th element of to 1 and other elements to 0 when the input sample belongs to class , such as . The cost function can be written as follows by using vectorization:

The softmax classifier can be trained by using the L-BFGS algorithm based on a dataset, in order to find an optimal parameter set for minimizing the cost function . In our proposed system, the dataset for softmax classifier training is composed by two parts. The first part is the input sample, (feature II), calculated from the last hidden layer of the deep autoencoder. The second part is the data label, , where the th element of the will be set to 1 when the input sample belongs to the source located in the range of DOAs of index .
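A minimal sketch of the regularized softmax cost is shown below, assuming the usual cross-entropy form with an L2 weight decay term. Labels are 0-based class indices here rather than one-hot vectors, which is an implementation convenience, and the samples are hypothetical. An optimizer such as L-BFGS would minimize this value with respect to `theta`.

```python
import math

def softmax_cost(theta, samples, labels, weight_decay):
    """Average negative log-likelihood plus (weight_decay / 2) * sum of squared parameters."""
    data_term = 0.0
    for x, y in zip(samples, labels):
        scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        data_term -= scores[y] - log_norm       # minus log-probability of the true class
    decay = 0.5 * weight_decay * sum(t * t for row in theta for t in row)
    return data_term / len(samples) + decay

# With all-zero parameters every class is equally likely, so the cost equals log(K).
samples = [[0.2, 0.7], [0.9, 0.1], [0.4, 0.4]]
labels = [0, 2, 1]
cost_zero = softmax_cost([[0.0, 0.0]] * 3, samples, labels, weight_decay=1e-4)
cost_decay = softmax_cost([[1.0, 0.0]] * 3, samples, labels, weight_decay=2.0)
```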

After this training is completed, we stack the softmax classifier and the deep autoencoder together, as shown in the left part of Figure 9. Finally, we use the training dataset and the L-BFGS algorithm to fine-tune the deep network, starting from the initialized parameters obtained from the training of the sparse autoencoders and the softmax classifier. The training of the sparse autoencoders and the softmax classifier is called the pretraining phase, and the stacking and training of the overall network, that is, the deep network, is called the fine-tuning phase. In the pretraining phase, the shallow neural networks, that is, the sparse autoencoders and the softmax classifier, are trained individually, using the output of the current layer as the input of the next layer. In the fine-tuning phase, we use the L-BFGS algorithm (a gradient-based quasi-Newton method) to minimize the difference between the output of the deep network and the labels of the training dataset. Gradient-based optimization works well here because the initialized parameters obtained from the pretraining phase already carry a significant amount of “prior” information about the input data, acquired through unsupervised learning.

#### 4. Experiments

In this section, we describe the communication system used in the simulation and show the separation results for different SNRs and data rates. Bit error rate (BER) is a key performance index for a practical communication system; in this section, therefore, we evaluate the separation quality by BER and take the BER performance of single-user MFSK communication under the same conditions as the baseline.

There is no need to take the multipath propagation of channels into consideration because of the relatively close distance among each node, in view of the communication of multi-UUV. We thus adopt the additive white Gaussian noise channel model and Monte Carlo method to calculate the system BER in the simulation, such as the simulation system shown in Figure 10, and we also take the BER of the single user communication as the baseline. The sampling rate of our simulation is kHz, and the order of the modulation is , without any error-correction coding. The bandwidth of the MFSK modulation/demodulation is Hz, and the bandwidth ratio varies from 0.625 to 0.125, corresponding to the bit rate to .
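The single-user baseline can be sketched with a symbol-level Monte Carlo simulation of noncoherent orthogonal MFSK over an AWGN channel. Everything specific here is an assumption for illustration: the modulation order 4, the SNR defined as symbol energy over noise density, the natural binary mapping of symbols to bits, and the function name `mfsk_ber`. The simulation works on matched-filter outputs rather than waveforms, so it does not model bandwidth ratio.

```python
import numpy as np

def mfsk_ber(snr_db, order=4, n_symbols=20000, seed=0):
    """Monte Carlo BER of noncoherent orthogonal MFSK over AWGN (Es/N0 in dB)."""
    rng = np.random.default_rng(seed)
    bits_per_symbol = int(np.log2(order))
    es_n0 = 10.0 ** (snr_db / 10.0)
    tx = rng.integers(0, order, n_symbols)
    # One matched-filter branch per tone, each with unit-variance complex Gaussian noise.
    r = (rng.standard_normal((n_symbols, order))
         + 1j * rng.standard_normal((n_symbols, order))) / np.sqrt(2.0)
    # The transmitted tone adds a signal of unknown phase to its own branch.
    phase = rng.uniform(0.0, 2.0 * np.pi, n_symbols)
    r[np.arange(n_symbols), tx] += np.sqrt(es_n0) * np.exp(1j * phase)
    rx = np.argmax(np.abs(r), axis=1)           # envelope (noncoherent) detection
    diff = tx ^ rx
    bit_errors = sum(int(np.sum((diff >> b) & 1)) for b in range(bits_per_symbol))
    return bit_errors / (n_symbols * bits_per_symbol)

ber_low_snr = mfsk_ber(0.0)
ber_high_snr = mfsk_ber(12.0)
```

To evaluate a separation system in the same spirit, the demodulator input would be the signal recovered after T-F masking instead of the clean single-user signal.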

First, we suppose that the optimal T-F mask is available, that is, we use (6) to calculate the T-F mask, and we obtain the simulated BER as a function of SNR and bandwidth ratio and compare it with the baseline.

As shown in Tables 1 and 2, it can be seen that, from the simulation result, the BER similar with baseline can be attained by T-F mask method when . Furthermore, the BER with T-F mask would be even lower than that of baseline under the same condition when SNR is very low; that is, . The reason behind the above phenomenon is that, it can be seen from (6) that some frequency points are adjusted to zero by the T-F mask, which promotes the SNR of signal objectively and obtains lower BER. In the actual system, however, it is a highly challenging task to accurately estimate the T-F mask under such low SNR, which can also be verified in later simulation.

According to Section 3, we estimate the T-F mask by using the orientation information of the nodes, that is, the DOA of the MFSK signal received by the nodes. Therefore, we introduce a time delay to simulate the pairwise hydrophone receiver in the simulation and divide the space into 37 blocks along the horizontal direction, corresponding to directions from −90° to +90° with a step size of 5°. We estimate the T-F mask with the source separation system described in Section 3 and compare its BER performance with that of the baseline under different SNRs and bandwidth ratios; the simulation results are shown in Table 3.
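The orientation grid and the delay used to emulate a pairwise hydrophone receiver can be sketched as follows. The far-field plane-wave delay model, the hydrophone spacing of 0.15 m, and the helper names are assumptions for illustration; the 37 directions from −90 to +90 in steps of 5 and the sound speed of 1500 m/s come from the text.

```python
import math

SOUND_SPEED = 1500.0  # m/s, propagation speed quoted in the paper

def doa_grid(start=-90, stop=90, step=5):
    """Candidate directions of arrival in degrees."""
    return list(range(start, stop + 1, step))

def inter_hydrophone_delay(doa_deg, spacing_m, c=SOUND_SPEED):
    """Far-field delay (seconds) between two hydrophones for a plane wave from doa_deg."""
    return spacing_m * math.sin(math.radians(doa_deg)) / c

def doa_to_index(doa_deg, start=-90, step=5):
    """Index of the grid direction closest to doa_deg."""
    return round((doa_deg - start) / step)

grid = doa_grid()
delays = [inter_hydrophone_delay(d, spacing_m=0.15) for d in grid]  # assumed spacing
```

Each grid index can serve as the class label of the softmax layer, and the delay for a node's direction is what would be applied to the second channel when synthesizing the two-channel mixture.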

As shown in Table 3, it is observed that the BER performance of the proposed system is much the same as the baseline when and , which is consistent with the result when using the optimal T-F mask. When , however, the BER performance of the proposed system begins to decline for a big error of T-F mask estimation made by the lower SNR of the signals, which results in the system performance degradation.

#### 5. Summary

In this paper, we address the problem of control packet collisions, which is widespread in multichannel MAC protocols. On the basis of the sparsity of the noncoherent MFSK modulation in the time-frequency domain, we separate the sources from the MFSK mixture caused by packet collisions by means of T-F masking. First, we characterize the sparsity of the MFSK signal by its bandwidth ratio and demonstrate by simulation the relation between the bandwidth ratio and PSR, SIR, and WDO. Then, we establish a source separation system based on deep networks and a model of the MFSK communication system and, taking single-user MFSK communication as the baseline, compare the BER performance of the proposed system with the baseline under different SNRs and bandwidth ratios. The simulation results show that, first, the optimal T-F mask can obtain the same BER performance as the baseline at lower bandwidth ratios; second, the proposed system can obtain BER performance similar to the baseline at higher SNRs; third, the BER performance of the proposed system declines rapidly at lower SNRs, because a lower SNR leads to a greater error in the estimation of the T-F mask. In future work, we will adjust the structure of the deep networks to improve the performance of the proposed system under low SNR and under the multipath propagation present in the underwater channel. As a further research topic, it is also worth exploring whether bioinspired computing models and algorithms can be used for underwater multichannel MAC protocols, such as P systems (inspired by the structure and functioning of cells) [19, 20] and evolutionary computation (motivated by Darwin's theory of evolution) [21, 22].

#### Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

#### Acknowledgments

This research was supported partially by the Natural Science Basis Research Plan in Shaanxi Province of China (Program no. 2014JQ8355).