Abstract

Automatic modulation recognition plays an important role in many military and civilian applications, including cognitive radio, spectrum sensing, signal surveillance, and interference identification. Owing to its powerful ability to extract hidden features and perform classification, deep learning can learn highly separable features from massive signal samples. Considering the condition of limited training samples, we propose a semi-supervised learning framework based on Haar time–frequency (HTF) mask data augmentation and the positional–spatial attention (PSA) mechanism. Specifically, the HTF mask is designed to increase data diversity, and the PSA is designed to compensate for the limited receptive field of the convolutional layer and enhance the feature extraction capability of the constructed network. Extensive experimental results obtained on the public RML2016.10a dataset show that the proposed semi-supervised framework utilizes only 1% of the given labeled data and reaches a recognition accuracy of 92.09% under 6 dB signals.

1. Introduction

Automatic modulation recognition (AMR) can detect the modulation type of a received signal automatically without prior knowledge. It plays a pivotal role in civilian and military applications, such as cognitive radio, signal recognition, spectrum awareness, and electronic warfare. With the increasing number of users, limited spectrum resources make it difficult to meet dynamic needs such as those of 5G [1]. This makes AMR a highly challenging task [2].

Typically, traditional AMR methods can be divided into two categories: likelihood-based (LB) methods and feature-based (FB) methods [3]. LB methods [4–6] need prior knowledge and suffer from high computational complexity. FB methods [7–9] rely heavily on manual analysis when performing feature selection. Finding distinguishing features among multiple modulation types can be challenging [10].

Recently, inspired by the excellent achievements of deep learning (DL) [11–13], many researchers [14–17] have explored utilizing DL to achieve improved AMR performance. Hong et al. [16] utilized a recurrent neural network (RNN) to extract temporal features automatically, thus reducing the dependency on manual analysis. Yashashwi et al. [17] designed a learnable module that improves signal classification accuracy by correcting frequency offset and phase noise. These supervised learning (SL) methods require extensive labeled data and are prone to overfitting. However, the availability of high-quality labeled data is limited in practical AMR tasks due to the challenges and costs associated with its collection. Furthermore, the performance of SL-based AMR methods may be heavily affected by inaccurate or incomplete labels.

Therefore, some researchers [18, 19] have applied semi-supervised learning (SSL) to address AMR tasks. However, these SSL methods may suffer at low signal-to-noise ratio (SNR) and encounter difficulties in challenging environments such as cognitive radio. To tackle the challenge of low SNR, this paper introduces a novel SSL framework for AMR called HTF-PSA-SSL. The framework leverages a Haar time–frequency (HTF) mask and a positional–spatial attention (PSA) mechanism to enhance modulation recognition accuracy while minimizing the reliance on labeled data. In the first step, the 1D raw IQ signals undergo preprocessing by applying the discrete short-time Fourier transform (STFT). This transformation converts the signals into 2D STFT spectrograms, enabling a more comprehensive understanding of the signal's characteristics. Subsequently, we adopt the well-known SSL mean teacher (MT) model [20] as the main framework in our approach, enabling us to effectively utilize a larger quantity of unlabeled data alongside the labeled data. Furthermore, we propose an HTF mask that enhances the utilization of unlabeled data by generating augmented samples and mitigating the risk of overfitting. Moreover, to enhance the strip-shaped features of the signal, a PSA is added after each convolutional layer to help the network focus on crucial signal regions in the time and frequency domains. Finally, to further verify the superiority of the proposed framework, we evaluate the performance of HTF-PSA-SSL on three public datasets, namely, RML2016.10a, RML2016.10b, and RML2016.04c [14]. The evaluation results show that HTF-PSA-SSL not only efficiently utilizes unlabeled data to improve its recognition performance but also achieves beneficial robustness.

The principal contributions of this paper can be summarized as follows:

(1) To address the problem of insufficient labeled signals, we propose an SSL framework, HTF-PSA-SSL, which can effectively improve modulation recognition accuracy using only a small amount of labeled data. Under 6 dB, HTF-PSA-SSL utilizes 1% of the labeled data and reaches an accuracy of 92.09%. Extensive experimental results obtained on the public datasets RML2016.10a, RML2016.10b, and RML2016.04c show that HTF-PSA-SSL also exhibits strong stability and robustness.

(2) We propose the HTF mask data augmentation method and the PSA mechanism to jointly enhance the performance of HTF-PSA-SSL from the data and network perspectives. The HTF mask works on unlabeled data, introducing data perturbations and helping HTF-PSA-SSL adapt to different inputs. The PSA filters strip-shaped features from the time domain and the frequency domain, respectively, which helps the network extract key information from the STFT spectrogram and remove redundant information. Experimental results show that the HTF mask and PSA achieve a peak accuracy of 93.18% under 16 dB. The ablation experiments further prove the superiority of the HTF mask and PSA in enhancing the performance of HTF-PSA-SSL.

The structure of this paper is organized as follows. In Section 2, the related works are described. Section 3 introduces the signal model. The detailed design and implementation of HTF-PSA-SSL are described in Section 4, and the evaluation results are presented in Section 5. Finally, this paper is concluded by summarizing the proposed work in Section 6.

2. Related Works

2.1. SSL-Based AMR Methods

DL has been widely applied across multiple fields. Li et al. [21] proposed a DL-based remaining useful life (RUL) prediction method to address the sensor malfunction problem by exploring global and shared features. Zhang et al. [22] designed a blockchain-based decentralized federated transfer learning method to further address data security and privacy problems. In AMR tasks, many SL-based methods [23–30] have achieved significant success. However, the classification performance of these models relies on copious amounts of labeled data, while the amount of available labeled data is limited.

To handle this problem, O'Shea et al. [18] trained an encoder to reconstruct signals, which first demonstrated that SSL methods could be applied to AMR tasks. Dong et al. [19] proposed a semi-supervised signal recognition convolutional neural network (SSRCNN) to make the network more robust by using a Kullback–Leibler (KL) divergence loss. Furthermore, Luo et al. [31] introduced a deep cotraining method that constructs different sample views by using two CLDNNs [32] with different long short-term memory (LSTM) units, achieving better classification accuracy. Liu et al. [33] built a semi-supervised automatic modulation classification framework (SemiAMC) to efficiently extract features from unlabeled signals. Li et al. [34] introduced generative adversarial networks (GANs) to achieve high recognition accuracy in AMR tasks. Moreover, Li et al. [35] designed a spatial signal transform module that improves the training stability of the whole SSL framework. Kim et al. [36] proposed a denoising autoencoder-based relation network that can effectively extract information from limited labeled signals. However, these SSL methods suffer from low SNR, model instability, or high complexity.

Motivated by this, we introduce the well-known MT SSL model as our main framework and propose an HTF-mask augmentation method to enhance model stability. Furthermore, an attention module named PSA is designed to improve the performance of the model even at low SNR.

2.2. Data Augmentation Methods

Several traditional augmentation methods have been applied in AMR tasks. O'Shea et al. [18] generated signals by adding different Gaussian noise to the I/Q channels. Liu et al. [33] augmented unlabeled signals by rotating the given IQ signals by an angle randomly selected from {0°, 90°, 180°, 270°}. Furthermore, Luo et al. [31] designed an augmentation method that exchanges the two channels of the IQ signals. These methods act on the signal directly. However, in scenarios with low SNR, they may result in model instability.

To further enhance the quality of augmented signals, several augmentation methods from computer vision have been explored. Zhang et al. [37] proposed the Mixup augmentation method to extend the training distribution via linear feature interpolation, which addresses the poor performance observed on adversarial examples. However, it suffers from unnaturally mixed samples. Yun et al. [38] presented another method named CutMix, which randomly cuts a patch from one sample and pastes it in the corresponding position of another sample. It enhances the generalization ability of the constructed model but is restricted to a fixed square mask shape. Harris et al. [39] designed Fmix, which mixes two different samples with a random binary mask obtained by applying a threshold to low-frequency images sampled from the Fourier space. However, the irregular shape of the mask may generate negative samples and degrade network performance when applied to STFT spectrograms.

Motivated by this, we propose a data augmentation method named the HTF mask. It augments the STFT spectrogram in both the time and frequency domains to enlarge the amount of unlabeled data and thus enhance the stability of the network.

2.3. Attention Mechanism Methods

An attention mechanism can tell a model where to focus, and it also enhances the representation of features. Hu et al. [40] proposed a squeeze-and-excitation (SE) module to efficiently build interdependencies between channels. Qin et al. [41] presented a multispectral channel attention framework (Fca) that assigns varying weights to different channels by producing different frequency components of the discrete cosine transform for each channel. However, these methods only focus on channel attention and lack spatial attention. Woo et al. [42] designed a convolutional block attention module (CBAM), which takes into account both channel and spatial attention. Linsley et al. [43] proposed the global-and-local attention (GALA) module, which integrates local saliency and global contextual signals to guide attention toward image regions. However, these methods cannot efficiently capture the strip-shaped features of the STFT spectrogram.

Motivated by this, we design an attention mechanism named PSA. It performs two 1D pooling operations along the horizontal and vertical axes of the feature map to enhance the strip-shaped features of the signal.

3. Signal Model

We consider a single-input single-output communication system, and the received signal can be represented by

$$ r(t) = h(t)\, s(t) + w(t), \tag{1} $$

where $s(t)$ denotes the modulated signal from the transmitter, $h(t)$ denotes the path loss or gain term on the signal, and $w(t)$ refers to additive white Gaussian noise (AWGN). Then, the received signal is sampled $N$ times to obtain a complex-valued discrete-time signal $r(n)$ with a length of $N$.
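To make the channel model concrete, the following minimal sketch simulates Equation (1) in Python for a flat (scalar) channel gain; the QPSK source and the scalar gain are illustrative assumptions, not the paper's simulation setup:

```python
import numpy as np

def receive(s: np.ndarray, snr_db: float, h: complex = 1.0) -> np.ndarray:
    """Apply a flat channel gain h and complex AWGN at the target SNR to s."""
    sig_power = np.mean(np.abs(h * s) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    # Complex AWGN: half the noise power in each of the I and Q components.
    w = np.sqrt(noise_power / 2) * (np.random.randn(len(s)) + 1j * np.random.randn(len(s)))
    return h * s + w

# Example: a random QPSK burst of N = 128 samples received at 6 dB SNR.
symbols = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.random.randint(0, 4, 128)))
r = receive(symbols, snr_db=6.0)
```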

The discrete Fourier transform (DFT) can only reflect the properties of the signal in the frequency domain and cannot analyze the signal in the time domain. Therefore, we transform the 1D signal into a 2D time–frequency spectrogram using the STFT to associate it with the time domain. The calculation formula for a given discrete signal $r(n)$ can be denoted as follows:

$$ \mathrm{STFT}\{r\}(m, k) = \sum_{n=0}^{N-1} r(n)\, g(n - m)\, e^{-j 2\pi k n / N}, \tag{2} $$

where $g(\cdot)$ is the window function, and the size of the Hanning window is set to $N/8$.
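As a concrete illustration of this preprocessing step, the sketch below computes a two-sided STFT magnitude spectrogram of a length-128 complex signal with a Hanning window of size N/8 using SciPy; the 50% overlap and the use of the magnitude are assumptions, as the paper's exact spectrogram settings are not recoverable here:

```python
import numpy as np
from scipy.signal import stft

N = 128                                    # samples per signal in RML2016.10a
r = np.random.randn(N) + 1j * np.random.randn(N)   # placeholder IQ signal
win_len = N // 8                           # Hanning window of size N/8
f, t, Z = stft(r, window="hann", nperseg=win_len, noverlap=win_len // 2,
               return_onesided=False)      # two-sided spectrum for complex input
spectrogram = np.abs(Z)                    # 2D magnitude spectrogram for the CNN
```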

4. Proposed Approach

4.1. Overview of HTF-PSA-SSL

HTF-PSA-SSL aims at constructing an SSL framework to effectively classify radio signal modulation types. Figure 1 provides an overview of our proposed HTF-PSA-SSL. The training set can be divided into two parts: a small set of labeled data $\mathcal{D}_l$ and a large set of unlabeled data $\mathcal{D}_u$, both consisting of STFT-domain data. To introduce model perturbations, MT models are built in our framework, where the student model $f_s$ is trained by the loss function and the teacher model $f_t$ is an exponential moving average (EMA) of $f_s$. For the small amount of labeled STFT spectrograms, we use the supervised cross-entropy loss $\mathcal{L}_s$ over labeled samples to constrain the learning direction of the network's gradient descent. For the large amount of unlabeled STFT spectrograms, we apply the HTF mask data augmentation on $\mathcal{D}_u$; specifically, we use a sample pair $(u_a, u_b)$ to obtain an augmented data pair $(\tilde{u}_a, \tilde{u}_b)$ (described in Section 4.2). Consequently, for the sample pair and its augmented data, we introduce a pseudo label [44] to label the sample pair (described in Section 4.4.2) and design the unsupervised cross-entropy loss $\mathcal{L}_{pl}$ based on entropy regularization to correct the learning direction of $f_s$ with the confidence. For the augmented data pair, we design the unsupervised normalized temperature-scaled cross-entropy (NT-Xent) [45] loss $\mathcal{L}_{ntx}$ for consistency regularization. The whole training process of HTF-PSA-SSL can be summarized in Algorithm 1.
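For illustration, the teacher update can be written in a few lines of PyTorch. This is a minimal sketch of the standard mean-teacher EMA rule with the decay rate α given in Section 5.1.2, not the authors' released code:

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, alpha: float = 0.99) -> None:
    """Teacher parameters track an exponential moving average of the student's."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
```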

4.2. Augmentation Policy

Considering that insufficient training data potentially leads to inadequate network training, selecting an effective data augmentation policy is of utmost importance. As shown in Figure 2, Fmix [39] augments data with irregular masks, but these irregular masks are not suitable for physically meaningful STFT spectrograms. Motivated by this, we propose a novel radio data augmentation approach based on the HTF mask. It is specifically designed not only for dealing with STFT spectrograms but also for unlabeled data. Simultaneously, we incorporate a pseudo label [44] into the augmented unlabeled data. More details on the pseudo label are discussed in Section 4.4.2.

We aim to construct an augmentation policy that acts on the STFT spectrogram directly, which helps the network learn useful features. Motivated by the goal that these features should be robust to deformations in the time direction, deformations in the frequency information, and partial replacement of small segments of the radio signal, we have chosen the following deformations to make up a policy:

(1) Time masking is applied so that $t$ consecutive time frames $[t_0, t_0 + t)$ are masked, where $t$ is first chosen from a uniform distribution from 0 to the time mask parameter $T$, and $t_0$ is chosen from $[0, \tau - t)$, with $\tau$ being the number of time bins.

(2) Frequency masking is applied so that $f$ consecutive STFT frequency channels $[f_0, f_0 + f)$ are masked, where $f$ is first chosen from a uniform distribution from 0 to the frequency mask parameter $F$, and $f_0$ is chosen from $[0, \nu - f)$, with $\nu$ being the number of frequency channels.

According to the above policy, we design three time masks and three frequency masks based on the Haar feature template, as seen in Figure 3. For each mixing operation, we randomly select two mask indices from a uniform distribution over the six templates and generate a pair of masks based on the selected indices. An example of two augmentations applied to a pair of inputs describes the augmentation process in detail. Given a pair of masks $M_a$ and $M_b$, we can generate the augmented data pair $(\tilde{x}_a, \tilde{x}_b)$ by the strategy below:

$$ \tilde{x}_a = M_a \odot x_a + (\mathbf{1} - M_a) \odot x_b, \qquad \tilde{x}_b = M_b \odot x_b + (\mathbf{1} - M_b) \odot x_a, \tag{3} $$

where $M$ denotes a binary mask indicating regions for dropout and replacement within the two STFT spectrograms, $\mathbf{1}$ is a binary mask filled with ones, and $\odot$ is element-wise multiplication. Each HTF mask has a mean value of 0, and the different shapes work with different extraction functions.
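The following PyTorch sketch illustrates the masking-and-mixing strategy of Equation (3). For brevity, it draws a single random time or frequency stripe instead of selecting among the six Haar templates of Figure 3, so the template selection is simplified and the mask-size parameters are assumptions:

```python
import torch

def htf_mix(xa: torch.Tensor, xb: torch.Tensor, t_max: int = 8, f_max: int = 8):
    """xa, xb: (F, T) spectrograms. Returns the mixed pair per Equation (3)."""
    F, T = xa.shape
    M = torch.ones_like(xa)                       # binary mask, 1 = keep own pixels
    if torch.rand(1).item() < 0.5:                # time mask: consecutive frames
        t = torch.randint(0, t_max + 1, (1,)).item()
        t0 = torch.randint(0, max(T - t, 1), (1,)).item()
        M[:, t0:t0 + t] = 0.0
    else:                                         # frequency mask: consecutive bins
        f = torch.randint(0, f_max + 1, (1,)).item()
        f0 = torch.randint(0, max(F - f, 1), (1,)).item()
        M[f0:f0 + f, :] = 0.0
    # Masked region of one sample is replaced by the co-located region of the other.
    return M * xa + (1 - M) * xb, M * xb + (1 - M) * xa
```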

4.3. Attention Mechanism

Different from the classic attention mechanisms in the CV field, PSA adapts itself to the strip shape of the signal in the STFT spectrogram and helps the network focus more on the important signal features.

Since a convolutional layer has difficulty capturing this strip-shaped relationship due to its limited receptive field, PSA performs two 1D pooling operations along the horizontal and vertical axes of the feature map. Furthermore, the network should be sensitive to the local position of this strip shape. Then, PSA performs a pooling operation along the channel axis of the feature map.

PSA is added after each max pooling layer. As illustrated in Figure 4, given the upper feature map $X \in \mathbb{R}^{C \times H \times W}$ as input, PSA sequentially generates attention maps $A_p$ and $A_s$. The overall process can be summarized as follows:

$$ X' = A_p(X) \otimes X, \qquad X'' = A_s(X') \otimes X', \tag{4} $$

where $\otimes$ refers to the element-wise multiplication operation. During multiplication, the values generated in the spatial step are broadcasted along the channel dimension.

4.3.1. Positional Step

Two 1D average pooling layers are applied to aggregate the feature relationships along the horizontal and vertical axes. Then, we utilize a 2D convolutional layer to reduce the number of feature channels; thus, the computational resources required are appropriately lowered. After that, we calculate the correlation matrix between these axes. Finally, another 2D convolution is employed to reconstruct the attention map, which is multiplied with the input feature to obtain the final attention map. The above operations can be summarized as follows:

$$ A_p(X) = \sigma\big(\mathrm{Conv2d}(X_h \cdot X_w)\big), \tag{5} $$

with

$$ X_h = \mathrm{Conv2d}\big(\mathrm{AvgPool}_h(X)\big), \qquad X_w = \mathrm{Conv2d}\big(\mathrm{AvgPool}_w(X)\big), \tag{6} $$

where $\cdot$ denotes matrix multiplication, $\sigma$ denotes the sigmoid function, and $\mathrm{Conv2d}$ refers to the 2D convolutional layer.
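A hedged sketch of the positional step is given below. The 1×1 kernels and the reduction ratio are assumptions (the paper's kernel sizes are not recoverable from the text); the row/column pooling, channel reduction, correlation by matrix multiplication, and sigmoid reconstruction follow Equations (5) and (6):

```python
import torch
import torch.nn as nn

class PositionalStep(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)    # channel reduction
        self.restore = nn.Conv2d(mid, channels, kernel_size=1)   # map reconstruction

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, C, H, W)
        col = x.mean(dim=3, keepdim=True)                        # 1D pool along W -> (B, C, H, 1)
        row = x.mean(dim=2, keepdim=True)                        # 1D pool along H -> (B, C, 1, W)
        col, row = self.reduce(col), self.reduce(row)
        corr = torch.matmul(col, row)                            # (B, mid, H, W) correlation matrix
        attn = torch.sigmoid(self.restore(corr))
        return x * attn
```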

4.3.2. Spatial Step

We first perform average pooling and max pooling operations along the channel axis and then concatenate the results to generate an efficient feature descriptor. Subsequently, we utilize a convolutional layer to generate the final spatial attention map. Finally, we multiply this map with the input feature. The specific calculation process can be defined as follows:

$$ A_s(X') = \sigma\Big(\mathrm{Conv2d}\big([\mathrm{AvgPool}_c(X');\ \mathrm{MaxPool}_c(X')]\big)\Big), \tag{7} $$

where $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation.
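The spatial step admits a similarly compact sketch; the 7×7 kernel is an assumption borrowed from CBAM [42], as the paper's kernel size is not recoverable here:

```python
import torch
import torch.nn as nn

class SpatialStep(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Two-channel descriptor (avg + max) -> single-channel spatial map.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)                        # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)                       # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                                          # broadcast over channels
```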

4.4. Training Loss

To specifically illustrate the training procedure of HTF-PSA-SSL, we first clarify some notations. Let $\mathcal{D}_l$ and $\mathcal{D}_u$ represent the labeled sample set and unlabeled sample set, respectively, and let $B_l$ and $B_u$ be the batch sizes of the labeled samples and unlabeled samples during the training process. Moreover, the total number of signal classes to be classified is referred to as $K$.

4.4.1. Supervised Cross-Entropy Loss

To efficiently guide the learning direction of the whole network, the cross-entropy loss is introduced to calculate the total loss of the labeled data. We feed a labeled sample $(x_i, y_i)$ into the student network $f_s$. Specifically, $x_i$ is first fed into the extractor to collect abundant useful features, and then these features are fed into the classifier to generate an output prediction, denoted as $p_i$. After that, we use $p_i$ and $y_i$ to calculate the supervised loss as follows:

$$ \mathcal{L}_s = -\frac{1}{B_l} \sum_{i=1}^{B_l} \sum_{k=1}^{K} y_{i,k} \log p_{i,k}. \tag{8} $$
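In PyTorch, this term reduces to the standard cross-entropy call; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (B_l, K) student predictions; labels: (B_l,) ground-truth classes."""
    return F.cross_entropy(logits, labels)
```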

4.4.2. Unsupervised Cross-Entropy Loss

Although the unlabeled sample set is much larger than the labeled sample set in terms of quantity, we also attempt to apply the cross-entropy loss to it. Therefore, for an unlabeled sample $u_i$, we need to construct a fake label, namely, a pseudo label, and suppose that it is the true label of $u_i$. More specifically, we feed $u_i$ into the extractor and classifier in sequence and obtain its corresponding predicted output vector $q_i$. To construct a pseudo label, we assume that this predicted vector is credible.

After constructing pseudo labels for all unlabeled data, we sample a pair $(u_a, u_b)$ and apply the HTF mask to mix it into a pair of new samples according to Equation (3). Then, we fetch the pre-preserved pseudo labels according to the sample indices and generate the mixed pseudo label as follows:

$$ \tilde{q} = \lambda\, q_a + (1 - \lambda)\, q_b, \tag{9} $$

where $\lambda$ denotes the proportion of the spectrogram retained by the mask.

The mixed sample $\tilde{u}$ is then fed into the student network to generate the predicted vector $\tilde{p}$. This prediction is utilized to calculate the total loss of the unlabeled data. Since $\tilde{u}$ is an augmented data sample from $\mathcal{D}_u$, when calculating the final loss, we introduce the mixed cross-entropy loss function, which can be denoted as follows:

$$ \mathcal{L}_{pl} = -\frac{1}{B_u} \sum_{i=1}^{B_u} \sum_{k=1}^{K} \tilde{q}_{i,k} \log \tilde{p}_{i,k}. \tag{10} $$
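A minimal sketch of this pseudo-labeling and mixed cross-entropy computation, assuming soft pseudo labels and a mixing ratio λ equal to the fraction of the spectrogram kept by the mask:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_labels(model: torch.nn.Module, u: torch.Tensor) -> torch.Tensor:
    """Soft pseudo labels q for an unlabeled batch u (Section 4.4.2)."""
    return F.softmax(model(u), dim=1)

def mixed_ce_loss(model, u_mix, q_a, q_b, lam):
    """Cross-entropy against the mixed pseudo label of Equation (9);
    lam is the fraction of the first sample retained by the binary mask."""
    log_p = F.log_softmax(model(u_mix), dim=1)
    q_mix = lam * q_a + (1.0 - lam) * q_b
    return -(q_mix * log_p).sum(dim=1).mean()
```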

Algorithm 1: The training process of HTF-PSA-SSL.

Require: $f_s$: student model with trainable parameters $\theta$
Require: $f_t$: teacher model with parameters $\theta'$ equal to the moving average of $\theta$
Require: $\mathcal{D}_l$: labeled sample set
Require: $\mathcal{D}_u$: unlabeled sample set
Require: $\eta$: learning rate of the student model
Require: $\alpha$: rate of the moving average
Require: $\lambda_{\max}$: weight of the unlabeled loss
Require: $g(t)$: Gaussian ramp-up curve function
Require: $B_l$: batch size of the labeled data
Require: $B_u$: batch size of the unlabeled data
Require: $T$: number of training iterations.
 1: for $t = 1$ to $T$ do
 2:  Sample a labeled batch $(x, y) \sim \mathcal{D}_l$
 3:  Calculate $\mathcal{L}_s$ via Equation (8).
 4:  Sample unlabeled pairs $(u_a, u_b) \sim \mathcal{D}_u$
 5:  Generate pseudo labels $q_a$, $q_b$ with $f_s$
 6:  Mix $(u_a, u_b)$ and $(q_a, q_b)$ via Equations (3) and (9)
 7:  Calculate $\mathcal{L}_{pl}$ via Equation (10).
 8:  Extract the features $z$ of the augmented pair with $f_s$
 9:  Calculate $\mathcal{L}_{ntx}$ via Equation (14).
 10: $\lambda \leftarrow \lambda_{\max}\, g(t)$
 11: $\mathcal{L} \leftarrow \mathcal{L}_s + \lambda\,(\mathcal{L}_{pl} + \mathcal{L}_{ntx})$
 12: $\theta \leftarrow \theta - \eta\, \nabla_{\theta} \mathcal{L}$
 13: $\theta' \leftarrow \alpha\, \theta' + (1 - \alpha)\, \theta$
 14: end for
 15: return $f_s$, $f_t$
4.4.3. Unsupervised NT-Xent Loss

When a percept is slightly changed, its high-dimensional feature representation should change only slightly; the smaller the angle between two such feature vectors, the closer the corresponding classes are. Motivated by this, we apply the contrastive loss to maximize the agreement between different examples augmented from the same signal sample. Specifically, we obtain the latent high-dimensional feature representations generated by the extractor and calculate the NT-Xent loss between them. To minimize NT-Xent, we use cosine similarity to measure the similarity between two augmented samples $z_i$ and $z_j$. The cosine similarity measure is defined as follows:

$$ \mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\lVert z_i \rVert\, \lVert z_j \rVert}, \tag{11} $$

where $\lVert z \rVert$ denotes the $\ell_2$ norm of $z$. Then, the loss for a positive pair of samples $(i, j)$ is defined as follows:

$$ \ell(i, j) = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2B_u} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}, \tag{12} $$

with

$$ \mathbb{1}_{[k \neq i]} = \begin{cases} 1, & k \neq i, \\ 0, & k = i, \end{cases} \tag{13} $$

where $\tau$ represents the temperature when calculating the cosine similarity value and $\mathbb{1}_{[k \neq i]}$ is an indicator function that is equal to 1 iff $k \neq i$. To obtain the final loss, the average loss values of all positive sample pairs, including both $\ell(i, j)$ and $\ell(j, i)$, are calculated, which can be denoted as follows:

$$ \mathcal{L}_{ntx} = \frac{1}{2B_u} \sum_{m=1}^{B_u} \big[\ell(2m-1, 2m) + \ell(2m, 2m-1)\big]. \tag{14} $$
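A compact PyTorch sketch of Equations (11) through (14), following the SimCLR formulation cited in [45]; stacking the two views back to back and masking the diagonal reproduces the indicator of Equation (12):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (B, D) features of the two augmented views of the same samples."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2B, D), unit norm
    sim = z @ z.t() / tau                                    # cosine similarities / tau
    n = z.shape[0]
    sim.fill_diagonal_(float("-inf"))                        # exclude the k = i terms
    targets = torch.arange(n, device=z.device).roll(n // 2)  # positive of i is i +/- B
    return F.cross_entropy(sim, targets)                     # averages l(i,j) and l(j,i)
```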

4.4.4. Final Loss

The final total loss function of HTF-PSA-SSL can be defined as follows:

$$ \mathcal{L} = \mathcal{L}_s + \lambda_{\max}\, g(t)\, \big(\mathcal{L}_{pl} + \mathcal{L}_{ntx}\big), \tag{15} $$

where the weight $\lambda_{\max}$ is a hyperparameter that balances the labeled loss and the unlabeled loss. The function $g(t)$ is the Gaussian ramp-up curve function [46], which can be defined as follows:

$$ g(t) = \begin{cases} \exp\big(-5\,(1 - t/T_r)^2\big), & t < T_r, \\ 1, & t \geq T_r, \end{cases} \tag{16} $$

where $t$ is the number of current epochs and $T_r$ refers to the starting epoch at which the unlabeled weight becomes equal to $\lambda_{\max}$. The application of $g(t)$ ensures that the training process with the labeled data is not disturbed even in the presence of the unlabeled loss. Moreover, its slow increase helps the pseudo labels of the unlabeled data become closer to the true labels.
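A minimal sketch of Equations (15) and (16), assuming λ_max = 20 and T_r = 30 as in Section 5.1.2; grouping both unlabeled terms under a single ramped weight follows Equation (15):

```python
import math

def ramp_up(epoch: int, ramp_epochs: int = 30) -> float:
    """Gaussian ramp-up curve g(t) of Equation (16), following [46]."""
    if epoch >= ramp_epochs:
        return 1.0
    phase = 1.0 - epoch / ramp_epochs
    return math.exp(-5.0 * phase * phase)

def total_loss(l_sup, l_unsup_ce, l_ntx, epoch, lam_max: float = 20.0):
    """Weighted total loss of Equation (15)."""
    w = lam_max * ramp_up(epoch)
    return l_sup + w * (l_unsup_ce + l_ntx)
```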

The whole training process of HTF-PSA-SSL can be summarized in Algorithm 1.

5. Experiments

We evaluate the effectiveness of the proposed HTF-PSA-SSL approach and compare it with the other existing SSL-based methods on the public RML2016.10a dataset. To further demonstrate the robustness of the proposed HTF-PSA-SSL method, we also examine its performance on the other public datasets, RML2016.10b and RML2016.04c.

5.1. Datasets and Training Environment
5.1.1. Dataset Descriptions

RML2016.10a is a synthetic dataset consisting of 11 modulation types (8 digital and 3 analog): 8PSK, AM-DSB, AM-SSB, BPSK, CPFSK, GFSK, PAM4, QAM16, QAM64, QPSK, and WBFM. For each modulation type, there are 20 different SNRs varying from −20 to +18 dB with an interval of 2 dB. For each SNR, there are 1,000 signals. Each signal is composed of I and Q parts, and its size is 2 × 128. In this paper, for each raw IQ signal, we apply the STFT to generate a 2D spectrogram and feed it into the network as our input.

5.1.2. Implementation Details

We split the dataset into three parts, training, validation, and testing sets, according to the ratio of 8:1:1. Then, we select a certain percentage of samples from the training set as the labeled dataset and treat the remaining samples of the training set as the unlabeled dataset. Specifically, we randomly divide each group of 1,000 signals into three parts: 800 signals for training, 100 signals for validation, and 100 signals for testing. The training set is then divided into two parts: 88 signals for the labeled dataset and 712 signals for the unlabeled dataset. The other datasets are divided in the same way.
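For reproducibility, the per-group split described above can be sketched as follows; the seed and the use of NumPy are incidental choices:

```python
import numpy as np

def split_group(indices: np.ndarray, rng: np.random.Generator):
    """8:1:1 split of one 1,000-signal group, then 88/712 labeled/unlabeled."""
    rng.shuffle(indices)
    train, val, test = indices[:800], indices[800:900], indices[900:]
    labeled, unlabeled = train[:88], train[88:]
    return labeled, unlabeled, val, test

rng = np.random.default_rng(0)
labeled, unlabeled, val, test = split_group(np.arange(1000), rng)
```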

The adaptive moment estimation (Adam) optimizer is applied for all of our experiments, and each experiment runs for 150 epochs. The initial learning rate is set to 0.001, which is then adjusted with cosine decay. The moving average rate of the EMA is set to 0.99. The batch size is set to 64 for both the labeled dataset and unlabeled dataset. The weight of the unlabeled loss is set to 20, and the number of epochs for the ramp-up function is set to 30. Our experiments are implemented in PyTorch with Python 3.7 using two Nvidia 3090 graphics processors.
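The optimizer setup can be expressed directly in PyTorch; the tiny stand-in network below is a placeholder assumption for the actual CNN backbone:

```python
import torch

# Stand-in for the actual CNN backbone (assumption; input size 2 x 128, K = 11).
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(2 * 128, 11))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)

for epoch in range(150):
    # ... one epoch of HTF-PSA-SSL training with the losses of Section 4.4 ...
    scheduler.step()
```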

5.2. Comparison with Supervised Methods

We consider two supervised scenarios: supervised (100%) and supervised (1%). Supervised (100%) means that we train the network using the full training set with labels. Supervised (1%) refers to training the network using only 1% of the whole training set. The comparison results obtained under 6 dB signals are shown in Table 1. The performance of HTF-PSA-SSL lies between those of the two supervised methods: it is 0.91% lower than supervised (100%) but 16% higher than supervised (1%). This shows that HTF-PSA-SSL can effectively make use of unlabeled data to improve its recognition performance.

Then, we evaluate the performance of the three methods from −20 to +18 dB, and the comparison results are shown in Figure 5. It is obvious that HTF-PSA-SSL keeps approaching the supervised (100%) method and outperforms the supervised (1%) method by 13.88% on average. This further verifies the powerful feature extraction ability of HTF-PSA-SSL, which can extract sufficient reliable features from an unlabeled dataset and continuously improve its model performance. We also compare the performance of HTF-PSA-SSL with several machine learning (ML) methods, which are trained using the full training set with labels. As shown in Figure 5, at an SNR of 16 dB, HTF-PSA-SSL outperforms the support vector machine (SVM) [47] by 52% and the random forest (RF) [48] by 50% in terms of recognition accuracy. Moreover, we show the feature visualization of instantaneous statistical features [47], entropy features [48], and high-dimensional features (HTF-PSA-SSL) using t-distributed stochastic neighbor embedding (t-SNE) [49]. The feature distribution under 12 dB signals is shown in Figure 6. It is obvious that the features of HTF-PSA-SSL are well-aggregated, while the instantaneous statistical features and entropy features are scattered. Since these manual features are more severely confused than the high-dimensional features, the performance obtained by SVM and RF is also lower than that of HTF-PSA-SSL. However, HTF-PSA-SSL has a higher computational complexity; the comparison of computational complexity is presented in Table 2.

From the confusion matrix obtained under −6, 0, and 12 dB signals, drawn in Figure 7, we observe that with increasing SNR, the recognition accuracies achieved for most modulation types are improved. However, AM-DSB and WBFM are heavily confused even at 12 dB, which indicates that this pair of modulation classes is difficult to correctly recognize.

5.3. Efficiency of the HTF Mask and PSA

As shown in Figure 8, the HTF mask augmentation method achieves the best classification accuracy compared with the Mixup [37] and Fmix [39] methods from −20 to +18 dB. More specifically, the HTF mask method achieves higher accuracy than Mixup owing to its mask-based augmentation form, and it performs better than Fmix because it takes the time and frequency correspondence of the STFT spectrogram into account. For instance, the HTF mask method outperforms Mixup by nearly 12% and surpasses Fmix by almost 4% on average from −20 to +18 dB. Under 16 dB, the HTF mask achieves the highest accuracy, 93.18%.

Then, we simply replace PSA with the other well-known attention methods, SE [40], CBAM [42], and Fca [41], and evaluate their performance under the same experimental settings. The experimental results are shown in Figure 9. The recognition accuracy of PSA is higher than that of the other three attention mechanisms by 2.67%, 5.13%, and 7.75% on average, especially when the SNR is low. This is due to the strong detail extraction ability of PSA, which establishes the horizontal and vertical long-term dependencies of features to effectively capture the detailed information contained in signals.

5.4. Ablation Study

To showcase the efficiency of the proposed HTF mask and PSA method, we perform a range of ablation experiments at 10 different SNRs. Furthermore, we evaluate the contribution of each term of the training loss. The corresponding results are all listed in Table 3.

5.4.1. HTF Mask

As shown in Table 3, augmentation with only the frequency mask or only the time mask reaches higher classification accuracy than the no-augmentation baseline Eps13 [44]. Furthermore, by augmenting data in both the time and frequency domains, HTF-PSA-SSL achieves superior results, as demonstrated in Table 3. For example, under −4 dB, HTF-PSA-SSL outperforms Eps13 by 22.31%, Eps13 with the frequency mask by 16.18%, and Eps13 with the time mask by 10.63%. These results show the effectiveness of the HTF mask in augmenting spectrogram data such as the STFT spectrogram.

5.4.2. PSA Method

From Table 3, we can see that models with spatial attention or positional attention alone achieve higher classification accuracy than the no-attention baseline CNN5. However, when both the positional attention and the spatial attention are applied, HTF-PSA-SSL reaches the highest accuracy. For instance, under 0 dB, HTF-PSA-SSL outperforms CNN5 by 13.68%, CNN5 with spatial attention by 6.28%, and CNN5 with positional attention by 3.73%. These experiments indicate that the PSA can efficiently enhance the strip-shaped features of the STFT spectrogram.

5.4.3. Training Loss

As shown in Table 3, the absence of the unsupervised cross-entropy loss leads to a significant decline in classification accuracy. This indicates that $\mathcal{L}_{pl}$ plays an important role when training the network. Moreover, when both the unsupervised cross-entropy loss and the unsupervised NT-Xent loss are added, HTF-PSA-SSL achieves the best classification accuracy. For example, under 4 dB, HTF-PSA-SSL outperforms the supervised cross-entropy loss alone by 18.09% and $\mathcal{L}_s + \mathcal{L}_{pl}$ by 1.18%. This further shows that the loss function of HTF-PSA-SSL is indeed useful.

5.5. Comparison with Other SSL-Based Methods

In this experiment, we evaluate the performance of three SSL methods applied in the signal field: SSRCNN [19], SemiAMC [33], and EDCT [31]. As shown in Figure 10, HTF-PSA-SSL reaches higher recognition accuracy than the other three SSL-based methods; it outperforms SSRCNN by 35.84%, SemiAMC by 35.28%, and EDCT by 13.01% on average. This fully demonstrates the strong performance of the proposed HTF-PSA-SSL technique. It can extract more critical information from spectrograms and screen out the practical features from this information.

5.6. Robustness of the Proposed HTF-PSA-SSL Method

In this part, we study the robustness of HTF-PSA-SSL by evaluating its recognition accuracy on three public datasets, RML2016.10a, RML2016.10b, and RML2016.04c, and the specific recognition accuracies are shown in Table 4. RML2016.10b contains 10 modulation classes, but the total size of the dataset is much larger than that of RML2016.10a. For each SNR, each modulation type has 6,000 signal samples. RML2016.04c has the same modulation classes as RML2016.10a, but the number of samples for each modulation type is different, ranging from 207 to 1,248. For each SNR, there are 8,103 signal samples in total, including all modulation classes. We can determine that HTF-PSA-SSL achieves the best recognition accuracy on these three public datasets, which indicates that HTF-PSA-SSL is robust and can achieve stable recognition performance, regardless of whether the given dataset is large or small and whether the numbers of samples in different classes are balanced or not.

As shown in Table 5, we evaluate the recognition accuracy at label rates of 1% (88), 5% (440), and 10% (880) at several SNRs. The results show that as the amount of labeled data increases, the recognition accuracy of HTF-PSA-SSL also gradually improves, but the improvement fluctuates around 1%. This indicates that increasing the amount of labeled data can improve the recognition accuracy of HTF-PSA-SSL, but the improvement is not obvious. This is because the HTF mask improves the robustness of the network: even with only a small amount of labeled data, the recognition accuracy is close to that of supervised learning using the whole training set.

To visualize the features of the test signal samples, we obtain the intermediate features of the classifier and utilize t-SNE for dimensionality reduction. The sample point distribution under 0 dB signals is shown in Figure 11. It is obvious that the features of supervised (100%) and HTF-PSA-SSL are well-aggregated, while those of supervised (1%) are scattered, with QAM16 heavily confused with QAM64 and QPSK with 8PSK, further indicating that HTF-PSA-SSL is highly reliable. From these figures, we can additionally conclude that WBFM and AM-DSB are difficult to recognize for HTF-PSA-SSL as well as for EDCT, SSRCNN, and SemiAMC.

5.7. Computation Complexity

In Table 2, the FLOPs, parameter volume, and memory usage of the compared methods are presented.

5.7.1. FLOPs

Compared with the other methods, HTF-PSA-SSL has the highest FLOPs, 2.53 G. This is because we filter redundant information from the time domain, the frequency domain, and the global time–frequency domain, which requires a certain amount of computation. We also compare the time complexity of SVM, RF, and HTF-PSA-SSL. From Table 6, we can see that HTF-PSA-SSL has the highest time complexity. In forthcoming research, we will explore strategies to reduce the computational overhead of the network, such as binary neural networks.

5.7.2. Parameters

In terms of parameters, Fca has the highest value, 1.85 M, while HTF-PSA-SSL follows closely as the second highest with 1.82 M. This increase can be attributed to the PSA's powerful filtering of key information in both the time and frequency domains. Compared to SE and CBAM, HTF-PSA-SSL is slightly higher because it also filters global time–frequency information; these domains collaboratively eliminate redundant information. Compared to the remaining methods, HTF-PSA-SSL is much higher. This difference can be attributed to the model's necessary complexity, which allows HTF-PSA-SSL to extract features from large amounts of unlabeled data and helps correct the network's learning direction. In future work, we will continue to investigate methods for significantly reducing the computational complexity of the network while maintaining high recognition performance.

5.7.3. Memory

Similar to the model parameters, the memory usage of HTF-PSA-SSL also ranks second at 7.4 M. This is because the number of parameters of the network itself is large. This is also a direction for our future improvement.

6. Conclusion

In this paper, an AMR framework based on SSL is proposed to achieve improved modulation recognition accuracy by effectively utilizing large amounts of unlabeled data. While using only a small amount of labeled data, the framework can significantly improve its recognition performance. The proposed HTF mask data augmentation method can effectively mix unlabeled data and expand the total amount of unlabeled data to improve the overall generalization performance of the convolutional network. The designed attention mechanism, PSA, can be plugged into any convolutional layer in a plug-and-play manner to compensate for the limited receptive field of the convolutional layer and enhance the feature extraction ability of the convolutional network. Compared with SSRCNN, SemiAMC, and EDCT, HTF-PSA-SSL achieves 35.84%, 35.28%, and 13.01% higher accuracy on average. Compared with supervised (1%), HTF-PSA-SSL improves the recognition accuracy by 13.88% on average. Extensive experiments and comparisons conducted on public datasets show that the proposed framework can effectively use a large amount of unlabeled data and accurately predict the modulation types of unknown signals with very little labeled data.

Data Availability

The data used to support this study are public datasets. They can be downloaded from http://radioml.com [14].

Conflicts of Interest

The authors declare that they have no conflicts of interest.