Abstract

As network-supporting devices and sensors in the Internet of Things leap forward, countless real-world data will be generated for intelligent applications. Speech sensor networks, an important part of the Internet of Things, have numerous application needs. Indeed, sensor data can help intelligent applications provide higher-quality services, whereas these data may contain considerable noise. Accordingly, speech signal processing methods are urgently required to acquire low-noise and effective speech data. Blind source separation and enhancement techniques are among the representative methods. However, in unsupervised, complex environments where only a single-channel signal is available, separating single-channel, multispeaker mixed speech poses many technical challenges. For this reason, this study develops an unsupervised speech separation method, CNMF+JADE, i.e., a hybrid method combining Convolutional Non-Negative Matrix Factorization and Joint Approximate Diagonalization of Eigenmatrices. Moreover, an adaptive wavelet transform-based speech enhancement technique is proposed, capable of adaptively and effectively enhancing the separated speech signal. The proposed method aims to yield a general and efficient speech processing algorithm for the data acquired by speech sensors. As revealed by the experimental results on the TIMIT speech corpus, the proposed method can effectively extract the target speaker from mixed speech with only a tiny training sample. The algorithm is highly general and robust, capable of technically supporting the processing of speech signals acquired by most speech sensors.

1. Introduction

As information technology advances and 5G technology is popularized, Internet of Things (IoT) devices and sensors will be deployed on an increasingly large scale, which will undoubtedly change the way human beings live. Moreover, sensor networks are being progressively studied [1-3]. It is predicted that in the next decade, billions of IoT and sensor devices will generate massive data for applications in smart grid, smart home, electronic health, Industry 4.0, etc. It is foreseeable that intelligent speech systems will be critical to the mentioned areas. The rapid growth of data volume brings more opportunities but also demands effective solutions to large-scale problems [4, 5]. Speech sensor networks, an important part of the IoT, will have many application needs. However, in real-world scenarios, the data acquired by speech sensors are often disturbed by noise. Thus, obtaining low-noise and effective speech data is an urgent need.

With the increasing number of speech sensors, reliable speech separation technology is required [6-8]. Highly reliable speech separation enables effective speech recognition, so the needs of human hearing can be satisfied. Speech separation originates from blind source separation (BSS) [9]. The core goal of this technology is to recover the source signals from the measured mixed signal. In the blind source analysis task, the target speech should be separated from the mixed speech in a single channel, which is very difficult to achieve. Single-channel speech separation is therefore a hotspot in current research. Many algorithms have been proposed for it, but judging from the current results, the problem is far from being well solved. We believe the current challenges are mainly manifested in the following aspects.

(1) Strong noise and an unknown number of sources still significantly limit the performance of BSS. Most existing blind source separation algorithms achieve ideal performance only in high-SNR (Signal-to-Noise Ratio) environments. In practical applications, the collected signal may be polluted by strong noise, so many reported blind source separation algorithms are likely to yield poor separation performance and, in extreme cases, cannot correctly deal with severely distorted signals. For this reason, a more effective method of suppressing the impact of noise is required to obtain a robust blind source separation algorithm. A more difficult problem is that the number of sources is unknown. Most methods assume the number of sources is given, whereas in practical applications this information is not available, which cannot be ignored [10]. Accordingly, blindly estimating the number of sources from the received mixed signal cannot effectively attain ideal BSS performance.

(2) The processing complexity of single-channel speech separation is higher than that of multi-input speech separation. In numerous practical applications, the challenge of blind source separation is that only one sensor is available, namely, SCBSS (single-channel blind source separation) [11-13], which uses a single receiver sensor to record the observed signal and then recovers each source signal from it. The generative adversarial network (GAN) is an excellent representative of deep learning algorithms and has also been applied to SCBSS owing to its advantages in fitting data distributions (e.g., 1D speech signal separation [14-16]). However, the performance of GAN is limited by the typical characteristics of SCBSS: an unknown number of source signals, complex forms of dialogue, serious noise pollution, and difficulty in obtaining prior information in advance. To solve this type of problem, unsupervised learning methods should be developed, although automatic analysis is extremely difficult to realize with unsupervised learning (overall, single-channel speech requires only a single signal source, which is easier to acquire and more realistic than multichannel speech).

(3) One solution to the BSS problem is to employ a supervised learning mechanism, the most representative being deep learning.
It has recently been found that deep learning [17, 18] has achieved remarkable success in many speech processing fields with its excellent learning performance. The representative technology is the DNN-HMM hybrid structure [19, 20], which replaces conventional acoustic modeling based on GMM and HMM. In single-channel speech separation, DNN-based methods [21, 22] have been proposed to separate the target speaker from mixed speech. However, these deep learning algorithms use a joint decoding framework, which incurs additional computational complexity. Moreover, deep learning algorithms need considerable training data, which makes them difficult to extend to small datasets and unsupervised speech separation scenarios.

To address the above challenges, an unsupervised speech separation method, CNMF+JADE, is proposed in this study, i.e., a hybrid method combining Convolutional Non-Negative Matrix Factorization [23, 24] and Joint Approximate Diagonalization of Eigenmatrices [25]. This study aims to perform efficient processing of the highly noisy signal data acquired by speech sensors to achieve better separation performance. CNMF is a nonnegative matrix factorization method proposed for speech signal processing. The method adopts a 2D time-frequency basis instead of the 1D basis vectors of the original nonnegative matrix factorization, while ensuring that the factorization result remains nonnegative. Thus, it effectively carries the correlation between local frames of speech signals [26]. JADE is an adaptive batch independent component optimization algorithm based on multivariate fourth-order cumulant matrices, and it is an effective method for blind source separation. It exploits the fact that cross-cumulants are always zero when signals are independent and builds multiple fourth-order cumulant matrices from multivariate data. Lastly, these cumulant matrices are jointly diagonalized to solve for the final separated signals [27, 28]. For a single-channel signal, CNMF+JADE can effectively separate overlapped speech including the target speaker. Subsequently, CNMF+JADE with adaptive speech enhancement technology is adopted to further improve the speech quality of the target speaker. For the SCBSS problem, the main innovations can be summarized as follows.

(1) In this study, CNMF and JADE are combined to solve the problem of single-channel speech separation. The algorithm is appropriate for extracting signals of interest from mixed signals. In terms of SNR, STOI (Short-Time Objective Intelligibility), and PESQ (Perceptual Evaluation of Speech Quality), the proposed CNMF+JADE, compared with several speech separation methods (CNMF, CNMF+ICA), achieves satisfactory results, especially for single-channel mixed speech.

(2) Given that speech quality may become worse when the speech signal is enhanced after speech separation, an adaptive method based on the wavelet transform is presented here to analyze the speech signal after CNMF+JADE separation, as an attempt to realize selective speech enhancement and increase the efficiency of speech enhancement.

The rest of the study is organized as follows. In Section 2, related studies on single-channel speech separation are presented. In Section 3, the proposed algorithm is elucidated. In Section 4, a specific experimental verification of the performance of the proposed algorithm is presented. Lastly, in Section 5, the conclusion is drawn and promising future research directions are outlined.

2. Related Work

As IoT technology develops, intelligent voice systems will have increasingly broad application prospects, and single-channel blind speech separation (SCBSS) technology will arouse wide attention. At present, there are three main directions in SCBSS research.

(1) Subspace Decomposition-Based Approach [29]. Methods based on subspace decomposition primarily aim at identifying new descriptions of the signal. Such new descriptions can often effectively extract perceptually meaningful component sources from complex mixtures [30]. Moreover, they can eliminate intrusions and reduce signal dimensionality, so redundant components can be avoided. Subspace decomposition methods are primarily well established on statistical and transformed data. For instance, in the literature [31], the effectiveness of Principal Component Analysis (PCA) and Independent Component Analysis (ICA) in solving subspace decomposition problems has been verified. In fact, methods based on algebraic properties are used more often for subspace decomposition problems, including Non-negative Matrix Factorization (NMF) [32]. NMF is a classical time-frequency decomposition method and is often used for single-channel speech separation [33-37]. Refs. [38, 39] highlighted NMF as an unsupervised dictionary-based learning method that effectively helps solve various types of signal separation.

(2) Model-Based Approach. In the first step of the model-based approach, each speaker in the scene should be identified, and the gain in the blended frames should be determined. In fact, speaker recognition algorithms have been studied by many authors (e.g., Iroquois [40], Closed loop [41], and Adaptive Speaker Identification (SID) [42]). The next step is to choose an appropriate speech representation. The final step comprises the reconstruction of the speech signal frames, in which the separated speech is produced. Overall, the reconstruction usually requires a hybrid estimator module that can find a sufficient number of representative speech frames from the speaker model to rebuild a meaningful speech signal. However, mixture estimators can significantly complicate the algorithm, making it difficult to apply in real-time systems.

(3) Computational Auditory Scene Analysis- (CASA-) Based Approach. CASA runs in two main stages, i.e., segmentation and grouping. The former comprises feature extraction, time-frequency analysis, and multipitch tracking, while the latter includes the resynthesis of speech signals. In particular, pitch tracking is an important technique when CASA is applied to the SCBSS problem. Jin [43] and Tolonen [44] provided several pitch tracking methods that are used extensively. However, owing to the periodic nature of the grouping phase, such methods are limited to voiced speech segments. Moreover, the performance of CASA-based methods tends to be affected by multipitch estimation because of their dependence on pitch.

Over the past few years, with the development of deep learning, researchers have suggested that the nonlinear processing and feature learning capabilities of deep models exhibit significant advantages in solving speech separation problems. For this reason, many models using deep learning for speech separation have been proposed (e.g., the Deep Neural Network (DNN), the Deep Stacking Network (DSN) [45], and other efficient deep learning models [46-50]). In addition, numerous deep learning algorithms have been proposed specifically for single-channel speech separation [51-54]. The reason why deep learning is so effective in addressing speech separation problems is that the separation problem is described as a supervised problem in the deep learning model. Thus, deep learning models can train on and learn features from speech signals to effectively separate them.

3. Methodology

In the present section, the methods we use for processing speech data are described, and a new algorithm with high generality and robustness is proposed, aiming to provide a general and efficient speech processing algorithm for the data acquired by speech sensors.

3.1. Speech Separation

(1) CNMF. Speech signals exhibit local interframe correlation and global interframe correlation. The conversion of local interframe correlation should consider two aspects: ensuring the continuity between frames of the converted channel spectrum, and removing the source speaker's features from the local interframe correlation while endowing it with the target speaker's features. However, conventional nonnegative matrix factorization does not consider the conversion of local frames. CNMF is a nonnegative matrix factorization method proposed for speech signal processing. The method employs a 2D time-frequency basis instead of the 1D basis vectors of the original nonnegative matrix factorization while ensuring the nonnegativity of the factorization result. Thus, the correlation between the local frames of the speech signal is carried effectively.

The CNMF is expressed as follows:

$$\mathbf{V} \approx \boldsymbol{\Lambda} = \sum_{t=0}^{T-1} \mathbf{W}(t)\,\overset{t\rightarrow}{\mathbf{H}},$$

where $\mathbf{W}(t)$ and $\mathbf{H}$ represent the time-frequency atoms and the corresponding time-varying gain coefficients, respectively. $\overset{t\rightarrow}{\mathbf{H}}$ denotes shifting the encoding matrix $\mathbf{H}$ by $t$ units to the right in the form of column vectors and setting the leftmost $t$ columns to 0.

In other words, the target matrix $\mathbf{V}$ is approximated by convolving a series of nonnegative basis matrices $\mathbf{W}(t)$ with the coefficient matrix $\mathbf{H}$. The function of CNMF is to find the basis matrices $\mathbf{W}(t)$ and the coefficient matrix $\mathbf{H}$ that make the convolution result as close as possible to the target matrix $\mathbf{V}$.

In addition, the Kullback-Leibler divergence acts as the cost function in CNMF:

$$D\left(\mathbf{V}\,\|\,\boldsymbol{\Lambda}\right) = \sum_{i,j}\left(V_{ij}\ln\frac{V_{ij}}{\Lambda_{ij}} - V_{ij} + \Lambda_{ij}\right),$$

where $\boldsymbol{\Lambda}$ denotes the estimation of $\mathbf{V}$. Minimizing $D(\mathbf{V}\,\|\,\boldsymbol{\Lambda})$ yields the maximum log-likelihood solution for the nonnegative matrices $\mathbf{W}(t)$ and $\mathbf{H}$ under the Poisson noise assumption and describes the degree of approximation of $\boldsymbol{\Lambda}$ with respect to $\mathbf{V}$. The iterative update functions can be defined as follows:

$$\mathbf{W}(t) \leftarrow \mathbf{W}(t) \otimes \frac{\dfrac{\mathbf{V}}{\boldsymbol{\Lambda}}\cdot\left(\overset{t\rightarrow}{\mathbf{H}}\right)^{T}}{\mathbf{1}\cdot\left(\overset{t\rightarrow}{\mathbf{H}}\right)^{T}}, \qquad \mathbf{H} \leftarrow \mathbf{H} \otimes \frac{\mathbf{W}(t)^{T}\cdot\overset{\leftarrow t}{\left(\dfrac{\mathbf{V}}{\boldsymbol{\Lambda}}\right)}}{\mathbf{W}(t)^{T}\cdot\mathbf{1}},$$

where $\mathbf{1}$ indicates the matrix with all elements equal to 1, $\otimes$ is the matrix element multiplication operator, and $\overset{\leftarrow t}{(\cdot)}$ shifts the columns of its argument $t$ units to the left. When $T=1$, i.e., $t$ is only 0, CNMF degenerates into the basic NMF decomposition. For each $t$, there is a basis matrix $\mathbf{W}(t)$ corresponding to it.
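To make the update rules concrete, the following is a minimal NumPy sketch of the CNMF multiplicative updates described above, assuming $\mathbf{V}$ is a nonnegative magnitude spectrogram; the function and variable names are ours, and the averaged update of $\mathbf{H}$ over all shifts is one common implementation choice rather than the paper's exact procedure.

```python
import numpy as np

def shift_right(H, t):
    """Shift columns of H by t positions to the right, zero-filling on the left."""
    if t == 0:
        return H
    out = np.zeros_like(H)
    out[:, t:] = H[:, :-t]
    return out

def shift_left(X, t):
    """Shift columns of X by t positions to the left, zero-filling on the right."""
    if t == 0:
        return X
    out = np.zeros_like(X)
    out[:, :-t] = X[:, t:]
    return out

def cnmf(V, R, T, n_iter=200, eps=1e-9, seed=0):
    """Convolutive NMF: approximate V (F x N) by sum_t W[t] @ shift_right(H, t)."""
    F, N = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((T, F, R)) + eps      # T slices of time-frequency bases
    H = rng.random((R, N)) + eps         # encoding (gain) matrix
    ones = np.ones_like(V)
    for _ in range(n_iter):
        Lam = sum(W[t] @ shift_right(H, t) for t in range(T)) + eps
        ratio = V / Lam
        for t in range(T):               # multiplicative update of each W(t)
            Ht = shift_right(H, t)
            W[t] *= (ratio @ Ht.T) / (ones @ Ht.T + eps)
        Lam = sum(W[t] @ shift_right(H, t) for t in range(T)) + eps
        ratio = V / Lam
        num = sum(W[t].T @ shift_left(ratio, t) for t in range(T))
        den = sum(W[t].T @ ones for t in range(T)) + eps
        H *= num / den                   # H update averaged over all T shifts
    return W, H
```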

3.1.1. JADE

The Joint Approximate Diagonalization of Eigenmatrices (JADE) algorithm is an adaptive batch independent component optimization algorithm based on multivariate fourth-order cumulant matrices and an effective method for blind source separation. JADE mainly uses Jacobi-type joint diagonalization to find the independent components, thereby achieving the identification and separation of signals. Based on these characteristics, JADE is introduced here to effectively separate the acquired speech signals.

The JADE algorithm first spheres the observed signal using a spherization (whitening) matrix to obtain the whitened observation vector $\mathbf{z}$ for $n$ channels. Then, let $\mathbf{M}$ be any $n \times n$ matrix; the fourth-order cumulant matrix $\mathbf{Q}_{\mathbf{z}}(\mathbf{M})$ of $\mathbf{z}$ is defined as

$$\left[\mathbf{Q}_{\mathbf{z}}(\mathbf{M})\right]_{ij} = \sum_{k,l=1}^{n} \operatorname{cum}\left(z_i, z_j, z_k, z_l\right) M_{kl},$$

where $\operatorname{cum}(z_i, z_j, z_k, z_l)$ denotes the fourth-order cumulant of the $i$th, $j$th, $k$th, and $l$th components of the vector $\mathbf{z}$.
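For real-valued signals, the cumulant matrix above can be estimated from samples as in the following sketch (the notation is ours); the full JADE algorithm then jointly diagonalizes a set of such matrices, e.g., by Jacobi rotations, to obtain the unmixing matrix.

```python
import numpy as np

def whiten(x):
    """Sphere the observed signals: rows of x are channels; returns z with
    zero mean and (approximately) identity covariance, plus the whitening matrix."""
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    Wm = E @ np.diag(1.0 / np.sqrt(d)) @ E.T   # spherization matrix
    return Wm @ x, Wm

def cumulant_matrix(z, M):
    """Sample estimate of [Q_z(M)]_ij = sum_{k,l} cum(z_i, z_j, z_k, z_l) M_kl
    for real-valued, whitened signals z (n channels x T samples)."""
    n, T = z.shape
    R = (z @ z.T) / T                          # covariance, ~ identity after whitening
    quad = np.einsum('kt,kl,lt->t', z, M, z)   # z(t)^T M z(t) for every sample
    E4 = (z * quad) @ z.T / T                  # E[z_i z_j (z^T M z)]
    return E4 - R * np.trace(M @ R) - R @ M @ R.T - R @ M.T @ R.T
```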

3.1.2. CNMF+JADE

However, for the same channel spectral matrix $\mathbf{V}$, the final $\mathbf{W}(t)$ and $\mathbf{H}$ obtained by CNMF analysis differ when their initial values differ; i.e., the same time-frequency spectral matrix admits multiple combinations of time-frequency bases and coding matrices. For this reason, if the parallel channel spectral matrices of the source and target speakers are analyzed independently by convolutional nonnegative matrix factorization, there is no guarantee of obtaining the same encoding matrix that characterizes the content information. As analyzed in Section 3.1.1, JADE is an adaptive batch independent component optimization algorithm based on multivariate fourth-order cumulant matrices and an effective method for blind source separation, capable of effectively identifying and separating signals so that the recovered signals are as close to the true sources as possible.

Accordingly, to efficiently process the speech signals collected by speech sensors, a single-channel speech separation algorithm combining CNMF and JADE is proposed. A secondary separation based on JADE is performed on the speech signal separated by CNMF. The role of the CNMF+JADE algorithm is to separate the single-channel mixed speech and finally acquire the separated speech signal of every speaker in the mixture. The algorithm exhibits strong generality and robustness and can technically support the processing of speech signals collected by most speech sensors. For instance, in the literature [55], several applications (e.g., beamforming, automatic camera steering, robotics, and surveillance) are processed with a speech separation method. In [56], a speech signal separation method is adopted for noise-robust speech translation on general-purpose smart devices. It is foreseeable that speech separation techniques will also be critical to future applications of IoT technologies (e.g., driverless vehicles, smart homes, and other applications involving sound conduction functions). For this reason, proposing more efficient speech separation algorithms, such as the CNMF+JADE algorithm in this study, is of great value and significance.

Lastly, the CNMF+JADE algorithm is described as follows.

Input: Speech signal datasets T, H, and O.
1: Initialize each parameter and variable:
T = {T_1, T_2, ..., T_N} expresses the set of all the pure speech signal data of the speakers waiting to be separated,
H = {H_1, H_2, ..., H_N} expresses the set of all the mixed speech that is used as the training set,
O denotes a mixed speech waiting to be separated,
W, the CNMF basis, is initialized as a random matrix.
2: while i < N do
3:  The speech data T_i and H_i with the identical subscript i are selected from the datasets T and H to train the CNMF.
4:  The trained CNMF is employed to separate the mixed speech O to determine S_i^1 and S_i^2.
5:  The two speech signals acquired from step 4 are mixed to obtain a two-channel speech signal stored in X_i.
6:  A secondary separation is conducted by adopting JADE to obtain Y_i^1 and Y_i^2 from X_i.
7:  Y_i^2 is used as the speech signal to be separated in the next round, and T_i and H_i are removed from the datasets T and H.
8:  Obtain the final separated speech signal s_i = Y_i^1.
9:  i = i + 1.
10: end while
Output: All speakers' speech signals S = {s_1, s_2, ..., s_N}.

The proposed algorithm is written in Algorithm 1, where T = {T_1, ..., T_N} represents the set of all the pure speech signal data of the speakers waiting to be separated, H = {H_1, ..., H_N} denotes the set of all the mixed speech employed as the training set, O is a mixed speech waiting to be separated, W is a random matrix, and N represents the number of speech signals, i.e., the number of speakers. The subscript i denotes the corresponding speaker in the datasets T and H.

In fact, the pure speech signals T_i are very costly and difficult to obtain. Thus, in the experiments, a speech signal different from the current target speaker is generally selected at random from the dataset to train the CNMF. Although the results obtained by this approach are slightly degraded, the proposed algorithm can be applied to a wider range of scenarios.

In addition, X_i, mentioned in Algorithm 1, is a matrix of the form X_i = [S_i^1, S_i^2]^T, where S_i^1 denotes the speech signal of the target speaker and S = {s_1, ..., s_N} denotes the set of speech signals obtained after the separation of all speakers.
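As a structural illustration only, the following Python sketch mirrors the loop in Algorithm 1; cnmf refers to the CNMF sketch above, while stft, istft, cnmf_separate, and jade_separate are hypothetical helpers standing in for the transform, reconstruction, and JADE steps, not functions defined by the paper.

```python
import numpy as np

def cnmf_jade_pipeline(O, T_set, H_set, stft, istft, cnmf_separate, jade_separate):
    """Illustrative sketch of Algorithm 1 (notation as above, helpers hypothetical).

    O       : 1-D mixed waveform waiting to be separated
    T_set   : [T_1, ..., T_N]  pure training speech per speaker
    H_set   : [H_1, ..., H_N]  mixed training speech
    stft / istft / cnmf_separate / jade_separate : user-supplied routines
    """
    separated = []
    current = O
    for T_i, H_i in zip(T_set, H_set):
        # Step 3: train CNMF bases on this round's training data
        W_i, _ = cnmf(np.abs(stft(np.concatenate([T_i, H_i]))), R=40, T=8)
        # Step 4: separate the current mixture with the trained bases
        # (cnmf_separate is a hypothetical helper that reconstructs the two
        # spectrogram components and inverts them with istft)
        S1, S2 = cnmf_separate(current, W_i, stft, istft)
        # Steps 5-6: remix into a two-channel observation and refine with JADE
        X_i = np.vstack([S1, S2])
        Y1, Y2 = jade_separate(X_i)
        separated.append(Y1)   # Step 8: target speaker's signal this round
        current = Y2           # Step 7: remainder is separated in the next round
    return separated
```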

Given the defects of some existing single-channel speech separation methods, a new algorithm combining CNMF and JADE is proposed in this study. The CNMF is first trained using the training speech signals, and the trained CNMF is used to separate the mixed speech. Next, the separated speech signals are remixed, and a secondary separation is conducted using JADE. In the next section, simulation experiments are performed to verify the performance of the proposed algorithm and compare it with several other algorithms.

3.2. Speech Enhancement

Some noise usually remains in the target speaker's speech after separation, and this interference inevitably reduces the quality and intelligibility of the speech. Suppressing the background noise and extracting the pure speech is therefore an important part of the speech processing pipeline, and speech enhancement techniques should be applied to the target signal after separation. Conventional single-channel speech enhancement techniques comprise spectral subtraction [57], Wiener filtering [58], Kalman filtering [59], the wavelet transform [60], and so on.

However, as reported in some existing studies, the wavelet transform has significant advantages in single-channel speech signal enhancement, and the experiments in this study confirm this point. The wavelet transform is another landmark technique after the Fourier transform: it inherits the advantages of the Fourier transform while overcoming its defects, making it an ideal tool for time-frequency analysis and processing of signals. One feature of the wavelet transform in signal processing is that it can make certain aspects of the signal more prominent, so it can highlight signal details during processing and thus extract the effective signal.

Accordingly, based on the above motivation, the wavelet transform is used as the speech signal enhancement technique in this study, and a more effective adaptive wavelet transform is proposed to enhance the extracted signal.

In the following, the wavelet transform and the adaptive wavelet transform technique proposed in this study are introduced.

3.2.1. Speech Enhancement Based on Wavelet Transform

In the present section, we introduce the wavelet transform to enhance the sensor speech signal. The principle of wavelet transform is described below.

Let $L^2(\mathbb{R})$ be the space of square-integrable functions and $\psi(t) \in L^2(\mathbb{R})$. If its Fourier transform $\Psi(\omega)$ satisfies the admissibility condition in Eq. (9):

$$C_\psi = \int_{-\infty}^{+\infty} \frac{\left|\Psi(\omega)\right|^{2}}{\left|\omega\right|}\,d\omega < \infty, \tag{9}$$

then $\psi(t)$ denotes a basic wavelet or a mother wavelet.

After the mother wavelet $\psi(t)$ is scaled and translated by a real pair $(a, b)$, where $a \neq 0$, a cluster of functions can be yielded:

$$\psi_{a,b}(t) = \frac{1}{\sqrt{\left|a\right|}}\,\psi\!\left(\frac{t-b}{a}\right).$$

This cluster of functions $\psi_{a,b}(t)$ constitutes a wavelet basis, where $a$ represents the scaling factor and $b$ denotes the translation factor. $\psi_{a,b}(t)$ acts as a window function whose window size is fixed but whose shape can change; owing to this characteristic, the wavelet transform provides multiresolution analysis. $1/\sqrt{|a|}$ is a normalization factor, so the wavelets have the same energy at different scales.

Signal processing in the wavelet domain is one of the main methods of speech signal processing. The wavelet transform has the characteristics of multiresolution, low entropy, and decorrelation, which give it significant advantages in speech signal processing. Moreover, the large number of available wavelet bases can theoretically handle different scenarios, so the wavelet transform is significantly useful for speech signal processing.

The main process of wavelet transform denoising is shown in Figure 1.
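A minimal sketch of this denoising process is given below, using the third-party PyWavelets package; the db8 wavelet, the decomposition level, and the universal-threshold rule are illustrative choices, not the paper's stated configuration.

```python
import numpy as np
import pywt   # PyWavelets, one common wavelet-transform library

def wavelet_denoise(x, wavelet='db8', level=4):
    """Classic wavelet-threshold denoising: decompose, soft-threshold the
    detail coefficients, and reconstruct. Wavelet and level are illustrative."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # robust noise estimate
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))           # universal threshold
    coeffs[1:] = [pywt.threshold(c, thr, mode='soft') for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:len(x)]
```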

3.2.2. Speech Enhancement Based on Adaptive Wavelet Transform

As suggested by the experiments in this study, the quality of the enhanced speech signal may actually be reduced when enhancement is applied after speech separation. This result proves that a speech enhancement algorithm cannot denoise properly on all noisy speech. In this section, an adaptive method based on the wavelet transform is presented to analyze the speech signals separated by CNMF+JADE, as an attempt to realize selective speech enhancement, that is, to automatically filter out, before enhancement, those speech segments whose quality would degrade. Analysis of the separated speech signals and of the speech after the wavelet transform indicates that when the difference between the separated speeches is significant, the quality decreases after the wavelet transform, while otherwise it increases. Based on these findings, the following adaptive judgment is applied before speech enhancement:

$$F(s_i) = \begin{cases} 0, & \mathrm{Dist}\left(G(s_i),\, G(\hat{s})\right) \le \lambda \cdot \dfrac{\varepsilon}{n}, \\[4pt] 1, & \text{otherwise}, \end{cases} \tag{11}$$

where $s_i$ denotes the $i$th target speaker speech signal after CNMF+JADE separation, $\hat{s}$ is the mixed speech signal after the CNMF+JADE separation of the mixed speech $O$, $G(s_i)$ and $G(\hat{s})$, respectively, express the Gaussian Mixture Models (GMMs) [61, 62] of $s_i$ and $\hat{s}$, $\varepsilon$ indicates the loss during the separation process, $n$ represents the number of speakers included in the mixed signal, and $\lambda$ is a scaling factor whose value lies in [1, 1.2].

$\mathrm{Dist}(\cdot,\cdot)$ represents the GMM distance, which measures the dispersion between $G(s_i)$ and $G(\hat{s})$, i.e., their coupling degree, as a weighted combination over the mixture components, where $w_k$ is the weight of the $k$th component.

Equation (11) can be explained as follows: when the coupling between $s_i$ and $\hat{s}$ obtained by CNMF+JADE separation is low, no further speech enhancement is performed. In other words, when Eq. (11) yields 0, the separation effect of the CNMF+JADE algorithm is considered good enough that the separated speech contains little noise, and further speech enhancement may be counterproductive. Conversely, when Eq. (11) yields 1, the wavelet transform is considered necessary for the separated speech.

Equation (11) thus adaptively determines which separated signals should be enhanced and which should not, so the separated speech signals can be effectively optimized.
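The following sketch illustrates this adaptive judgment in Python; the GMMs follow the text, but the symmetric cross-likelihood distance and the threshold lam / n_speakers are stand-ins for the paper's exact Dist formula and Eq. (11), which depend on the separation loss.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adaptive_judgment(s_i, s_hat, n_speakers, lam=1.1, n_components=4):
    """Sketch of the adaptive judgment before enhancement (illustrative only)."""
    X_i = s_i.reshape(-1, 1)
    X_hat = s_hat.reshape(-1, 1)
    g_i = GaussianMixture(n_components=n_components, random_state=0).fit(X_i)
    g_hat = GaussianMixture(n_components=n_components, random_state=0).fit(X_hat)
    # symmetric cross-likelihood distance between the two GMMs (average
    # log-likelihood gap when each model scores the other's data)
    d = (g_i.score(X_i) - g_hat.score(X_i)) + (g_hat.score(X_hat) - g_i.score(X_hat))
    return int(d > lam / n_speakers)   # 1 -> enhance again, 0 -> leave as-is
```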

Finally, Figure 2 illustrates the flow of the whole algorithm.

4. Experiment Verification

Owing to the limitations of the experimental conditions, in this section a sensor is simulated to acquire speech data in a speech scene. The basic data used in the experiments originate from the TIMIT dataset, an acoustic-phonetic continuous speech corpus constructed in collaboration among Texas Instruments, MIT, and SRI International. The TIMIT dataset has a speech sampling frequency of 16 kHz and comprises a total of 6300 sentences spoken by 630 individuals from eight major dialect regions of the United States. All sentences were manually segmented and labeled at the phone level. 70% of the speakers are male, and the speakers are primarily white adults. On this basis, multiple scenarios are simulated, and the speech signals are mixed according to the different scenarios.
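For concreteness, a sketch of how such a mixed scene can be simulated from two TIMIT utterances is given below; the soundfile reader, the file paths, and the SNR value are our illustrative assumptions, not the paper's stated setup.

```python
import numpy as np
import soundfile as sf   # third-party reader for the TIMIT audio files

def mix_at_snr(target_path, interferer_path, snr_db):
    """Simulate a single speech sensor: mix two utterances at a given
    target-to-interferer SNR. Paths and SNR are illustrative."""
    s, fs = sf.read(target_path)
    n, fs2 = sf.read(interferer_path)
    assert fs == fs2 == 16000            # TIMIT sampling rate
    L = min(len(s), len(n))
    s, n = s[:L], n[:L]
    # scale the interferer so that 10*log10(Ps / (scale^2 * Pn)) == snr_db
    scale = np.sqrt(np.sum(s ** 2) / (np.sum(n ** 2) * 10 ** (snr_db / 10.0)))
    return s + scale * n, fs
```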

In the first part of the experiments, different algorithms are used to separate the speech signals, and the results are analyzed and compared with the algorithm proposed in this study to show that the proposed CNMF+JADE algorithm is applicable to the analysis and processing of signal data collected by speech sensors. In the second part, the performance of several single-channel speech enhancement techniques is verified, and it is experimentally confirmed that the adaptive wavelet transform technique proposed in this study can effectively enhance the separated speech signals, proving the effectiveness of the proposed method. The experiments are elucidated below.

According to Table 1, a conversation scene was simulated, and two scenario groups were set up. Scenario I contains two speakers and includes three specific scenes. Scenario II contains three speakers, one of whom is the target speaker, and includes four specific scenes. The seven specific scenes are numbered consecutively.

In addition, three scientific evaluation metrics are adopted to evaluate the quality of the separated speech signal. The three metrics and their descriptions are as follows.

(1) Signal-to-Noise Ratio (SNR) [63] is the ratio between the valid signal and the invalid signal (noise). The larger the ratio, the greater the proportion of valid signal and the purer the signal.

(2) Perceptual Evaluation of Speech Quality (PESQ) [64] is an objective, full-reference speech quality assessment method that considers the subjective human perception of speech signals and can provide a subjective predictive value for objective speech quality assessment; it is recognized as an objective reflection of subjective evaluation. The PESQ score ranges over [-0.5, 4.5], and a higher score indicates better speech quality after separation.

(3) Short-Time Objective Intelligibility (STOI) [65], like PESQ, is a common objective evaluation method that conforms to the human auditory system. It represents the actual intelligibility of speech, with values ranging over [0, 1]; the closer the value is to 1, the more easily the separated speech is understood and the higher its intelligibility.
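The three metrics can be computed as in the following sketch; the pesq and pystoi packages are third-party implementations we assume for illustration, not tools named by the paper.

```python
import numpy as np
from pesq import pesq      # third-party PESQ implementation (pip install pesq)
from pystoi import stoi    # third-party STOI implementation (pip install pystoi)

def snr_db(ref, est):
    """SNR (dB) between the reference speech and the separated estimate."""
    noise = ref - est
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

# Usage sketch for time-aligned waveforms ref/est at the TIMIT rate:
# fs = 16000
# print(snr_db(ref, est))
# print(pesq(fs, ref, est, 'wb'))             # PESQ score in about [-0.5, 4.5]
# print(stoi(ref, est, fs, extended=False))   # STOI score in [0, 1]
```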

4.1. Speaker Separation

In the present section, simulation experiments are performed to verify the effectiveness of the proposed algorithm. According to the way the sounds are mixed, speech signal separation falls into monaural and multichannel speech separation. Since multichannel speech signals carry more exploitable information than monaural signals, they are simpler to process. Common multichannel speech separation algorithms are mainly based on Independent Component Analysis (ICA) and have shown good performance. For this reason, ICA was selected as the comparison algorithm for speech separation in this section. However, it should be noted that our simulation experiments are based on single-sensor mixed speech separation, which does not satisfy the application conditions of the ICA algorithm. Thus, in this part of the experiments, we extend the ICA algorithm by combining it with CNMF so that it can be applied to single-channel speech separation and compared with the algorithm proposed in this study. Lastly, the specific methods used in this study are CNMF, CNMF+ICA, and CNMF+JADE.

Table 2 shows the results of the experiments with the different separation methods, with the columns corresponding in turn to the dialogue scenarios simulated above. The values in the table are the evaluation results between the target speaker's separated speech and the original pure speech under the corresponding methods. The data for the MIX method are the three metrics computed on the original mixed speech, and the subsequent rows are the results achieved with CNMF, CNMF+ICA, and CNMF+JADE, respectively. The best experimental result in each scenario is marked in italics.

From the experimental results in the table, we can find that the speech signals processed by all methods are significantly improved compared with the original mixed speech (MIX). In addition, the CNMF+JADE algorithm proposed in this study achieves the best experimental results in almost all scenarios; among the 7 scenarios and 21 metrics, only 4 metrics are worse than those of the other methods (CNMF+ICA), namely, certain SNR and STOI results in two of the scenarios. Moreover, the results evaluated with PESQ are all better than those of the other algorithms, which fully demonstrates the effectiveness of the proposed algorithm.

First, the proposed CNMF+JADE algorithm is compared with the CNMF algorithm, and all of its results are found to outperform those of CNMF, which demonstrates that combining JADE with CNMF is effective. Subsequently, the comparison with the CNMF+ICA algorithm reveals that almost all the results are better than those achieved by CNMF+ICA, indicating that combining JADE with CNMF is a purposeful and more promising combination. The combined experimental results fully illustrate the effectiveness of the proposed algorithm.

4.2. The First Experiment Verification for Enhancement

In this part of the experiments, the performance of several conventional single-channel speech signal enhancement techniques is compared. The signal to be enhanced is the target speaker's signal obtained with the proposed CNMF+JADE method in Section 4.1. The target speech signal is enhanced with each of the four speech enhancement methods separately, and the enhanced speech signal is then evaluated with SNR, PESQ, and STOI. The results are listed in Table 3, where the results of the CNMF+JADE method serve as the baseline to be compared. As before, the columns correspond to the scenarios in Table 1, and each method is evaluated with the three evaluation metrics.

First, comparing the four conventional single-channel speech enhancement methods, the wavelet transform exhibits the best performance. With SNR as the evaluation index, the wavelet transform achieves the best results in all seven scenarios. With STOI as the evaluation index, it achieves the best results in six scenarios; only in one scenario are its results slightly lower than those of the Wiener filtering method, and the difference is slight, 0.82 versus 0.83. With PESQ as the evaluation index, four of the seven scenarios achieve the best results. These comprehensive results indicate that enhancing the speech signal separated by the CNMF+JADE algorithm with the wavelet transform is very effective, which is one of the motivations for choosing the wavelet transform as the speech enhancement method in this study.

In addition, the results obtained with the wavelet transform are compared with those obtained without any speech enhancement. It can be found that not all speech quality is improved after enhancement. For instance, in one scenario, the speech quality obtained after applying the wavelet transform decreases across all metrics. Overall, of the 21 results across the 7 scenarios, a total of 11 results with the wavelet transform are better than those achieved without the speech enhancement method.

It can therefore be concluded that, although the purpose of speech enhancement is to remove the noise in a speech segment and thus improve speech quality, the enhancement process also corrupts the speech signal to a certain extent, so speech quality is not necessarily better after enhancement.

Thus, it is very important and necessary to adaptively select the speech signals that should be enhanced instead of blindly enhancing all signals. For this reason, this study proposes an adaptive wavelet transform method that adaptively selects the speech signals to be enhanced and filters out those that do not require enhancement. The specific experimental validation is presented in the next section.

4.3. The Second Experiment Verification for Enhancement

In this part of the experiments, the adaptive wavelet transform enhancement method proposed in this study is validated. Again, the signal to be enhanced is the speech signal obtained after separation with the CNMF+JADE method, and the results for the three metrics are verified separately. The results are listed in Tables 4-6 and fall into three parts: CNMF+JADE gives the results without enhancement, CNMF+JADE+wavelet transform gives the results with the wavelet transform, and the adaptive wavelet transform method proposed here judges whether the speech signal in each scene should be enhanced. From the results in Tables 4-6, we can see that a judgment of 0 marks the cases where the results with wavelet transform enhancement decreased relative to no enhancement, while a judgment of 1 marks the cases where they improved.

It is demonstrated through the experiments that our adaptive judgment method can filter out the speech segments whose quality would be degraded by the wavelet transform. As revealed by the results, the proposed adaptive wavelet transform speech enhancement method can automatically filter out the speech segments that are not suitable for enhancement, thus effectively improving the quality of the final speech signal.

4.4. Compared with the Deep Learning

In recent years, with the development of deep learning, researchers have noticed that the nonlinear processing and feature learning capabilities of deep models have significant advantages in addressing speech separation problems. Thus, in this part of the experiments, we implemented a recurrent deep stacking network [66] to perform separation processing of the acquired speech signals. In [66], the speech separation results of various deep neural networks are compared, which is close to the work in this study. We use two metrics, PESQ and STOI, to evaluate the quality of the separated speech and compare the performance of the proposed algorithm with that of deep learning algorithms. By comparing the results of these speech separation methods, we can examine the advantages and disadvantages of the shallow and deep models.

The experimental results of the proposed algorithm and the deep learning algorithms are shown in Table 7. From the results, we can see that there is still a gap between the method proposed in this study and the deep learning methods. In terms of the PESQ index, the improvement of RDSN is obviously better than that of the method in this study. With STOI as the evaluation index, the optimal value of the proposed method is 0.106, the same as the experimental result of DDN, and the difference from the result of RDSN is small, only 0.006.

As indicated by a comprehensive analysis of the experimental results, the deep model outperforms the shallow model in the supervised case. However, the deep model requires considerable training data, and a large amount of speech data is very difficult to obtain. In addition, the deep model is more expensive to train, and it is difficult to achieve small-sample, unsupervised speech separation in complex scenarios. The speech separation algorithm proposed in this study can satisfy the needs of small-sample, unsupervised speech separation. Moreover, the total computational overhead of the shallow model is smaller than that of the deep model, so the shallow model is more suitable for application scenarios with high real-time requirements. Taking the comparison of the two models together, the algorithm proposed in this study is considered more suitable for target speaker speech extraction in complex multispeaker scenarios.

5. Conclusion

The development of IoT technology promotes the rapid development of intelligent voice systems, making the efficient processing of signal data acquired by speech sensors imminent. Thus, an unsupervised speech separation algorithm based on the combination of CNMF and JADE is proposed in this study. Simulation experiments demonstrate that the proposed algorithm can effectively separate the target speech signals contained in mixed speech signals. In addition, because the separated speech signal may still be weak and contain residual noise, this study also proposes an adaptive wavelet transform method to enhance the separated speech signal. As revealed by the results, the proposed algorithm can enhance the separated speech signals. The comprehensive experimental results prove that the proposed algorithm is very competitive for the single-channel mixed speech separation problem. The algorithm is highly versatile and robust and can technically support other researchers in processing highly noisy signal data collected by sensors.

Speech separation, especially single-channel speech separation, has long been a hotspot and a difficult research area. As IoT technology is developed and applied, separating high-quality speech signals has become an urgent task. Speech signals exhibit obvious spatiotemporal structures and nonlinear relationships, while most conventional speech separation methods are shallow structures whose ability to exploit such nonlinear structural information is limited. In recent years, as deep learning advances, it has been suggested that the nonlinear processing and feature learning capabilities of deep models have obvious advantages in addressing speech separation problems, and some results of processing speech signals with deep learning have been published. As deep learning computing leaps forward, deep models (e.g., DNN, DSN, CNN, RNN, Deep NMF, and LSTM) will become even more competitive in speech separation problems. In the future, the use of deep learning techniques in speech separation will certainly become a research hotspot.

Data Availability

We use the TIMIT dataset, which can be found at https://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3/tech&hit=1&filelist=1.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61902232, 61902231), the Natural Science Foundation of Guangdong Province (2019A1515010943), the Key Project of Basic and Applied Basic Research of Colleges and Universities in Guangdong Province (Natural Science) (2018KZDXM035), the Basic and Applied Basic Research of Colleges and Universities in Guangdong Province (Special Projects in Artificial Intelligence) (2019KZDZX1030), and the 2020 Li Ka Shing Foundation Cross-Disciplinary Research Grant (2020LKSFG04D).