Abstract

Gene splicing site recognition is an important research topic in smart healthcare. It matters both for the large-scale, high-quality computational annotation of genomes and for the analysis of the evolutionary process of gene sequences, so a reliable and effective recognition algorithm is needed. The traditional Twin Support Vector Machine (TWSVM) algorithm has advantages in solving small-sample, nonlinear, and high-dimensional problems, but it does not handle parameter selection well. To avoid blind parameter selection, the particle swarm optimization algorithm is used to find the optimal parameters of the twin support vector machine. On this basis, a Particle Swarm Optimization Twin Support Vector Machine (PSO-TWSVM) algorithm for gene splicing site recognition is proposed in this paper. The proposed algorithm is compared with the traditional Support Vector Machine algorithm, the TWSVM algorithm, and the Least Squares Support Vector Machine algorithm. The results show that the positive sample recognition rate, negative sample recognition rate, and correlation coefficient (CC) of the proposed algorithm are the best among the four support vector machine algorithms. The proposed algorithm improves the recognition rate and the accuracy for splice sites, and the comparison experiments verify its feasibility.

1. Introduction

With the development of big data, the Internet of Things, cloud computing, artificial intelligence, and other information technologies, smart healthcare has emerged [1]. Among the many research fields of smart healthcare, gene splicing site recognition is a very important one. Driven by the Human Genome Project (HGP), the volume of data in molecular biology has exploded [2]. In addition, the development of high-throughput sequencing technologies and the falling cost of sequencing have led to the accumulation of massive genome data and to its rapid growth [3]. Biological genetic resources are now greatly enriched, but new challenges have also arisen. Traditional recognition methods no longer meet the current needs of processing massive genome data. Using computers to study genome data with the help of machine learning theory has become a research trend, and the application of machine learning in the field of genes has become an important means of promoting smart healthcare.

Genome research has entered the postgenome era, and the focus of gene research has gradually shifted from gene sequencing to gene expression analysis [4]. Accurate recognition of splicing sites from gene sequences is very important for understanding gene expression: if the splicing sites in the gene sequences of eukaryotic organisms are accurately recognized, the expressed and nonexpressed regions of genes can be distinguished. A splicing site is a short, highly conserved sequence located at the junction between an exon and an intron in the eukaryotic cell, as shown in Figure 1. Most splicing sites conform to the “GT-AG” rule [5, 6]. At present, a difficult problem is how to recognize the correct splicing sites from a large amount of gene data.

The recognition of gene splicing sites is a sequence recognition problem at the intersection of bioinformatics and sequence recognition. Researchers have applied many machine learning algorithms to recognize gene splicing sites and have made great progress. Sharma et al. [7] proposed an acceptor site recognition method that combined the Adaptive Short Time Fourier Transform (ASTFT), a period-3 measure, and principal component analysis. Morfino et al. [8] addressed splicing site recognition in DNA sequences by using supervised machine learning algorithms included in the MLlib library of Apache Spark, a fast and general engine for big data processing. L. Wang et al. [9] proposed a splicing site recognition algorithm based on an improved SVM, in which the recognition sensitivity of the SVM was optimized by kernel principal component analysis. R. Wang et al. [10] designed SpliceFinder, based on a Convolutional Neural Network (CNN), to recognize splice sites. They trained the network on human genomic data and reconstructed the dataset iteratively, which tackled the data imbalance problem and forced the model to learn more features of splice sites. Pashaei and Aydin [11] proposed Markovian encoding models for human splice site recognition using the SVM algorithm and developed MMSVM, a web tool to recognize splice sites in any human sequence. Javier et al. [12] presented a methodology for combining many sources of information to recognize any functional site using ‘floating search,’ a powerful heuristic applicable when the cost of evaluating each solution is high. Waseem et al. [13] developed a computational model for splicing site identification. Three feature extraction methods (DNC, TNC, and TetraNC) were employed to extract features from DNA sequences, and the extracted features were then combined into a composite feature space. Scalzitti et al. [14] developed Spliceator to predict splice sites in a wide range of species, including model and nonmodel organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. Zhang et al. [15] introduced DeepPN, a deep parallel neural network constructed from a convolutional neural network (CNN) and a graph convolutional network (GCN) for detecting RBP binding sites. Ghazanfari et al. [16] used valuable information such as isoform sequences, expression profiles, and gene ontology graphs and proposed a comprehensive model based on deep neural networks. Shen et al. [17] proposed CircCNN to predict pre-mRNA back-splicing sites; a convolutional neural network and batch normalization were its main parts. However, although these algorithms recognize gene splicing sites to some extent, their computational cost and recognition accuracy are still not satisfactory.

The Twin Support Vector Machine (TWSVM) is a machine learning algorithm based on statistical learning theory [18]. As a variant of SVM, the TWSVM algorithm inherits the excellent learning ability of the SVM algorithm and runs about four times faster. However, the TWSVM algorithm still has shortcomings, one of which is that it does not handle parameter selection well. Parameter selection has a great influence on recognition results, and inappropriate parameters seriously reduce the recognition accuracy of the TWSVM algorithm for gene splicing sites. The Particle Swarm Optimization (PSO) algorithm is effective for parameter optimization. Therefore, the PSO algorithm is combined with the TWSVM algorithm in this paper to overcome this shortcoming, and an improved gene splicing site recognition algorithm based on the Particle Swarm Optimization Twin Support Vector Machine (PSO-TWSVM) is proposed. The proposed algorithm not only solves the problem of parameter selection but also further improves the prediction accuracy of the traditional algorithm. Firstly, splicing site recognition is regarded as a binary classification problem, and recognition is completed by analyzing the sequence characteristics near the splicing sites. Then, to address the difficulty of setting the parameters of the TWSVM algorithm, the PSO algorithm is used to optimize them, which further improves the recognition performance for splicing sites.

The rest of this paper is organized as follows. Section 2 introduces background information, including the TWSVM algorithm and the PSO algorithm. Section 3 proposes the improved PSO-TWSVM algorithm, which overcomes the shortcomings of TWSVM, and presents its recognition steps for gene splicing sites. Section 4 reports comparative experiments on the proposed PSO-TWSVM algorithm, the traditional SVM algorithm, the TWSVM algorithm, and the Least Squares Support Vector Machine (LSSVM) algorithm and analyzes the experimental results. Finally, Section 5 concludes the paper and outlines future research work.

2.1. TWSVM Algorithm

On the basis of the Proximal Support Vector Machine based on Generalized Eigenvalues (GEPSVM), the TWSVM algorithm was proposed by Jayadeva et al. in 2007 [19]. Different from the traditional SVM algorithm, the TWSVM algorithm aims to find a pair of nonparallel hyperplanes by solving two small quadratic programming problems [20, 21]. In practice, the recognition of gene samples is a very difficult problem, and the linear TWSVM algorithm cannot accurately analyze the feature space of gene data. A kernel function is therefore introduced into the TWSVM algorithm to solve the nonlinear analysis and recognition problem [22].

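Assume that $T = \{(x_i, y_i)\}_{i=1}^{m}$ represents the training sample set in the real space $\mathbb{R}^n$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$. $m$ is the total number of samples, and $m = m_1 + m_2$, where $m_1$ is the number of positive samples and $m_2$ is the number of negative samples. Then, the nonlinear hyperplanes of the TWSVM algorithm are calculated as follows [23].

\[ K(x^T, C^T)u_1 + b_1 = 0, \qquad K(x^T, C^T)u_2 + b_2 = 0, \tag{1} \]

where $u_k$ ($k = 1, 2$) is the normal vector of the hyperplane, $b_k$ is the offset, and $K(\cdot, \cdot)$ is the kernel function. The Gaussian radial basis function is used as the kernel function of the TWSVM algorithm. $C = [A^T \; B^T]^T$, where $A \in \mathbb{R}^{m_1 \times n}$ is the positive sample matrix, and $B \in \mathbb{R}^{m_2 \times n}$ is the negative sample matrix. Then, the planes dividing positive samples and negative samples are obtained by solving two quadratic programming problems [24].

\[ \min_{u_1, b_1, \xi} \; \frac{1}{2}\left\| K(A, C^T)u_1 + e_1 b_1 \right\|^2 + c_1 e_2^T \xi \quad \text{s.t.} \quad -\left( K(B, C^T)u_1 + e_2 b_1 \right) + \xi \ge e_2, \; \xi \ge 0, \tag{2} \]

\[ \min_{u_2, b_2, \eta} \; \frac{1}{2}\left\| K(B, C^T)u_2 + e_2 b_2 \right\|^2 + c_2 e_1^T \eta \quad \text{s.t.} \quad \left( K(A, C^T)u_2 + e_1 b_2 \right) + \eta \ge e_1, \; \eta \ge 0, \tag{3} \]

where $\xi$ and $\eta$ are slack vectors, $e_1$ and $e_2$ are all-ones vectors of dimensions $m_1$ and $m_2$, and $c_1$ and $c_2$ are the penalty factors, which control the degree of punishment for misclassified samples. To simplify Formulas (2) and (3), the dual transformation is carried out.

\[ \max_{\alpha} \; e_2^T \alpha - \frac{1}{2}\alpha^T R \left( S^T S \right)^{-1} R^T \alpha \quad \text{s.t.} \quad 0 \le \alpha \le c_1 e_2, \tag{4} \]

\[ \max_{\gamma} \; e_1^T \gamma - \frac{1}{2}\gamma^T S \left( R^T R \right)^{-1} S^T \gamma \quad \text{s.t.} \quad 0 \le \gamma \le c_2 e_1, \tag{5} \]

where $S = [K(A, C^T) \;\; e_1]$ and $R = [K(B, C^T) \;\; e_2]$. The following results are obtained by solving Formulas (4) and (5).

\[ \left[ u_1^T, \; b_1 \right]^T = -\left( S^T S \right)^{-1} R^T \alpha, \tag{6} \]

\[ \left[ u_2^T, \; b_2 \right]^T = \left( R^T R \right)^{-1} S^T \gamma. \tag{7} \]

According to Formulas (6) and (7), the hyperplanes are obtained, and the decision function is shown in Formula (8): a sample is assigned to the class whose hyperplane is nearer.

\[ \operatorname{class}(x) = \arg\min_{k = 1, 2} \left| K(x^T, C^T)u_k + b_k \right|. \tag{8} \]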
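The decision rule can be illustrated with a minimal Python sketch. It assumes the standard nonlinear TWSVM rule, in which a sample is assigned to the class whose kernel-generated surface lies nearer; the function names are hypothetical, and the toy parameter values are hand-picked for illustration rather than obtained by solving the quadratic programming problems.

```python
import math


def rbf_kernel(x, z, sigma):
    """Gaussian RBF kernel exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def surface_value(x, train_points, u, b, sigma):
    """Evaluate K(x, C) u + b for one kernel-generated surface."""
    return sum(rbf_kernel(x, c, sigma) * u_j
               for c, u_j in zip(train_points, u)) + b


def twsvm_predict(x, train_points, u1, b1, u2, b2, sigma):
    """Return +1 if x is nearer to the positive-class surface, else -1."""
    d1 = abs(surface_value(x, train_points, u1, b1, sigma))
    d2 = abs(surface_value(x, train_points, u2, b2, sigma))
    return 1 if d1 <= d2 else -1


# Toy setup (assumption): two 1-D training points; surface 1 passes
# through x = 0 and surface 2 passes through x = 1.
C = [[0.0], [1.0]]
u1, b1 = [1.0, 0.0], -1.0
u2, b2 = [0.0, 1.0], -1.0

label_near_zero = twsvm_predict([0.1], C, u1, b1, u2, b2, sigma=1.0)
label_near_one = twsvm_predict([0.9], C, u1, b1, u2, b2, sigma=1.0)
print(label_near_zero, label_near_one)
```

In a real model, the weight vectors and offsets would come from the solved dual problems rather than being set by hand.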

Compared with the SVM algorithm, the TWSVM algorithm has better recognition ability. When the data are unbalanced, that is, the numbers of positive samples and negative samples differ greatly, SVM often fails to achieve the expected results. TWSVM still performs well in this case, mainly because it has two penalty factors, one for each hyperplane, which adjust the penalty degree of the two hyperplanes [25]. The recognition results of the TWSVM algorithm are closer to the true values of the sequence. In addition, because the TWSVM algorithm solves two smaller quadratic programming problems instead of one large one, its computing efficiency is higher. The time complexity of the SVM algorithm is $O(m^3)$, while that of the TWSVM algorithm is $O(2 \times (m/2)^3)$, where $m$ is the number of samples and the two classes are assumed to be of roughly equal size. Since $2 \times (m/2)^3 = m^3/4$, the computing overhead of the TWSVM algorithm is about $1/4$ of that of the SVM algorithm.

2.2. PSO Algorithm

The Particle Swarm Optimization (PSO) algorithm was proposed by Eberhart, an American electrical engineer, and Kennedy, a social psychologist, in 1995 [26]. The PSO algorithm simulates the foraging behavior of birds: starting from random solutions, it searches for the optimum by iteratively updating the velocity and position of particles [27]. Particles change their flight velocity and position by learning from their own experience (personal optimization) and from social experience (global optimization) [28].

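Assume $N$ represents the population size of the search space. Then, the position of the $i$th particle is expressed as $X_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$ and the velocity of the $i$th particle is expressed as $V_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$, where $i = 1, 2, \ldots, N$ and $D$ is the dimension of the search space. The updating formulas of the $i$th particle are shown below.

\[ V_i^{t+1} = \omega V_i^{t} + a_1 r_1 \left( P_i^{t} - X_i^{t} \right) + a_2 r_2 \left( G^{t} - X_i^{t} \right), \tag{9} \]

\[ X_i^{t+1} = X_i^{t} + V_i^{t+1}, \tag{10} \]

where $V_i^{t}$ and $X_i^{t}$ are the velocity and the position of the $i$th particle in iteration $t$, respectively, and $P_i^{t}$ and $G^{t}$ are the positions of the personal optimal particle and the population optimal particle in iteration $t$, respectively. $\omega$ is the inertia weight, $a_1$ and $a_2$ are learning coefficients, and $r_1$ and $r_2$ are random numbers uniformly distributed in $[0, 1]$. $a_1$ and $r_1$ regulate the maximum stride length of particles flying toward the personal optimum, while $a_2$ and $r_2$ regulate the maximum stride length of particles flying toward the global optimum.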

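To effectively adjust the global and local search capabilities of the algorithm, a linear reduction of the inertia weight $\omega$ is widely used.

\[ \omega = \omega_{\max} - \left( \omega_{\max} - \omega_{\min} \right) \frac{t}{T}, \tag{11} \]

where $t$ is the current number of iterations, $T$ is the total number of iterations, and $\omega_{\max}$ and $\omega_{\min}$ are the maximum weight and the minimum weight, respectively.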
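The update rules and the linearly decreasing inertia weight can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the default weights 0.9 and 0.4 are common choices rather than values given in the text, and the random numbers are passed in explicitly (one scalar per term) so that the arithmetic is reproducible.

```python
def inertia_weight(t, t_max, w_max=0.9, w_min=0.4):
    """Linearly decrease the inertia weight from w_max to w_min.

    The defaults 0.9 and 0.4 are assumptions, not values from the paper.
    """
    return w_max - (w_max - w_min) * t / t_max


def update_particle(x, v, p_best, g_best, w, a1, a2, r1, r2):
    """Apply one velocity update followed by one position update."""
    new_v = [w * vd + a1 * r1 * (pd - xd) + a2 * r2 * (gd - xd)
             for xd, vd, pd, gd in zip(x, v, p_best, g_best)]
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v


# One-dimensional worked example:
# v' = 0.5*1.0 + 2.0*0.5*(1.0 - 0.0) + 2.0*0.25*(2.0 - 0.0) = 2.5
new_x, new_v = update_particle(
    x=[0.0], v=[1.0], p_best=[1.0], g_best=[2.0],
    w=0.5, a1=2.0, a2=2.0, r1=0.5, r2=0.25,
)
print(new_x, new_v)
```

A large weight early in the run favors global exploration, and a small weight late in the run favors local refinement around the best positions found so far.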

3. PSO-TWSVM Algorithm for Gene Splicing Site Recognition

3.1. PSO-TWSVM Algorithm

The TWSVM algorithm has advantages in solving small-sample, nonlinear, and high-dimensional problems. However, the parameter selection of the TWSVM algorithm is a difficult problem. According to the calculation process of the TWSVM algorithm, there are three parameters to be determined: penalty factor $c_1$, penalty factor $c_2$, and the kernel parameter $\sigma$ of the Gaussian radial basis function. The Gaussian radial basis kernel function is calculated as follows.

\[ K(x_i, x_j) = \exp\left( -\frac{\left\| x_i - x_j \right\|^2}{2\sigma^2} \right). \tag{12} \]

Once the kernel function is determined, finding appropriate values of $c_1$, $c_2$, and $\sigma$ is very important to the performance of the TWSVM algorithm, because these values directly affect the recognition accuracy of gene splicing sites. The effects of the parameters on the TWSVM algorithm are analyzed below.

3.1.1. Penalty Factors $c_1$ and $c_2$.

The function of penalty factors $c_1$ and $c_2$ is to strike a compromise between the confidence risk and the empirical risk of the TWSVM algorithm, so that its generalization ability is the best [29]. When the values of $c_1$ and $c_2$ are small, the penalty on the empirical error is small. At this point, the complexity of the TWSVM algorithm is low, and its data fitting degree is weak. On the contrary, when the values of $c_1$ and $c_2$ are large, the complexity of the TWSVM algorithm is high, and its data fitting degree is strong. In that case, however, overfitting occurs easily, and the generalization ability of the TWSVM algorithm is weak.

3.1.2. The Kernel Function Parameter $\sigma$.

The kernel function parameter $\sigma$ determines the learning degree of the TWSVM algorithm and affects the distribution complexity of sample data in the high-dimensional space [30]. When the value of $\sigma$ is small, the output interval corresponding to the sample is small, and the empirical risk corresponding to the optimal hyperplane is small. At this point, the structural risk of the TWSVM algorithm is large, and the algorithm is prone to overfitting. On the contrary, when the value of $\sigma$ is large, the kernel responds almost identically to all samples, so the model is too smooth to separate the classes and also does not generalize well.
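The influence of the kernel parameter can be checked numerically. The following Python sketch assumes the common Gaussian form exp(-||x - z||^2 / (2 sigma^2)); the function name and the three sigma values are chosen only for illustration.

```python
import math


def gaussian_rbf(x, z, sigma):
    """Gaussian RBF kernel value for two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


x = [1.0, 0.0, 0.0, 0.0]  # one-hot code of "A"
z = [0.0, 1.0, 0.0, 0.0]  # one-hot code of "G"

# Squared distance is 2, so the kernel value is exp(-1 / sigma^2).
for sigma in (0.1, 1.0, 10.0):
    print(sigma, gaussian_rbf(x, z, sigma))
```

With a very small sigma, two different samples have a similarity close to 0, so each training sample is effectively treated in isolation; with a very large sigma, the similarity approaches 1 for every pair, so samples become hard to tell apart.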

The parameters of the TWSVM algorithm play a key role in the stability and generalization of the recognition model. Cross-validation is usually adopted for parameter selection in the TWSVM algorithm, but the effect is not ideal [31]. The parameter selection of the TWSVM algorithm is a dynamic optimization process. Fortunately, the PSO algorithm can find the global optimal solution with high probability in the process of multiobjective optimization. Besides, the structure of PSO is simple, and its calculation efficiency, solution accuracy, and robustness are good. Therefore, the Particle Swarm Optimization Twin Support Vector Machine (PSO-TWSVM) algorithm is proposed, which uses the PSO algorithm to optimize the parameters of TWSVM and thereby further improves the recognition accuracy of splicing sites.

The main idea of the PSO-TWSVM algorithm is to select appropriate values of penalty factor $c_1$, penalty factor $c_2$, and the kernel function parameter $\sigma$ of TWSVM by PSO optimization, and then to establish the PSO-TWSVM model from the optimal parameters and the training samples. The PSO-TWSVM algorithm takes the penalty factors and the kernel function parameter as the initial positions of particles. After the optimal parameters are found by the PSO algorithm, the mathematical model of TWSVM for gene splicing site recognition is established. The flow chart of the PSO-TWSVM algorithm is shown in Figure 2. The calculation steps of the PSO-TWSVM algorithm are as follows.

Step 1. A particle swarm is initialized. The position of the $i$th particle is expressed as a three-dimensional vector $X_i = (x_{i1}, x_{i2}, x_{i3})$, where $x_{i1}$ and $x_{i2}$ represent the two penalty factors $c_1$ and $c_2$, respectively, and $x_{i3}$ represents the kernel function parameter $\sigma$.

Step 2. A preliminary TWSVM model is built on the training samples for each particle, and the estimated error of each particle is taken as its fitness value.

Step 3. According to the particle fitness, the personal optimal value and the global optimal value are updated.

Step 4. The velocity and the position of each particle in the swarm are updated.

Step 5. Determine whether the termination condition is met, that is, whether the maximum number of iterations is reached: If it is met, the optimal parameter value of TWSVM algorithm is obtained; otherwise, return to Step 2.

Step 6. The PSO-TWSVM model is established by using the optimal parameters and training samples.
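The six steps above can be sketched as a small Python search loop. This is a minimal sketch under explicit assumptions, not the authors' implementation: `pso_search` and `surrogate_error` are hypothetical names, the surrogate stands in for the TWSVM estimated error (which would require training a model per particle), and the search bounds, inertia weights, and velocity clamp are illustrative choices. The population size of 40, the acceleration coefficient of 2.0, and the 200 iterations follow the experimental settings reported in Section 4.2.

```python
import random


def pso_search(fitness, bounds, n_particles=40, n_iter=200, accel=2.0,
               w_max=0.9, w_min=0.4, seed=0):
    """Minimize `fitness` over box `bounds` with a basic PSO loop."""
    rng = random.Random(seed)
    dim = len(bounds)
    v_max = [0.2 * (hi - lo) for lo, hi in bounds]  # assumed velocity clamp

    # Step 1: initialize positions (c1, c2, sigma) and velocities.
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds]
          for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]

    # Steps 2-3: evaluate fitness, record personal and global bests.
    p_best = [x[:] for x in xs]
    p_val = [fitness(x) for x in xs]
    g_idx = min(range(n_particles), key=p_val.__getitem__)
    g_best, g_val = p_best[g_idx][:], p_val[g_idx]

    # Step 5: stop after the maximum number of iterations.
    for t in range(n_iter):
        w = w_max - (w_max - w_min) * t / n_iter
        for i in range(n_particles):
            # Step 4: update velocity and position, kept inside bounds.
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                v = (w * vs[i][d]
                     + accel * r1 * (p_best[i][d] - xs[i][d])
                     + accel * r2 * (g_best[d] - xs[i][d]))
                vs[i][d] = max(-v_max[d], min(v_max[d], v))
                xs[i][d] = max(lo, min(hi, xs[i][d] + vs[i][d]))
            # Steps 2-3 again: bests are updated as soon as they improve.
            val = fitness(xs[i])
            if val < p_val[i]:
                p_best[i], p_val[i] = xs[i][:], val
                if val < g_val:
                    g_best, g_val = xs[i][:], val
    return g_best, g_val


def surrogate_error(params):
    """Hypothetical stand-in for the TWSVM error; minimum at (2, 3, 1.5)."""
    c1, c2, sigma = params
    return (c1 - 2.0) ** 2 + (c2 - 3.0) ** 2 + (sigma - 1.5) ** 2


best_params, best_error = pso_search(surrogate_error, [(0.01, 10.0)] * 3)
print(best_params, best_error)
```

For Step 6, the surrogate would be replaced by a function that trains TWSVM with the candidate parameters and returns its estimated error, and the final model would be trained once more with `best_params`.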

3.2. Application of PSO-TWSVM Algorithm for Gene Splicing Site Recognition

The proposed PSO-TWSVM algorithm is applied for gene splicing site recognition. First, the PSO-TWSVM model is trained using the training samples of gene splicing sites. Then, the trained PSO-TWSVM model is used to recognize the test samples of unknown gene splicing sites. Gene splicing site recognition model based on PSO-TWSVM algorithm is shown in Figure 3. The recognition steps of PSO-TWSVM algorithm for gene splicing sites are shown as follows.

Step 1. The gene data are divided into training samples and test samples.

Step 2. Both the training samples and the test samples are coded using One-Hot Encoding, which converts the gene data into numerical representations that downstream machine learning algorithms can process. The basic principle of One-Hot Encoding is to convert a variable into a binary representation containing only 0 and 1, and it is widely used in biological sequence processing. Since each DNA sequence is composed of the four deoxynucleotides adenine (A), guanine (G), cytosine (C), and thymine (T), a four-digit One-Hot Encoding is adopted. The deoxynucleotides “A,” “G,” “C,” and “T” are encoded as “1000,” “0100,” “0010,” and “0001,” respectively.

Step 3. Impurity items in both the training samples and the test samples are neutralized. A few impurity items other than “A,” “G,” “C,” and “T,” such as “N,” “Q,” and “W,” may be mixed into the gene data. Deleting them outright would change the length of the sequence and the positions of the remaining nucleotides, which may distort the gene data and add meaningless workload. Therefore, the impurity items are represented by the One-Hot Encoding “0000,” which removes their influence while keeping the data structure consistent.

Step 4. Training samples are used to train the PSO-TWSVM model, so as to obtain the correct PSO-TWSVM model suitable for this set of gene data.

Step 5. Test samples are used to test the recognition performance of PSO-TWSVM model for such gene data.
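The encoding in Steps 2 and 3 can be sketched in Python as follows. The function and constant names are hypothetical, and converting the input to upper case is an added assumption; the code table itself follows the mapping stated in Step 2, with every other symbol mapped to “0000” as in Step 3.

```python
# Code table from Step 2; any other symbol is treated as an impurity item.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "G": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
IMPURITY = (0, 0, 0, 0)


def one_hot_encode(sequence):
    """Return a flat list with four binary digits per nucleotide."""
    encoded = []
    for base in sequence.upper():
        encoded.extend(ONE_HOT.get(base, IMPURITY))
    return encoded


example = one_hot_encode("AGCTN")
print(example)
```

Because impurity items are encoded rather than deleted, every sequence of 140 nucleotides yields a feature vector of the same length, 560.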

4. Comparative Experiments of Gene Splicing Site Recognition

4.1. Description of the Dataset

The experiments were performed on the well-known benchmark Homo Sapiens Splice Site Dataset (HS3D), which is available at URL: http://www.sci.unisannio.it/docenti/, to assess the recognition performance of the proposed PSO-TWSVM algorithm. HS3D is the sequence dataset of gene splicing sites extracted from the GenBank DNA sequence database by Pollastro from Italy [32]. The length of each DNA sequence in the dataset is 140 nucleotides. The DNA sequences follow the GT-AG rule: in exon-to-intron sequences, GT is located at sites 71 and 72, and in intron-to-exon sequences, AG is located at sites 69 and 70. The dataset is divided into four parts: true exon to intron, false exon to intron, true intron to exon, and false intron to exon.

In the experiment, 12,600 DNA fragments that met the GT-AG rules were selected from the database, including 3,200 splicing acceptor sites for training, 3,200 splicing acceptor sites for testing, 3,100 splicing donor sites for training, and 3,100 splicing donor sites for testing. The experimental sample data parameters are shown in Table 1. The samples were coded using One-Hot Encoding in order of 140 nucleotides.
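A check of the GT-AG rule on a 140-nucleotide window can be sketched in Python. The function names are hypothetical, and the sketch assumes that GT at sites 71 and 72 marks donor (exon to intron) candidates while AG at sites 69 and 70 marks acceptor (intron to exon) candidates, with sites counted from 1.

```python
SEQ_LENGTH = 140  # window length stated for the dataset


def is_donor_candidate(seq):
    """True if the window has GT at 1-based sites 71-72."""
    return len(seq) == SEQ_LENGTH and seq[70:72].upper() == "GT"


def is_acceptor_candidate(seq):
    """True if the window has AG at 1-based sites 69-70."""
    return len(seq) == SEQ_LENGTH and seq[68:70].upper() == "AG"


# Synthetic windows built only to exercise the checks (not dataset records).
donor_like = "A" * 70 + "GT" + "A" * 68
acceptor_like = "C" * 68 + "AG" + "C" * 70

print(is_donor_candidate(donor_like), is_acceptor_candidate(acceptor_like))
```

Such a check only confirms that a window is a candidate; separating true sites from false ones with the same dinucleotide is the task of the classifier.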

4.2. Analysis of Experimental Results

The experimental parameters were set as follows: the population size was 40, the acceleration coefficient was 2.0, and the maximum number of iterations was 200. The optimal parameters (i.e., penalty factor $c_1$, penalty factor $c_2$, and the kernel parameter $\sigma$ of the Gaussian radial basis function) were calculated by the PSO algorithm, and these optimal parameters, together with the training samples of gene splicing sites, were then used to build the model for testing. The proposed PSO-TWSVM algorithm was compared with the traditional SVM algorithm, the TWSVM algorithm, and the Least Squares Support Vector Machine (LSSVM) algorithm.

In the experiment, the positive sample recognition rate, the negative sample recognition rate, and the correlation coefficient (CC) were used to evaluate the recognition effects of the four splicing site recognition models. The calculation formula of $CC$ is shown as follows [33].

\[ CC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \tag{13} \]

where $TP$ is the number of positive samples that are correctly recognized, $TN$ is the number of negative samples that are correctly recognized, $FN$ is the number of positive samples that are wrongly recognized, and $FP$ is the number of negative samples that are wrongly recognized.
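The three evaluation measures can be computed from the four counts as in the Python sketch below. The function name is hypothetical, and the two recognition rates are assumed to be the usual per-class rates (correctly recognized samples of a class divided by all samples of that class), which the text does not spell out. The counts in the example are invented for illustration and are not results from the experiments.

```python
import math


def recognition_metrics(tp, tn, fp, fn):
    """Return (positive rate, negative rate, correlation coefficient)."""
    positive_rate = tp / (tp + fn)  # assumed definition: TP / (TP + FN)
    negative_rate = tn / (tn + fp)  # assumed definition: TN / (TN + FP)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return positive_rate, negative_rate, cc


# Hypothetical counts for 100 positive and 100 negative test samples.
sn, sp, cc = recognition_metrics(tp=90, tn=80, fp=20, fn=10)
print(sn, sp, round(cc, 4))
```

Unlike the two rates, the correlation coefficient combines all four counts in a single value between -1 and 1, so it is less easily inflated by a classifier that favors one class.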

The recognition results of different support vector machines on acceptor sites are shown in Table 2. The recognition results of different support vector machines on donor sites are shown in Table 3.

It can be seen from Tables 2 and 3 that the $CC$ values of the four different support vector machines for gene splicing site recognition all exceeded 76%, which verifies the excellent performance of support vector machines in the field of sequence recognition. The positive sample recognition rates, negative sample recognition rates, and $CC$ values of the proposed PSO-TWSVM algorithm and the TWSVM algorithm are higher than those of the SVM algorithm and the LSSVM algorithm. This shows that TWSVM has better recognition performance than SVM and LSSVM when solving nonlinear problems and that a twin support vector machine is more suitable for gene splicing site recognition than the other two support vector machines. Moreover, the PSO algorithm makes the parameter selection of a twin support vector machine more accurate than a blind search for parameters. The PSO-TWSVM algorithm finds the optimal parameters of a twin support vector machine, which gives it the best recognition ability. Among the four different SVM algorithms, the proposed PSO-TWSVM algorithm obtained the highest positive sample recognition rates, the highest negative sample recognition rates, and the best $CC$ values on both acceptor sites and donor sites (as shown in Figure 4). To sum up, the PSO-TWSVM algorithm has good recognition performance for gene splicing sites.

5. Conclusions

A Particle Swarm Optimization Twin Support Vector Machine (PSO-TWSVM) algorithm was proposed to recognize gene splicing sites, which is conducive to the development of smart healthcare. The Particle Swarm Optimization (PSO) algorithm optimizes the parameters of the twin support vector machine, making parameter selection more accurate than a blind search. Experimental results show that, compared with the SVM algorithm, the TWSVM algorithm, and the LSSVM algorithm, the PSO-TWSVM algorithm has the highest recognition rates on both acceptor sites and donor sites. However, although the PSO-TWSVM algorithm shows excellent performance on gene splicing site recognition, it is sensitive to noise. Further research will therefore be carried out on how to reduce noise interference. Building on this method, we will also combine the proposed algorithm with deep learning algorithms to further improve the recognition results of gene splicing sites.

Abbreviations

SVM: Support vector machine
TWSVM: Twin support vector machine
LSSVM: Least squares support vector machine
PSO: Particle swarm optimization
PSO-TWSVM: Particle swarm optimization twin support vector machine
HGP: Human genome project
CC: Correlation coefficient

Data Availability

The data are available upon request to the corresponding author.

Disclosure

An earlier version of this work has been presented as a preprint according to the following link https://www.researchsquare.com/article/rs-417904/v1 [34], which is not published. The earlier version was revised to form the current version. This work is subject to the latest version, copyrighted by the Wireless Communications and Mobile Computing.

Conflicts of Interest

The authors declare no conflict of interest.

Acknowledgments

This work is partially supported by the Digital Media Art, Key Laboratory of Sichuan Province, Sichuan Conservatory of Music, Project No.: 21DMAKL01; supported by the First Batch of Industry-University Cooperation Collaborative Education Project funded by the Ministry of Education of the People’s Republic of China, 2021, Project No.: 202101071001; supported by the Reproductive Health (Maternal and Infant Fertility/Development Health) Innovation Platform funded by the Science and Technology Projects of Fuzhou Health Commission, Project No.: 2019-s-wp; and supported by the Clinical Study of Gangliosides for Early Warning of Offspring Cognitive Development in Thyroid Disease during Pregnancy funded by the Fujian Natural Science Foundation of China, Project No.: 2022J01521.