Research Article | Open Access
A Novel Efficient Feature Dimensionality Reduction Method and Its Application in Engineering
In the engineering field, excessive data dimensions affect the efficiency of machine learning and analysis of the relationships between data or features. To render feature dimensionality reduction more effective and faster, this paper proposes a new feature dimensionality reduction approach combining a sampling survey method with a heuristic intelligent optimization algorithm. Drawing on feature selection, this method builds a feature-scoring system and a reduced-dimension length-scoring system based on the sampling survey method. According to feature scores and reduced-dimension lengths, the method selects a number of features and reduced-dimension lengths that are ranked in the front with high scores. This feature dimensionality reduction method allows for in-depth optimal selection of features and reduced-dimension lengths with high scores using an improved heuristic intelligent optimization algorithm. To verify the effectiveness of the dimensionality reduction method, this paper applies it to road roughness time-domain estimation based on vehicle dynamic response and gene-selection research in bioengineering. Results in the first case show that the proposed method can improve the accuracy of road roughness time-domain estimation to above 0.99 and reduce measured data of the vehicle dynamic response, reducing the experimental workload significantly. Results in the second case show that the method can select a set of genes quickly and effectively with high disease recognition accuracy.
The curse of dimensionality is a common problem in engineering. Ineffective and unreasonable feature dimensionality reduction will compromise machine-learning efficiency, pattern recognition accuracy, and data-mining efficiency while increasing the workload of measured data experiments to some extent. A set of high-dimension features possesses the following problems: too many features but few samples, too many features with few or no relations to the mining task, and excessive redundancy among features [1–4].
Feature dimensionality reduction methods can be classified into two types: feature selection and feature projection. Feature projection is also called feature reparameterization and transforms the original feature space into one with a smaller dimensionality and more independent dimensions [5–7]. Feature projection generally uses principal component analysis (PCA), kernel principal component analysis, or other improved principal component analysis methods for linear or nonlinear transformations. Although these methods can realize feature dimensionality reduction to some extent, they require extensive calculations and high computational complexity. These dimensionality reduction methods also cannot reduce necessary experimental data and are inconvenient when identifying features heavily correlated to mining tasks. By contrast, feature selection obtains low-dimension features more effectively compared to machine learning or data mining and carries low computational complexity [8–10]. However, existing feature selection methods are still time-intensive with respect to calculations, and some data dimensionality reduction problems in engineering remain difficult to solve using these methods.
The road roughness time-domain estimation of vehicle dynamic response is taken as an example in this paper, capturing feature dimensionality reduction in multivariate time series. Acquiring road roughness information based on vehicle dynamic response is an economical and practical method, but most researchers have only improved the precision of road roughness estimation by using different neural networks or enhanced estimation methods [11–14]; they have thus far failed to evaluate the types and quantities of dynamic responses to be assessed and the positions of points to be measured in the vehicle. Measurement points can change positions freely in the vehicle, and each position can offer three dynamic responses (vertical vibration acceleration, speed, and displacement); as such, the example has theoretically infinite feature dimensions. In this case, to improve the accuracy and speed of road roughness estimation based on neural networks, key features must be selected. Additionally, the subsequent experiment needs to test and measure the selected features, which guides the experiment and reduces the workload substantially.
For instance, in gene-selection research on cancer, the mining of gene expression profile data can identify cancer types. The gene expression profile has many data dimensions but small samples. For instance, leukemia has 7129 gene dimensions but only 72 samples for analysis, which increases the challenges of feature dimensionality reduction. Various methods exist for gene selection, but they are computationally costly and complex [15–18]. Many researchers have introduced information to facilitate gene selection, but these methods cannot properly reflect the length of a set of gene features, namely, the number of dimensions after dimensionality reduction. Overall, engineering projects in road roughness time-domain estimation based on vehicle dynamic response and biological gene selection suffer from a curse of dimensionality with an extensive number of features. In addition to their computational cost and complexity, common dimensionality reduction methods (e.g., PCA and cluster analysis) fail to effectively reduce dimensionality (i.e., experimental data) and thus cannot improve estimation accuracy. If researchers use feature selection, uncertainty persists regarding the number of features selected such that the length of the final feature set is unclear. Several variables must be measured to determine which are uncertain.
To solve the problem of excessive dimensions, this paper proposes a new feature selection method based on a sampling survey method and heuristic intelligent optimization algorithm. The general process is as follows: first, we build a feature-scoring system and a reduced-dimension length-scoring system based on the sampling survey method; second, we sort features and reduced-dimension lengths according to their scores and select the features and reduced-dimension lengths that are ranked in the front; third, to apply selection optimization to features and reduced-dimension lengths simultaneously, we redefine the meaning of the information carried by individuals in the heuristic intelligent optimization algorithm population and improve the algorithm accordingly. To better explain and verify the effect of the dimensionality reduction method, this paper conducts feature dimensionality reduction using road roughness time-domain estimation of vehicle dynamic response and gene-selection research in bioengineering. Results show that the new feature dimensionality reduction method proposed can select features quickly and effectively: the accuracy of road roughness time-domain estimation was generally higher than 0.99 and effectively reduced vehicle dynamic response measurement data, and a reasonable number of genes was selected.
2. New Feature Dimensionality Reduction Method Based on Sampling Survey Method and Improved Heuristic Intelligent Optimization Algorithm
2.1. Scoring System Based on Sampling Survey Method
The sampling survey method takes part of the population as samples for analysis . In a typical dimensionality problem, given a vast number of features, the final number remains unknown after dimensionality reduction; therefore, using enumeration to analyze all samples requires an intensive workload and may fail. By employing the sampling survey method and a certain sample scale, this paper analyzes the features and number of them after dimensionality reduction. All samples must have the same chance of being selected. To obtain useful information for feature dimensionality reduction, we propose a feature-scoring system and a scoring system for reduced-dimension length.
The feature-scoring system operates as follows: Step 1Draw sets of features with any number of dimensions taken randomly from features with equal probability. Calculate the fitness of each sample based on classification recognition accuracy, data-mining efficiency, or neural network estimation accuracy.Step 2Set a threshold value and select all samples with fitness larger than ; these samples comprise the set of valid feature scores.Step 3Score each feature based on the set of valid feature scores. The score of each feature is composed of the following two parts:where is the first part of the score (Score 1) of a feature, is the number of sets of valid feature scores, is the fitness ranking by the th set of valid feature scores, is the minimum of all Score 1s, and is the maximum of all Score 1s. where is the second part of the score (Score 2) of a feature; is a piecewise function such that if the th set of valid feature scores contains the feature, then and otherwise; is the minimum of all Score 2s; and is the maximum of all Score 2s.
The total score of a feature is represented by the following formula: where is the total feature score.
According to Formulas (1) and (2), and , strictly speaking, each belong to the range of [0, 1]. reflects the degree of contribution of a feature to fitness. reflects the frequency of a feature in a set of valid feature scores. A feature with a higher or can realize a better fitness value. Step 4Rank features according to their total scores and choose the first features for the set of valid features.
The scoring system for reduced-dimension length is as follows: Steps 1 and 2These steps are the same as in the feature-scoring systemStep 3Score the length of each reduced dimension according to the set of valid feature scores. The score of each dimension length is as follows:where is the total reduced-dimension length score, is the number of sets with the same dimension length in a set of valid feature scores, and is the fitness of each sample. Step 4Rank each dimension length according to , select those with a proportion of cumulative gross scores greater than 80% out of the total score of all dimension lengths, and identify reasonable dimension lengths after dimensionality reduction.
To verify that a scoring system based on the sampling survey method can be used to effectively select features with the greatest influence on results, this paper uses a calculation example for simple mathematical analysis: the classical problem of Hald cement. Hald cement contains four chemical components: 3CaO·Al2O3 (), 3CaO·SiO2 (), 4CaO·Al2O3·Fe2O3 (), and 2CaO·SiO2 (). Table 1 lists data on relations between heat (calorie) released by each gram of cement and the content of these four components.
In statistics, variance reflects the degree to which two or more data points are different. Then, using the idea of calculating the degree of contribution in PCA, this paper takes the proportion of variance of one variable from the total variance of all variables to represent the degree of influence of that variable: where is the degree of influence of the th feature, is the variance of the th feature, and is the total variance of all features.
Formula (5) can be used to calculate the degrees of influence of four components of Hald cement (, , , and ), which are 0.0579, 0.4050, 0.0686, and 0.4686, respectively. and have high degrees of influence, has less influence, and has the least influence. To further reflect the degree of influence of these variables on heat , we employed the radial basis function neural network (RBFNN) and considered several parts of the four components of Hald cement as input features to estimate , the heat released by each gram of each component. If a feature does not have typical representativeness, then the estimation accuracy of RBFNN will be low. Therefore, we can rank the four features in terms of their influence degree according to the estimation accuracy after RBFNN machine learning. Considering a single feature as input into the neural network and the coefficient of determination as the evaluation index for the degree of closeness of heat ’s predicted value to the actual value, we obtained coefficients of determination for , , , and as 0.7765, 0.9580, 0.8454, and 0.9495, respectively. The results are similar to the variance analysis; therefore, we can conclude that and exert great influences on heat , has less influence, and has the least influence.
According to Step 1 of the scoring system based on the sampling survey method proposed in this paper, we must also calculate the fitness of each sample by considering the accuracy of classification recognition, data-mining efficiency, or neural network estimation accuracy as the fitness to analyze the four features. To better compare their degrees of influence, we considered the output approximation degree of RBFNN when using the scoring system; Table 2 lists the results.
When ranking the four features by degree of influence in descending order, we obtained , , , and . The results of dimensionality length show that when the dimensionality is 2, we can estimate heat using the RBFNN with zero error. In this case, two features input into the neural network must contain either or . The above analysis confirms that the proposed scoring system based on the sampling survey method can be used to evaluate the degree of feature importance to select an ideal feature-set length.
2.2. Improving the Heuristic Intelligent Optimization Algorithm to Enhance Algorithm Execution Efficiency
This paper uses an artificial fish swarm algorithm (AFSA) and particle swarm optimization (PSO) to explain and verify the new feature dimensionality reduction method [20–24]. To improve the rate of convergence and precision of the algorithm, we refine AFSA and PSO.
AFSA is improved as follows: (1)Add the memory function to all AFSA behaviors and store historical optimal positions and corresponding optimal values of individuals and the entire fish swarm(2)Introduce the speed and position calculation formulas of PSO into swarm foraging behavior and update and calculate the movement speed and position of the next-generation fish swarm(3)According to the idea of variation in a genetic algorithm, if the global optimum remains the same in multiple records, we shall disturb multiple fish in the swarm with a certain probability
The PSO is improved as follows: (1)To avoid the prematurity phenomenon in PSO, introduce the probability judgment Metropolis rule of a simulated annealing (SA) algorithmwhere is the probability of accepting a poor solution, is the difference in fitness between two generations, and is the temperature variable in SA, which declines as the number of iterations increases. (2)Change the update criterion of temperature variable towhere is the value of of the th iteration, is the initial value of , is the final value of , and is the total number of iterations. (3)Taking the nonuniform mutation idea of a genetic algorithm as reference, disturb learning parameter of flight speed in PSO to intensify the local search ability. The following formula calculates learning parameter :where is the value of of the th iteration, is the value of of the th iteration, and is a constant, such that in this paper.
To theoretically confirm the excellent optimum searching performance of the improved heuristic intelligent optimization algorithm, this paper analyzes convergence of the improved PSO and improved AFSA. Solis and Wets  studied the stochastic optimization algorithm and provided convergence criteria for stochastic optimization. For a type of stochastic optimization algorithm to converge to the global optimal solution, the following two conditions must be satisfied: (1) Condition H1: , and if , then . In this condition, is the feasible solution space, is the stochastic optimization algorithm, is the result of the th iteration, , and is the solution searched by the algorithm in this iteration.
Condition H1 ensures that the fitness in each iteration of the stochastic optimization algorithm is nonincremental. The stochastic optimization algorithm with convergence ability can get sequence with a probability of 1, and the sequence converges to a certain value. A stochastic optimization algorithm satisfying condition H1 has convergence ability; although the final convergence value is not always a globally optimal solution, it may be locally optimal. (2) Condition H2: For ’s arbitrary Borel subset , if its measure >0, then . is the probability measure of the algorithm’s th iteration result in set .
Condition H2 indicates that if a stochastic optimization algorithm can search and find the globally optimal solution, the algorithm goes through the globally optimal solution at least once when the number of iterations approaches infinity; that is, the algorithm has ergodicity. The convergence of the improved PSO is confirmed below.
In the PSO, we define each particle by updating its speed and position based on the following formulas: where is the speed of a particle in the th iteration, is the inertia weight factor, and are learning factors, is the historical best position of an individual, is the historical best position of the swarm, and is the position of a particle in the th iteration.
The PSO stores the historical optimal position of a swarm, , thus ensuring that the optimization result obtained in each step of the algorithm is nonincremental; hence, the PSO satisfies condition H1.
Then, we judge the convergence of particle speed and position according to convergence in probability. With an inertia weight and the proposed learning factor calculation formula, when the number of iterations approaches infinity, we have , where is an arbitrarily small positive value. Suppose is a point of convergence in the PSO; then, we have .
If the number of iterations is sufficiently large, then the particle’s position and update speed will converge to and 0, respectively, with a probability of 1.
Additionally, we analyze and confirm the convergence of PSO using the difference equation . Although and are multidimensional variables, the dimensions are mutually independent; therefore, we simplify algorithm analysis to be a 1D case. To simplify the calculation, we suppose that , the position of optimal solution of a particle, and , the position of the optimal solution of the whole swarm, are constant.
It is a second-order constant-coefficient nonhomogeneous difference equation, and this paper uses the characteristic equation method to solve it.
Equation (13)’s characteristic equation is . A quadratic equation with one unknown quantity can be divided into the following three cases: , , and ().
If , then we have . In this case, we obtain the following equation: where is the initial value of the algorithm iteration and is the initial speed of the iteration.
If , then we have and . In this case, we get the following equations:
If , then we have and . ’s calculation formula is the same as the solution in the case of .
If has a limit when approaches infinity, then requires convergence in all three cases. We get and . Essentially, the PSO’s convergence conditions are , , and .
We propose an improved PSO to improve learning factors and . With an increase in the number of iterations, and approach 0. Therefore, the improved PSO surely satisfies the above convergence conditions only if the inertia factor has a proper value.
We prove the convergence of the improved PSO proposed using the three methods above. However, if the algorithm is required to converge to the global optimal solution, it must satisfy condition H2. In this case, the algorithm goes through the global optimal solution at least once when its number of iterations approaches infinity; that is, the particle in the PSO should demonstrate ergodicity. Apparently, standard PSO cannot guarantee identification of the global optimal value, resulting in the aforementioned prematurity phenomenon. We introduce SA into the evolutionary criteria of PSO because SA has been theoretically proven to converge to the global optimal solution with a probability of 1 in certain conditions .
Next, this paper proposes modifying the learning factor to provide the algorithm with a fast flying speed in early iterations and then obtaining a large search scope in the search space to search and find the global optimal solution when the search scope declines as the search proceeds. According to the derivation of reference , the standard AFSA is convergent. Proposed improvements introduce the speed and position calculation formulas of PSO into the foraging behaviors of the fish swarm. This way, in addition to clustering and tailgating behaviors, the entire fish swarm can share information about the historical optimal position of the swarm and the optimal positions of individuals during foraging behavior. Moreover, with the first improvement to AFSA (i.e., storing the historical optimal positions of individuals and the whole fish swarm and corresponding optimal values) and the convergence analysis of PSO, we can ensure that the improved AFSA satisfies the convergence judging condition H1 in the stochastic optimization algorithm.
Because we use the speed and position update formulas of PSO in the fish swarm’s foraging behaviors where the fish swarm may also exhibit clustering and tailgating behaviors, all of which move in the direction with higher fitness, we suppose that is the convergence point of AFSA. Then, we have where is an arbitrarily small positive value; therefore, the AFSA possesses convergence.
Similarly, to satisfy condition H2, the position of the fish swarm should have ergodicity in AFSA. Like PSO, the standard AFSA also suffers from the prematurity phenomenon. If a super individual appears in the early iterations of the algorithm, then the whole fish swarm will converge quickly to the value. Therefore, we propose a third AFSA improvement: if the historical optimal solution does not change in multiple iterations, then the algorithm randomly changes some individuals in the fish swarm. When the iteration approaches infinity, the fish swarm position will take all possible values in the definition domain with a theoretic probability of 1. Essentially, the improved AFSA is globally convergent with a certain probability.
To verify the computational complexity of the improved PSO and improved AFSA, we choose the following test function : where is an independent variable within the domain of [−1, 2].
For analysis, we used a Lenovo computer with an Intel(R) Core(TM) i5-4590 processor, CPU at 3.30 GHz, and 8 GB memory. The PSO and AFSA used 20 particles over 30 iterations. Figure 1 presents the algorithm test results.
The standard AFSA and standard PSO were trapped in the area of the local optimal value in early iterations and slowly identified the global optimal value as the number of iterations increased, whereas the improved AFSA and improved PSO quickly identified the area of global optimal value in early iterations and finally converged to the global optimal value. The improved PSO and improved AFSA spent only 0.0904 s and 0.0719 s, respectively, converging to the global optimal value, whereas the standard PSO and standard AFSA took 0.0990 s and 0.5321 s, respectively.
2.3. Improving the Heuristic Intelligent Optimization Algorithm Based on the Need for Feature Dimensionality Reduction
To apply the heuristic intelligent optimization algorithm to optimal feature selection, we improved the algorithm based on classification recognition accuracy, data-mining efficiency, or neural network estimation precision as the target value of algorithm optimization. We redefined the individuals of each population in the intelligent optimization algorithm (i.e., redefined information carried by either the fish swarm in AFSA or each particle in PSO). By imitating the transcription and translation process of genetic information in biology, information carried by individuals is defined as shown in Figure 2.
Here, is the first information of an individual within a value range of [0, 1]. is the total number of elements in the valid feature set. Table 3 lists reasonable dimension lengths after dimensionality reduction.
In this case, , , and represent reasonable dimension lengths after the first, second, and th dimensionality reduction (i.e., the number of reasonable features after dimensionality reduction), respectively. Parameters are , , , , , and . is equal to the proportion of the total score of out of the total score of all reasonable dimension lengths. Similarly, is equal to the proportion of the total score of out of the total score of all reasonable dimension lengths. The following is the calculation formula of : where and are the upper and lower limits , respectively, of the th reasonable dimension length; is the total score of , and is the total number of reasonable dimension lengths.
If the value of is within the interval of the th line in Table 1, then the corresponding reasonable dimension length is selected. Moreover, the heuristic intelligent algorithm will extract information from to according to the value of , and the information will be used to calculate the target value of the algorithm.
According to continuous iterative calculation of the heuristic intelligent optimization algorithm, the information carried by each individual will evolve in the direction of the optimal target value to obtain the most reasonable dimension length after dimensionality reduction and feature selection.
3. Road Roughness Time-Domain Estimation Based on Proposed Dimensionality Reduction Method
Researchers generally measure contacting or noncontacting road roughness to determine roughness; however, this method returns measurement results slowly and requires expensive equipment. A method combining a dynamic response and neural network can quickly estimate road roughness. Yet little research exists about the types and quantity of dynamic response and the points to be measured in the vehicle. Because measurement points can change positions freely in the vehicle, and each position can offer three dynamic responses (i.e., vertical vibration acceleration, speed, and displacement), this example theoretically has infinite feature dimensions.
3.1. Building a Vehicle Vibration Model with Freely Changing Measurement Points in the Vehicle
Assume the vehicle is symmetric along the longitudinal axis, and the left and right wheels are under the same road conditions while driving. Given these assumptions, we simplify the vehicle model into a half-vehicle model and the vehicle body into a rigid rod. The vibration model in this paper only considers degrees of freedom: the vehicle body’s vertical vibration, pitch-angle vibration, vertical vibration of the seat system, vertical vibration of front wheel, and vibration of the rear wheel (Figure 3). We consider the points in the vehicle, of which the distance to the center of mass is , as the dynamic response measurement points and build the vibration equation based on Lagrange’s second equation.
In Figure 3, , , , , and denote the vertical vibration displacements (m) of the front wheel, rear wheel, seat system, center of mass of the vehicle body, and pitch-angle displacement of the vehicle body (rad), respectively; , , , , and are the mass of the front wheel, rear wheel, seat system, vehicle body (kg), and rotational inertia of the vehicle body’s mass center around the horizontal axis (kg·m2), respectively; , , , and represent the distance from the front axle, back axle, seat system, and measurement point to the center of mass (m), respectively; , , , , and are the damping of the front suspension, rear suspension, seat system, front wheel, and rear wheel (N·s/m), respectively; , , , , and are the rigidity of the front suspension, rear suspension, seat system, front wheel, and rear wheel (kN/m), respectively; and and are the vertical displacement of road roughness to the front wheel and rear wheel (m), respectively.
The following is Lagrange’s second equation: where denotes the generalized coordinates; represents the difference between the expression of kinetic energy and potential energy ; is the dissipated energy, referring to energy consumed by the damping element; and is the generalized force.
Because the pitch angle has a smaller vibration, the paper takes . We assume the vertical vibration displacement of the measurement point to be and the distances from the measurement point to the front axle and the rear axle to be and , respectively.
The kinetic energy of the system is
The potential energy of the system is
The dissipated energy of the system is
According to the vibration simulation model, we can obtain dynamic response data conveniently. The vibration model uses the filtered white noise method to produce the road roughness signal, meeting experiment requirements, and considers a level-B road as the road to be estimated.
3.2. Research on Feature Dimensionality Reduction in Road Roughness Estimation Using the Proposed Method
This example uses the improved AFSA for optimal feature selection. We number the vertical accelerated speed, vertical speed, and vertical displacement of the positions of measurement point, center of the front wheel, center of the rear wheel, and center of the driver’s seat as shown in Table 4.
Numbers 2, 7, and 12 are the pitch-angle accelerated speed, pitch-angle speed, and pitch angle of the measurement point, respectively.
The example uses the improved AFSA and assumes information carried by the th fish to be . where is the value of the first information, within the range of [0, 1]; are positive integers within [1, 15], the values of which represent the corresponding dynamic response numbers; and is the distance from the measurement point to the center of mass, , and in the example.
In the iteration process, the improved AFSA finds the corresponding reasonable dimension length from Table 3 according to the value of and selects the first elements from accordingly.
The example uses the RBFNN for time-domain estimation of road roughness. We used the improved AFSA and conducted 12 total optimization experiments; results appear in Table 5.
In Table 5, is the coefficient of determination. The feature dimensionality reduction method proposed can choose features with great influence on the precision of road roughness estimation. The sets of features selected in Experiments 3 and 4 were applied in Experiments 7 and 9; thus, we used the features selected in Experiments 3 and 4 for time-domain estimation of road roughness. Figures 5 and 6 display the estimation results.
The actual value was close to the estimated value. The coefficient of determination exceeded 0.99 in both tests, indicating that the proposed feature dimensionality reduction method can select a feature set with high contributions to road roughness using only four sets of features; thus, we can accurately estimate road roughness using only four dynamic response values. To further confirm this result, we used the random forest method to estimate road roughness based on the results of Experiments 3 and 4. Estimation results are presented in Figures 7 and 8.
The road roughness estimation results obtained with feature sets from Experiments 3 and 4 based on the random forest method were 0.9489 and 0.9708, respectively. Although the estimation accuracy of the random forest method was slightly lower than that of RBFNN, it was still quite high, suggesting that the set of dynamic response features selected with the proposed method can match other estimation data methods perfectly. When the proposed feature selection method was combined with a prediction method, only a few dynamic response parameters were needed to estimate road roughness accurately.
4. Gene-Selection Research in Bioengineering Using the Proposed Dimensionality Reduction Method
The experimental subjects in the example consisted of four open microarray datasets on leukemia, the colon, SRBCT, and brain cancer. Leukemia  contained 72 samples, each with 7129 genes. Acute myeloid leukemia (AML) contained 25 samples, and acute lymphoblastic leukemia contained 47. The colon  contained 62 samples (40 of tumor tissue and 22 of normal tissue), each with 2000 genes. SRBCT  included 83 samples, each with 2308 genes: 29 samples of the Ewing family of tumors, 18 samples of neuroblastoma, 25 samples of rhabdomyosarcoma, and 11 samples of Burkitt lymphomas.
Brain cancer  contained 60 samples, each with 7219 genes: 46 samples of patients with classic brain cancer and 14 samples with desmoplastic brain cancer. Although brain cancer only had two classes, the dataset has proven difficult to classify in current research with classification accuracy lower than 85% in much of the literature. Lung cancer  included 203 samples, each with 3312 genes. Samples were classified into five subtypes: 139 adenocarcinoma, 20 pulmonary carcinoids, 21 squamous-cell lung carcinomas, six small-cell lung carcinomas, and 17 normal lungs. Lymphoma had 58 samples, each with 7129 genes. The samples included two subtypes: 32 cured patients and 26 not cured.
Because each type of tumor disease is related to many genes, we roughly extracted genes using the information index to classification (IIC) . This example uses the improved PSO and assumes information carried by the th particle to be . where is the value of the first information, within the range of [0, 1]; are positive integers within the range of [1, ], and their values represent corresponding gene numbers; and is the number of roughly extracted genes based on IIC.
The example uses the extreme learning machine (ELM) to recognize tumor subtypes and takes the mean and standard deviations of the 5-fold CV accuracy of 100 experiments to verify the accuracy of extracted genes.
To demonstrate the advantage of the proposed method in sample classification accuracy, we chose five methods with high gene-selection accuracy, including binary particle swarm optimization BPSO-ELM, K-means-PSO-ELM, K-means-BPSO-ELM, BPSO-GCSI-ELM, and K-means-GCSI-MBPSO-ELM, to compare with the proposed method. Results of 100 independent repeated experiments were taken from ; Table 6 shows the comparison results.
BPSO is a PSO algorithm that solves discrete optimization problems using particles formed by the binary system . BPSO-ELM optimally selects disease-related genes using BPSO and evaluates the classification results of genes selected each time using ELM . K-Means-PSO-ELM and K-means-BPSO-ELM each complete clustering analysis on all disease-related genes using K-means first followed by optimal selection with standard PSO and BPSO combined with the ELM classifier . GCSI refers to gene-to-class sensitivity information; the BPSO-GCSI-ELM method generates an initial particle swarm using the GCSI of each gene and influences the updating and evolutionary process of particles . Combining the above methods, K-means-GCSI-MBPSO-ELM first completes K-means clustering analysis on all disease-related genes and then introduces the GCSI into the modified BPSO to optimally select genes in combination with the ELM classifier .
Because other gene extraction methods have demonstrated good effects on gene extraction of leukemia and SRBCT, this paper does not compare the results of these two types of diseases. Tables 6 and 7 indicate that we can select genes with greater influences on tumor subtype classification using the proposed feature dimensionality reduction method. Gene sets selected with the proposed method exhibited high classification accuracy, substantially higher than other methods. The standard deviation of classification accuracy of the proposed method was smaller than that of the other methods except brain cancer. For the four diseases, the average accuracy of the proposed method was higher than the highest average accuracy of the other methods by 1.25%, 3.84%, 1.57%, and 6.63%, respectively (from left to right in Table 6). For brain cancer and lymphoma, the average classification accuracy improved to approximately 93%; for colon and lung cancer, the average classification accuracy reached approximately 99%. The classification accuracy was fairly high, greatly improving the practicability of gene-based disease detection.
To compare the ELM classifier with other classifiers based on disease classification results, we selected the random forest classifier and support vector machine (SVM) classifier to identify tumor diseases based on the gene sets in Table 7. We used the average and standard deviation of 5-fold CV accuracy from 100 tests to evaluate the classification results of different classifiers. Table 8 shows the results of the random forest classifier and SVM classifier.
In terms of tumor disease recognition accuracy based on the selected gene sets using the proposed method, the random forest classifier and SVM classifier were generally poorer than the ELM classifier. For lung cancer, however, the SVM classifier’s average accuracy was slightly higher than that of the ELM classifier and its standard deviation was smaller.
(1)This paper proposes a new feature dimensionality reduction method based on a sampling survey method and heuristic intelligent optimization algorithm. First, we build a feature-scoring system and a reduced-dimension length-scoring system based on the sampling survey method and select features ranking in the front with reasonable dimension lengths. Then, we select features according to the improved heuristic intelligent optimization algorithm. Our method has only two steps with simple operation and low computational complexity.(2)We successfully apply the proposed feature dimensionality reduction method to road roughness time-domain estimation research. Results show that we only need as few as four dynamic responses to estimate road roughness accurately. The experimental results of 12 feature dimensionality reductions reveal the mean accuracy of road roughness time-domain estimation to be 0.9897.(3)We successfully apply the proposed feature dimensionality reduction method to gene-selection research in bioengineering. Results show that the gene sets selected with the proposed method have high classification accuracy with mean classification accuracy substantially higher than that of other gene-selection methods.
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research was supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX17_0647) and National Key R&D Program of China (2016YFD0701103).
- A. Jalilvand and N. Salim, “Feature unionization: a novel approach for dimension reduction,” Applied Soft Computing, vol. 52, pp. 1253–1261, 2017.
- H. Itoh, A. Imiya, and T. Sakai, “Dimension reduction and construction of feature space for image pattern recognition,” Journal of Mathematical Imaging and Vision, vol. 56, no. 1, pp. 1–31, 2016.
- W. Zhao and S. Du, “Spectral–spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 8, pp. 4544–4554, 2016.
- C. Liu, W. Wang, M. Konan et al., “A new validity index of feature subset for evaluating the dimensionality reduction algorithms,” Knowledge-Based Systems, vol. 121, pp. 83–98, 2017.
- J. Liu, “Feature dimensionality reduction for myoelectric pattern recognition: a comparison study of feature selection and feature projection methods,” Medical Engineering & Physics, vol. 36, no. 12, pp. 1716–1720, 2014.
- M. Yu, L. Shao, X. Zhen, and X. He, “Local feature discriminant projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1908–1914, 2016.
- H. X. Vo and L. J. Durlofsky, “Regularized kernel PCA for the efficient parameterization of complex geological models,” Journal of Computational Physics, vol. 322, pp. 859–881, 2016.
- S. Guo, D. Guo, L. Chen, and Q. Jiang, “A L1-regularized feature selection method for local dimension reduction on microarray data,” Computational Biology and Chemistry, vol. 67, pp. 92–101, 2017.
- J. Liu, Y. Lin, M. Lin, S. Wu, and J. Zhang, “Feature selection based on quality of information,” Neurocomputing, vol. 225, pp. 11–22, 2017.
- S. Goswami, A. Chakrabarti, and B. Chakraborty, “An efficient feature selection technique for clustering based on a new measure of feature importance,” Journal of Intelligent & Fuzzy Systems, vol. 32, no. 6, pp. 3847–3858, 2017.
- H. Imine, Y. Delanne, and N. K. M'Sirdi, “Road profile input estimation in vehicle dynamics simulation,” Vehicle System Dynamics, vol. 44, no. 4, pp. 285–303, 2006.
- A. González, E. J. O'brien, Y. Y. Li, and K. Cashell, “The use of vehicle acceleration measurements to estimate road roughness,” Vehicle System Dynamics, vol. 46, no. 6, pp. 483–499, 2008.
- Z. Q. Gu, Y. F. Zhu, S. Zhang, and X. K. Ma, “Identification of mining road roughness based on GA-BP neural network,” China Mechanical Engineering, vol. 25, no. 23, pp. 3232–3238, 2014.
- Y. C. Qin, J. F. Guan, L. Gu, Y. Li, and R. Liu, “Road profile input estimation using adaptive-neuro fuzzy inference system,” Transactions of Beijing Institute of Technology, vol. 35, no. 5, pp. 481–489, 2015.
- W. Sun and F. Han, “Gene selection method based on gene-to-class sensitivity information and binary particle swarm optimization,” Application Research of Computers, vol. 31, no. 9, pp. 2648–2664, 2014.
- F. Han, W. Sun, and Q. H. Ling, “A novel strategy for gene selection of microarray data based on gene-to-class sensitivity information,” PLoS One, vol. 9, no. 5, article e97530, 2014.
- P. W. Novianti, V. L. Jong, K. C. B. Roes, and M. J. C. Eijkemans, “Meta-analysis approach as a gene selection method in class prediction: does it improve model performance? A case study in acute myeloid leukemia,” BMC Bioinformatics, vol. 18, no. 1, p. 210, 2017.
- Y. X. Wang, J. X. Liu, Y. L. Gao, C. H. Zheng, and J. L. Shang, “Differentially expressed genes selection via Laplacian regularized low-rank representation method,” Computational Biology and Chemistry, vol. 65, pp. 185–192, 2016.
- S. Liu and X. Liu, “A sample survey based linguistic MADM method with prospect theory for online shopping problems,” Group Decision and Negotiation, vol. 25, no. 4, pp. 749–774, 2016.
- W. Hu and F. Y. Xu, “Self-tuning of PID parameters based on improved particle swarm optimization,” Application Research of Computers, vol. 29, no. 5, pp. 1791–1794, 2012.
- J. S. Wang, J. C. Wang, and W. Wang, “Self-tuning of PID parameters based on particle swarm optimization,” Control and Decision, vol. 20, no. 1, pp. 73–81, 2005.
- A. H. Halassi, “An attractor-based multiobjective particle swarm optimization,” International Journal of Applied and Computational Mathematics, vol. 3, no. 2, pp. 1019–1036, 2017.
- C. Y. Wu, “A new improved artificial fish swarm optimization algorithm,” CAAI Transactions on Intelligent Systems, vol. 10, no. 3, pp. 1–6, 2015.
- Q. Duan, M. Mao, P. Duan, and B. Hu, “An improved artificial fish swarm algorithm optimized by particle swarm optimization algorithm with extended memory,” Kybernetes, vol. 45, no. 2, pp. 210–222, 2016.
- F. J. Solis and R. J. B. Wets, “Minimization by random search techniques,” Mathematics of Operations Research, vol. 6, no. 1, pp. 19–30, 1981.
- S. Gao, K. Tang, X. Jiang, and J. Yang, “Convergence analysis of particle swarm optimization algorithm,” Science Technology and Engineering, vol. 6, no. 12, pp. 1625–1631, 2006.
- Y. Feng, Research and Application of Simulated Annealing Algorithm, Kunming University of Science and Technology, Kunming, China, 2004.
- G. Huang, J. Liu, and Y. Yao, “Global convergence proof of artificial fish swarm algorithm,” Computer Engineering, vol. 38, no. 2, pp. 204–206, 2012.
- Z. Cheng and Z. Lu, “Research on the PID control of the ESP system of tractor based on improved AFSA and improved SA,” Computers and Electronics in Agriculture, vol. 148, pp. 142–147, 2018.
- Y. Wang, F. S. Makedon, J. C. Ford, and J. Pearlman, “HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data,” Bioinformatics, vol. 21, no. 8, pp. 1530–1537, 2005.
- U. Alon, N. Barkai, D. A. Notterman et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6750, 1999.
- J. Khan, J. S. Wei, M. Ringnér et al., “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
- Y. Y. Leung, C. Q. Chang, Y. S. Hung, and P. C. Fung, “Gene selection for brain cancer classification,” in 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 5846–5849, New York, NY, USA, September 2006.
- Y. Li and R. Xiaogang, “Feature selection for cancer classification based on support vector machine,” Journal of Computer Research and Development, vol. 42, no. 10, p. 1796, 2005.
- J. Kennedy and R. C. Eberhart, “A discrete binary version of the particle swarm algorithm,” in 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, pp. 4104–4108, Orlando, FL, USA, October 1997.
- D. Tang, F. Han, and Z. Cheng, “Gene selection method based on gene scoring strategy and improved particle swarm optimization,” Computer Engineering and Design, vol. 39, no. 3, pp. 710–715, 2018.
- S. Yang, F. Han, and J. Guan, “A hybrid gene selection and classification approach for microarray data based on clustering and PSO,” Communications in Computer and Information Science, vol. 375, pp. 88–93, 2013.
- C. Yang and F. Han, “A gene selection method based on BPSO with prior information,” Software Guide, vol. 14, no. 7, pp. 36–40, 2015.
- F. Han, C. Yang, Y. Q. Wu et al., “A gene selection method for microarray data based on binary PSO encoding gene-to-class sensitivity information,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14, no. 1, pp. 85–96, 2017.
Copyright © 2018 Zhun Cheng and Zhixiong Lu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.