Computational and Mathematical Methods in Medicine

Volume 2015, Article ID 794586, 12 pages

http://dx.doi.org/10.1155/2015/794586

## Dimensionality Reduction in Complex Medical Data: Improved Self-Adaptive Niche Genetic Algorithm

^{1}Department of Biomedical Engineering, Zhejiang University, 38 Zheda Road, Hangzhou, Zhejiang 310027, China
^{2}Guizhou Key Laboratory of Agricultural Bioengineering, Guizhou University, Guiyang, Guizhou 550025, China
^{3}Zhejiang Hospital, Hangzhou, Zhejiang 310058, China

Received 24 July 2015; Revised 24 September 2015; Accepted 4 October 2015

Academic Editor: Anne Humeau-Heurtier

Copyright © 2015 Min Zhu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

With the development of medical technology, more and more parameters are produced to describe the human physiological condition, forming high-dimensional clinical datasets. In clinical analysis, these data are commonly used to build mathematical models and carry out classification. High-dimensional clinical data increase the complexity of the classification used in such models and thus reduce their efficiency. The Niche Genetic Algorithm (NGA) is an excellent algorithm for dimensionality reduction. However, in the conventional NGA, the niche distance parameter is fixed in advance, which prevents the algorithm from adjusting to its environment. In this paper, an Improved Niche Genetic Algorithm (INGA) is introduced. It employs a self-adaptive niche-culling operation in the construction of the niche environment to improve population diversity and avoid local optimal solutions. INGA was verified in a stratification model for sepsis patients. The results show that, by applying INGA, the feature dimensionality of the dataset was reduced from 77 to 10 and that the model achieved an accuracy of 92% in predicting 28-day mortality in sepsis patients, significantly higher than the other methods tested.

#### 1. Introduction

Clinical decision systems can aid in disease diagnosis and predict clinical outcomes in response to treatment [1, 2]. For the diagnosis of sepsis, a number of scoring systems have been proposed, such as the Acute Physiology and Chronic Health Evaluation (APACHE), Sequential Organ Failure Assessment (SOFA), and Clinical Pulmonary Infection Score (CPIS) [1, 3]. These systems are challenged because traditional markers of infection can be misleading and better methods for evaluating prognosis are lacking [1, 4–6]. To improve treatment outcomes, diagnostic models are needed that accurately predict the development of sepsis and stratify its severity [7].

However, the clinical data of sepsis involved in diagnostic models are usually high dimensional. High-dimensional datasets increase the complexity of classification and reduce the effectiveness of models [8]. Thus, before building models, it is necessary to reduce the data dimensionality while retaining the essential information of the original data. Feature extraction and feature selection are the main approaches to dimensionality reduction [2, 9].

*(A) Feature Extraction*. Feature extraction transforms the original feature space into a new one of lower dimension. Algorithms like Principal Component Analysis (PCA), Multidimensional Scaling (MDS), and Independent Component Analysis (ICA) are widely used for feature extraction. However, ICA and PCA are linear projection methods, and if the feature vectors lie along a nonlinear manifold in a high-dimensional space, they may lead to classification errors [10, 11]. Besides, MDS is sensitive to undersampled datasets and has difficulty dealing with defective data [12]. Furthermore, PCA, MDS, and ICA generate new parameters after dimensionality reduction, and the significance of these new parameters is not always interpretable.

*(B) Feature Selection*. Feature selection is a process that selects an optimal feature subset from the original features while retaining sufficient information [13]. Many feature selection algorithms have been developed, such as Genetic Algorithms (GAs), Support Vector Machine (SVM) Wrappers, Sparse Generalized Partial Least Squares Selection (PLS), and Particle Swarm Optimization (PSO) [14–17]. Among them, GAs are widely used. However, in some multimodal optimization problems, GAs fail to maintain multiple global or local optima [13]. Thus, many efforts have been made to improve the ability of GAs to find multiple peak solutions, for example by scaling fitness and adjusting the fitness competition rule [18].

*(a) GAs*. Genetic Algorithms have been used to reduce the number of features in datasets [19–21]. The Genetic Algorithm Pipe Network Optimization Model (GENOME) has been applied to optimize the design of new looped irrigation water distribution networks [22]. An online web-based feature selection tool (DWFS) was developed according to the GA-based wrapper paradigm [23]. However, GAs [24–26] have difficulty handling nonlinear, singular, and multimodal problems. The key issue is that the population is easily trapped in a limited number of solutions, and premature solutions are unable to reach better results [18]. Therefore, Niche Genetic Algorithms (NGAs) were introduced to build a better environment and resolve this problem.

*(b) NGAs*. The capability to locate multiple loci often makes NGAs robust and effective in solving multimodal optimization problems [27–29]. The Twin-space Crowding Genetic Algorithm (TCGA) and the Game-Theoretic Genetic Algorithm (GTGA) were introduced in the literature [18, 30]. The reported work [31] showed that the Nondominated Sorting Genetic Algorithm (NSGA) lacks elitism and needs a specified sharing parameter [32]. However, most niche methods require prior knowledge such as the niche radius or the distance threshold. Accordingly, the niche distance is either set randomly or fixed in advance. These techniques are unable to adapt the niche distance as evolution proceeds and are prone to eliminating potentially excellent individuals [33, 34].

To address these problems, we propose the Improved NGA (INGA), which embeds a self-adaptive niche-culling mechanism for dimensionality reduction. Since MDS and PCA are typical feature extraction algorithms while GA and NGA are typical feature selection algorithms, we compared their dimension reduction results with those of INGA to verify its validity. By applying INGA, the accuracy of sepsis classification improves noticeably while the data dimensionality is reasonably reduced.

#### 2. Method

The idea of NGA is to apply the biological concept of a niche to evolutionary computation. It defines a survival environment with a prespecified distance parameter $L$. In the conventional NGA, $L$ is set in advance, and only a single excellent individual is allowed within this distance. NGA has the following main disadvantages.

(1) *A fixed distance parameter affects the convergence rate.* If the value of $L$ is too large, many individuals fall within this distance and need to be culled, which lowers the convergence rate. In contrast, if $L$ is too small, there are not enough individuals, which leads to premature convergence.

(2) *A single individual inhibits potential individuals.* Within the distance $L$, only one excellent individual is allowed, which causes the elimination of potentially excellent individuals and makes the result of the dimension reduction too large.

(3) *The diversity of the subpopulations is insufficient.* Population diversity is closely related to the subpopulation scale, but the subpopulation scale of NGA is set in advance and cannot be adjusted, so it is difficult to find an optimum scale. If the subpopulation scale is too large, population diversity is easily destroyed; if it is too small, the additional computation of the algorithm increases.

To address these problems, we developed the Niche Elimination Operation, as described in part (A). INGA is then constructed from it, as described in part (B) (Figure 2).

*(A) Niche Elimination Operation*

*(a) Self-Adaptive Survival Distance*. The distance parameter $L$ is designed to be self-adaptive, based on the Euclidean distances among the individuals of each generation, to avoid the convergence problems caused by a preset $L$. Let $X_i$ and $X_j$ be two individuals of the current population, each made up of $M$ loci, where $N$ is the number of individuals in the current population, $M$ is the number of loci (which determines the length of an individual), and $x_{ik}$ and $x_{jk}$ are the values of the $k$th locus. The Euclidean distance between two individuals is

$$d(X_i, X_j) = \sqrt{\sum_{k=1}^{M} (x_{ik} - x_{jk})^2}, \quad i, j = 1, 2, \ldots, N. \tag{1}$$

The distance parameter is then calculated as the mean pairwise distance:

$$L = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} d(X_i, X_j). \tag{2}$$

Because the individuals of each generation differ, the value of the distance parameter varies with the generation, so a reasonable distance parameter is obtained during the evolutionary process of each generation, yielding a better niche environment.
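A minimal sketch of this computation (function names are ours), taking $L$ as the mean pairwise Euclidean distance of the current generation as described above:

```python
import itertools
import math

def euclidean(a, b):
    # Equation (1): Euclidean distance between two individuals
    # (vectors of locus values).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def survival_distance(population):
    # Equation (2): the self-adaptive survival distance L is the mean
    # of all pairwise distances, so it tracks the population as it
    # evolves instead of being fixed in advance.
    pairs = list(itertools.combinations(population, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)
```

Because `survival_distance` is recomputed every generation, a population that spreads out (or contracts) automatically widens (or narrows) its niches.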

*(b) Similarity Criterion*. Allowing only one excellent individual within $L$ causes the elimination of potentially excellent individuals that may not actually be similar to the retained one. Therefore, within the distance parameter $L$, the similarity of biallelic loci is used to judge the similarity of individuals and to determine whether an individual should be retained.

The similarity of biallelic loci and the average similarity between two individuals are given by the following two equations:

$$S(i, j) = \frac{m_{ij}}{M}, \quad i \ne j, \tag{3}$$

where $S(i, j)$ represents the similarity between two individuals, $i, j = 1, 2, \ldots, N$, and $m_{ij}$ is the number of loci at which the two individuals carry the same allele value. The average similarity between the $i$th individual and the others is

$$\bar{S}_i = \frac{1}{N-1} \sum_{j=1, j \ne i}^{N} S(i, j). \tag{4}$$

When $d(X_i, X_j) < L$, the similarity between the two individuals is examined. If the similarity is larger than the average similarity, the individual with the lower fitness is given a penalty, as shown in the following equation; otherwise, the lower-fitness individual is retained:

$$F_i' = F_i \times P, \tag{5}$$

where $F_i$ is the original fitness of the individual, $F_i'$ is the new fitness, and $P$ is the penalty factor (usually much smaller than 1). This method reduces the unnecessary elimination of individuals.
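The similarity criterion can be sketched as follows (function names and the example penalty factor are ours):

```python
def locus_similarity(a, b):
    # Equation (3): fraction of loci at which two individuals
    # carry the same allele value.
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)

def average_similarity(i, population):
    # Equation (4): mean similarity between individual i and
    # every other individual in the population.
    others = [j for j in range(len(population)) if j != i]
    return sum(locus_similarity(population[i], population[j])
               for j in others) / len(others)

def cull(fit_low, sim, avg_sim, p=1e-3):
    # Equation (5): penalize the lower-fitness individual only when
    # the pair is more similar than average; otherwise retain it.
    return fit_low * p if sim > avg_sim else fit_low
```

An individual that is close to a better one but genetically dissimilar keeps its fitness, which is exactly how the criterion avoids eliminating potentially excellent individuals.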

*(c) Maintaining Population Diversity*. To maintain the diversity of the population, the scale of the subpopulations must be controlled. Equations (6) and (7) are therefore designed, together with a memory pool of optimal individuals, to limit the scale of the subpopulations in each generation. The average fitness of generation $g$ is

$$\bar{F}_g = \frac{1}{N_g} \sum_{i=1}^{N_g} F_{g,i}, \tag{6}$$

where $\bar{F}_g$ represents the average fitness value of generation $g$, $F_{g,i}$ represents the fitness of individual $i$ in generation $g$, and $N_g$ is the scale of the population in generation $g$. The scale of the subpopulations in generation $g$, denoted $N_g^{s}$, is the number of individuals whose fitness is not below the average:

$$N_g^{s} = \left| \{\, i \mid F_{g,i} \ge \bar{F}_g \,\} \right|. \tag{7}$$

A memory pool of optimal individuals is designed to exchange excellent evolutionary individuals. This operation increases the possibility of obtaining more excellent individuals and, to some extent, avoids premature convergence during the evolution of a single population. The individuals of generation $g$ are sorted by fitness, and the top $m$ are put into the memory pool.
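A sketch of the two operations, assuming the above-average counting rule for $N_g^{s}$ and a top-$m$ memory pool (function names are ours):

```python
def subpopulation_scale(fitnesses):
    # Equation (6): mean fitness of the current generation.
    avg = sum(fitnesses) / len(fitnesses)
    # Equation (7): the subpopulation scale is the number of
    # individuals at or above the average, so the scale adapts to
    # the fitness distribution instead of being preset.
    return sum(1 for f in fitnesses if f >= avg)

def memory_pool(population, fitnesses, m):
    # Keep the m best individuals of the generation for later
    # exchange with the evolving subpopulations.
    ranked = sorted(zip(fitnesses, population),
                    key=lambda t: t[0], reverse=True)
    return [ind for _, ind in ranked[:m]]
```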

Based on $N_g^{s}$, an index $D_g$ of the capability to maintain population diversity in generation $g$ is designed; the smaller the value of $D_g$ is, the higher the population diversity. $D_g$ is computed from the length $M$ of the individual encoding, the scale $N_g$ of the population in generation $g$, and the values $x_{ik}$ of the $k$th locus of the $i$th individual.

*(B) Flowchart of INGA*

*Step 1 (calculate fitness). *First, initial individuals are produced at random. The reciprocal of the sum of squared errors on the classifier's test set is taken as the fitness function [33], so as to fully reflect the advantage of controlling errors by combining INGA with a classifier:

$$F = \frac{1}{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2},$$

where $\hat{y}_i$ is the predicted value for the test set, $y_i$ is the true value, and $n$ is the number of test samples. Individuals are sorted by fitness in descending order, and the top $m$ individuals are stored in the memory pool.
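A minimal sketch of this fitness function (the small epsilon guarding against division by zero on a perfect fit is our addition, not from the paper):

```python
def fitness(predicted, actual):
    # Reciprocal of the sum of squared errors on the classifier's
    # test set: smaller error -> larger fitness.
    sse = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return 1.0 / (sse + 1e-12)
```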

*Step 2 (Niche Elimination Operation to produce excellent initial individuals). *In this step, the excellent initial individuals are produced, as shown in Figure 1.

(a) *Self-Adaptive Survival Distance*. First, calculate the Euclidean distance $d(X_i, X_j)$ between $X_i$ and $X_j$ according to (1). Second, calculate the self-adaptive survival distance $L$ according to (2).

(b) *Similarity Criterion*. Judge the similarity of the individuals within the distance $L$ by allele contrast, so as to determine whether each individual should be retained. When $d(X_i, X_j) < L$, the similarity of biallelic loci and the average similarity, given by (3) and (4), are compared. If $S(i, j) > \bar{S}_i$, the lower-fitness individual is punished with the penalty function according to (5); otherwise, the individual with the lower fitness is retained. When $d(X_i, X_j) \ge L$, the individual with the lower fitness is also retained.

(c) *Maintaining Population Diversity*. The subpopulation scale $N_g^{s}$ is calculated according to (7). Individuals are sorted by fitness in descending order. If the scale of the existing subpopulation is larger than $N_g^{s}$, the top $N_g^{s}$ individuals are selected. Otherwise, the individuals in the memory pool are merged with the existing subpopulation and sorted by fitness in descending order, and the top $N_g^{s}$ individuals are selected; if the merged set is still smaller than $N_g^{s}$, additional individuals are generated randomly so that $N_g^{s}$ individuals are obtained. Through this method, the initial population has a higher average fitness, which is conducive to the evolution of the population towards the solution of the problem.
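Tying sub-steps (a) and (b) together, one pass of the Niche Elimination Operation can be sketched as follows (helper names and the example penalty factor are ours):

```python
import itertools
import math

def niche_eliminate(population, fitnesses, p=1e-3):
    # For every pair closer than the self-adaptive distance L and
    # more similar than average, scale down the weaker member's
    # fitness by the penalty factor p; all other individuals keep
    # their fitness, per the similarity criterion.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    def sim(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a)
    n = len(population)
    pairs = list(itertools.combinations(range(n), 2))
    L = sum(dist(population[i], population[j]) for i, j in pairs) / len(pairs)
    fits = list(fitnesses)
    for i, j in pairs:
        if dist(population[i], population[j]) < L:
            avg_i = sum(sim(population[i], population[k])
                        for k in range(n) if k != i) / (n - 1)
            if sim(population[i], population[j]) > avg_i:
                weak = i if fits[i] <= fits[j] else j
                fits[weak] *= p
    return fits
```

The penalized fitness values then feed into the descending sort of sub-step (c), so heavily penalized near-duplicates naturally drop out of the selected subpopulation.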