Genetic Algorithm Based Approach in Attribute Weighting for a Medical Data Set
Genetic algorithms have been utilized in many complex optimization and simulation tasks because of their powerful search method. In this research we studied whether the classification performance of the attribute weighted methods based on the nearest neighbour search can be improved when using the genetic algorithm in the evolution of attribute weighting. The attribute weights in the starting population were based on the weights set by the application area experts and machine learning methods instead of random weight setting. The genetic algorithm improved the total classification accuracy and the median true positive rate of the attribute weighted k-nearest neighbour method using neighbour’s class-based attribute weighting. With other methods, the changes after genetic algorithm were moderate.
One of the most commonly used simple classification methods is the nearest neighbour (NN) method, which classifies a new case into the class of its nearest neighbour case. The nearest neighbour method is an instance-based learning method that searches the training data for the case most similar to the test case by some distance measure, usually the Euclidean distance. A natural extension to NN is the k-nearest neighbour (k-NN) method, which assigns the majority class of the k nearest training cases to the test case. Different refinements and extensions have been proposed for k-NN in order to improve classification results and overcome classification problems, for example, distance-weighting of neighbours, extensions using properties of the data set, weighting of attributes [2, 4, 5], and attribute weight optimization with genetic algorithms (GA) [6–11].
Genetic algorithms [12, 13] and other evolutionary algorithms [14, 15] have been utilized in various complex optimization and simulation problems because of their powerful search and optimization capabilities. The search method of a genetic algorithm is a combination of directed and stochastic search, and the search can be done multidirectionally because the GA maintains a population of potential solutions from the search space. The basics of the search method of the GA lie in natural selection and genetic inheritance; individuals of the population are used in the reproduction of new solutions by means of crossover and mutation. Genetic algorithms have been used with various machine learning methods to optimize weighting properties of the method. Since our research is based on machine learning methods that apply the nearest neighbour search, we concentrate on related works where GAs have been applied with the k-nearest neighbour method. Kelly and Davis combined the GA with a weighted k-nearest neighbour (wk-NN) method in the algorithm called GA-WKNN in order to find a single attribute weight vector that would improve the classification results of the wk-NN. A similar approach combined the GA with the wk-NN and a parallel processing environment in order to optimize the classification of large data sets. In both studies, the GA runs produced a set of real-valued attribute weights for discriminating all classes of the data. The study of Hussein et al. showed that the GA can be applied successfully in setting a real-valued weight set for a 1-NN classifier, but the improvement in accuracy came at the expense of increased processing time. Results showed that GA methods combined with the wk-NN outperformed the basic k-NN [6–8]. However, a single set of weights for all classes is not always the best solution because attributes have a different effect on different classes.
Therefore, solutions for searching for a weight for each class and attribute have been developed. Lee et al. combined the GA-based attribute weighting method with a modified k-NN, thus forming an adaptive feature weighting method A3FW-MNN that used different sets of attribute weights for different classes. Also, Mateos-García et al. assigned different weights to every attribute depending on each class in their evolutionary algorithm called the Label Dependent Feature Weighting (LDFW) algorithm.
In this research we studied whether the classification performance of attribute weighted machine learning methods based on the nearest neighbour search can be improved by using a genetic algorithm to evolve attribute weights that were initially set by experts and machine learning methods. The runs were made with a medical data set that has served as test data in our previous research [16, 17].
In this research an otoneurological data set having 951 cases from seven different vertigo diseases (classes) (Table 1) was used. The data was collected over a decade starting from the 1990s in the Department of Otorhinolaryngology at Helsinki University Central Hospital, Finland, where experienced specialists confirmed all the diagnoses. The distribution of the disease classes is imbalanced; over one-third of the cases belong to the Menière’s disease class (36.8%), whereas the smallest disease class benign recurrent vertigo has only 2.1% of the cases.
In total, the data includes 176 attributes concerning a patient’s health status: occurring symptoms, medical history, and clinical findings in otoneurologic, audiologic, and imaging tests [18, 19]. Clinical testing has not been done to every patient and, therefore, there are several test results that have missing values of the attributes. Attributes with low frequencies of available values were left outside this research. After leaving out the attributes having over 35% missing values, 94 attributes remained to be used in this research: 17 quantitative (integer or real value) and 77 qualitative attributes (of which 54 were binary (yes/no), 20 were ordinal, and 3 were nominal). Genetic algorithm runs were done with the data including missing attribute values.
3. Genetic Algorithm
The basic idea of the genetic algorithm is the following: in the beginning, a population of individuals is formed either randomly or with information about the application domain. Traditionally, a binary representation of the individuals has been used, but in multidimensional and numerical problems a real-valued representation is nowadays used. In each generation, the individuals of the population are evaluated with an objective evaluation function, thus giving each individual its fitness rate. A selection method is used to find the fittest individuals for a new population. Some individuals of the new population undergo reproduction by means of crossover and mutation. In the crossover, the information of the individuals is swapped in their corresponding elements. Mutation alters one or more elements of the individual arbitrarily. Elitism is a commonly applied survivor selection method. It keeps the current fittest individual unchanged in the population so that the high-performance individuals are not lost from one generation to the next. The GA can be ended after a fixed number of iterations or if no further improvement is observed after some number of generations.
We utilized the genetic algorithm in the evolution of the attribute weight values. A pseudocode of the used genetic algorithm is given in Pseudocode 1. A population contained 21 individuals that used a real-valued representation instead of a binary representation because the attribute weight values were described with real-valued numbers, not just with 0 and 1. Each individual consisted of seven different attribute weight sets for 94 attributes. The individuals of the starting population were based on the weights set by the experts and machine learning methods. The starting population is defined more accurately in Section 3.1. The genetic algorithm used a roulette-wheel selection in parent selection and a uniform crossover with discrete recombination in offspring creation. The crossover was done with a probability of 80.0% ($p_c = 0.8$) and the crossover points were selected randomly and independently for each gene (a field of an individual). Mutation was done with a probability of 1.0% ($p_m = 0.01$) per gene and it was also done in a uniform manner: a random value was drawn from the range $[0, 1]$ and set as the new value in the current position. In addition, elitism was used in order to keep the best individual within the population during runs, because we did not want to lose the best performing weight set during the evolution. If the number of the individuals was higher than 21 at the end of the generation, a survivor selection was used: the individuals were ordered by their classification performance and the individuals with the lowest accuracy were discarded from the population. The genetic algorithm ended after 20 generations or if the best classification accuracy remained the same during 10 successive generations. Furthermore, if all the individuals in the population were the same, the run ended. The parameters used in the GA runs are described in Table 2.
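The operators described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the flat list representation of an individual, the 50% per-gene swap chance in the crossover, the number of offspring per generation, and the pooling of parents and offspring before survivor selection are all hypothetical. Only the population size, the two probabilities, and the weight range of 0 to 1 come from the text.

```python
import random

POP_SIZE = 21       # individuals per population (from the text)
P_CROSSOVER = 0.8   # crossover probability (from the text)
P_MUTATION = 0.01   # per-gene mutation probability (from the text)

def roulette_select(population, fitnesses, rng):
    """Pick one parent with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total == 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return individual
    return population[-1]

def uniform_crossover(a, b, rng):
    """Swap each gene independently (assumed 50% chance per gene)."""
    child1, child2 = list(a), list(b)
    for i in range(len(a)):
        if rng.random() < 0.5:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2

def uniform_mutation(individual, rng):
    """Replace each gene by a random value from [0, 1] with probability P_MUTATION."""
    return [rng.uniform(0.0, 1.0) if rng.random() < P_MUTATION else g
            for g in individual]

def next_generation(population, fitness_fn, rng):
    fitnesses = [fitness_fn(ind) for ind in population]
    offspring = []
    for _ in range(POP_SIZE // 2):  # assumed number of parent pairs
        p1 = roulette_select(population, fitnesses, rng)
        p2 = roulette_select(population, fitnesses, rng)
        if rng.random() < P_CROSSOVER:
            p1, p2 = uniform_crossover(p1, p2, rng)
        offspring += [uniform_mutation(p1, rng), uniform_mutation(p2, rng)]
    # Assumed survivor selection: pool parents and offspring and keep the
    # POP_SIZE fittest, which also preserves the best individual (elitism).
    pool = population + offspring
    pool.sort(key=fitness_fn, reverse=True)
    return pool[:POP_SIZE]
```

In the study the fitness function would be the ACC or TPR of a classifier using the weight sets of the individual; here any function that maps an individual to a non-negative number can be passed as `fitness_fn`.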
The genetic algorithm runs were done separately with three different machine learning methods used in the population evaluation: with the nearest pattern method of the otoneurological expert system (ONE), with the attribute weighted k-nearest neighbour method using neighbour’s class based attribute weighting (cwk-NN), and with the attribute weighted k-nearest neighbour method using one-versus-all the other (OVA) classifiers (wk-NN OVA). The evaluation methods are defined more accurately in Section 3.2. During the genetic algorithm runs, for each individual in the population its fitness rate was calculated with the method at hand; that is, the individual was evaluated against the method. Within the methods cwk-NN and ONE, the fitness rate for the individual was defined with a total classification accuracy (ACC) and within the wk-NN OVA with a true positive rate (TPR). The total classification accuracy was used with the ONE and the cwk-NN because all seven disease classes were classified at the same time whereas the wk-NN OVA concentrated on one disease class (and its weight set) at a time. During GA wk-NN OVA runs, it was more important to find the weight set that separated well the cases of the disease class at hand from the others than to classify the other cases also well.
The total classification accuracy showed the percentage of all correctly classified cases within the data set:
$$\mathrm{ACC} = 100 \cdot \frac{t_{pos}}{n_{cases}}\,\%, \tag{1}$$
where $t_{pos}$ was the total number of cases correctly classified within classes and $n_{cases}$ was the total number of cases used in the classification. The true positive rate expressed the percentage of correctly inferred cases within the class as
$$\mathrm{TPR}_c = 100 \cdot \frac{t_{pos,c}}{n_{cases,c}}\,\%, \tag{2}$$
where $t_{pos,c}$ was the number of correctly classified cases in class $c$ and $n_{cases,c}$ was the number of all cases in class $c$. With the cwk-NN and wk-NN OVA methods, the classification performance was calculated from the seven nearest neighbour method (7-NN) results and with the ONE from the first diagnosis suggestion (ONE1). However, for the disease class benign recurrent vertigo (BRV) with the wk-NN OVA method it was necessary to use the TPR of three nearest neighbours (3-NN) as the fitness rate because of the small size of the disease class at hand. Otherwise the TPR for classifying BRV would have always been zero. Nonetheless, if the TPR of 3-NN was zero with all individuals in the starting population, a new population was created randomly and evaluated. A random new population was created at most ten times; if the TPR was still zero after these ten attempts, the GA run was ended.
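The two fitness measures can be computed from paired lists of true and predicted class labels. The following Python sketch illustrates the definitions above; the function names and the label lists are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Total classification accuracy (ACC) in percent."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

def true_positive_rate(true_labels, predicted_labels, cls):
    """True positive rate (TPR) in percent for one class."""
    in_class = [(t, p) for t, p in zip(true_labels, predicted_labels) if t == cls]
    if not in_class:
        raise ValueError(f"no cases of class {cls!r}")
    correct = sum(t == p for t, p in in_class)
    return 100.0 * correct / len(in_class)
```

With an imbalanced class distribution the two measures can diverge considerably: a classifier can reach a high ACC while the TPR of a small class stays at zero.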
A 10-fold cross-validation (CV) was used in evaluating the classification performance of the genetic algorithm. The data was randomly divided into 10 subsets of approximately equal size. The division was made in a stratified manner to ensure that the class distribution of each subset resembled the skewed class distribution of the entire data set. In the beginning, one cross-validation partition (10% of the data) was left aside to test the performance of the best individual found after the genetic algorithm run. The remaining nine cross-validation partitions (90%) were used during the training process. In order to calculate the fitness rate for each individual in the population during genetic algorithm runs, the training data was further divided into two parts: six cross-validation parts were used for training and three cross-validation parts were used for testing the current machine learning method used in the fitness rate calculation. Thus, during the genetic algorithm run a 60%–30% data division was used. After the genetic algorithm run, the individual having the highest fitness rate was declared the resulting weight combination and it was then tested with the test data subset that had been left aside. The 10-fold cross-validation was repeated ten times. In total, there were 100 test runs for each evaluation method used in the genetic algorithm. The same cross-validation divisions were used with all the evaluation methods; that is, each method had the same training and testing sets during the genetic algorithm runs.
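The data division can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the procedure used in the study: the function names are hypothetical, the round-robin assignment is only one way to obtain stratified folds, and the text does not state how the six training parts were chosen among the nine, so they are picked at random here.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, rng):
    """Return n_folds lists of case indices with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for idx in members:  # deal the cases of each class round-robin
            folds[position % n_folds].append(idx)
            position += 1
    return folds

def split_for_ga(folds, test_fold, rng):
    """Hold out one fold for the final test and split the rest 6/3."""
    rest = [i for i in range(len(folds)) if i != test_fold]
    rng.shuffle(rest)  # assumed: the six training parts are chosen at random
    fit_train = [idx for f in rest[:6] for idx in folds[f]]
    fit_valid = [idx for f in rest[6:] for idx in folds[f]]
    return fit_train, fit_valid, list(folds[test_fold])
```

The held-out fold is touched only once, after the GA run, so the fitness values computed on the 60%–30% division do not leak into the reported test results.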
3.1. Starting Population
The starting population consisted of 21 individuals. Each individual included seven different attribute weight sets (weights for 94 attributes), one set for each disease class. Instead of selecting the starting individuals at random, we decided to use good “guesses” as a starting point. Therefore, the starting individuals were based on the attribute weights defined by the domain experts (three different weight set versions) and learnt by three machine learning methods (the Scatter method [21–23], the weighting method of the instance-based learning algorithm IB4, and its variant IB1w). From each of these six weight sets, two modified versions were created with 50% random mutation, giving 18 weight sets in total. In addition to these, three totally random weight sets were added to the starting population.
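The composition of the starting population (6 base individuals, 12 mutated copies, and 3 random individuals) can be sketched in Python as below. The names are hypothetical, and the reading of "50% random mutation" as replacing each weight by a random value with probability 0.5 is our assumption; the text does not define it further.

```python
import random

N_CLASSES, N_ATTRIBUTES = 7, 94  # from the text

def random_individual(rng):
    """Seven weight sets of 94 random weights in [0, 1]."""
    return [[rng.random() for _ in range(N_ATTRIBUTES)] for _ in range(N_CLASSES)]

def mutated_copy(individual, rng, rate=0.5):
    """Copy an individual, replacing each weight with probability `rate` (assumed)."""
    return [[rng.random() if rng.random() < rate else w for w in weights]
            for weights in individual]

def starting_population(base_individuals, rng):
    """Base individuals + two mutated copies of each + three random individuals."""
    population = [[list(ws) for ws in ind] for ind in base_individuals]
    for ind in base_individuals:
        population.append(mutated_copy(ind, rng))
        population.append(mutated_copy(ind, rng))
    population += [random_individual(rng) for _ in range(3)]
    return population
```

With the six expert and machine learning based individuals as input, the function returns 6 + 12 + 3 = 21 individuals.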
The weight values were computed with the machine learning methods from the imputed data set, that is, from the data set where the missing values of attributes were substituted with the class-wise modes of the qualitative and the class-wise medians of the quantitative attributes. In total, 10.1% of the values of attributes were missing in the data set. The imputation was done class-wise on the basis of the whole data prior to data division into training and testing sets. The calculation of the weights was repeated 10 times for each CV training set in the Scatter, IB4, and IB1w methods and the mean weights of the 10 repetitions were used in the classification to handle the randomness in these methods. The weights defined by the application area experts were the same for each CV training set.
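The class-wise imputation described above can be sketched as follows. This is an illustrative Python version with hypothetical names; it assumes that missing values are marked with `None` and that ties between modes are resolved by first occurrence, which the text does not specify.

```python
from collections import Counter
from statistics import median

def impute_classwise(rows, labels, quantitative):
    """Fill None values with the class-wise median (quantitative columns)
    or the class-wise mode (all other columns)."""
    filled = [list(r) for r in rows]
    for cls in set(labels):
        members = [i for i, label in enumerate(labels) if label == cls]
        for col in range(len(rows[0])):
            known = [rows[i][col] for i in members if rows[i][col] is not None]
            if not known:
                continue  # nothing to impute from in this class
            if col in quantitative:
                fill = median(known)
            else:
                fill = Counter(known).most_common(1)[0][0]
            for i in members:
                if filled[i][col] is None:
                    filled[i][col] = fill
    return filled
```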
The experts’ weights were based on three different combinations. The first weight set included the original attribute weights defined by a group of experienced otoneurological physicians for the decision support system ONE made in the 1990s. The second and the third weight sets were defined by two domain specialists during the upgrade process of the decision support system in the 2000s.
The Scatter method is normally used for attribute importance evaluation [21–23]. It calculates a scatter value for an attribute that expresses the attribute’s power to separate classes in the data set. For attribute weighting purposes, the scatter values were calculated for each attribute in each one-class-versus-other-classes setting. In order to use the scatter values as attribute weights, it was necessary to take the inverses of the scatter values.
The weight calculation method of the IB4 classification method computes attribute weights independently for each class with a simple performance feedback algorithm. The attribute weights of IB4 reflect the relative relevancies of the attributes in the class. The difference between IB4 and its simpler version IB1w is that IB1w saves all processed cases in its class descriptions and does not discard any cases from the class descriptions during runs. Also, the cases with poor classification records are kept in class descriptions with IB1w whereas IB4 discards these cases based on their past performance during classification.
A more detailed description of the machine learning methods Scatter, IB4, and IB1w and their use in weight formation will be given in a separate paper.
In order to make the different weight sets comparable to each other during the genetic algorithm runs, the attribute weights were normalized into the range $[0, 1]$. The values of each weight set were divided by the highest weight value occurring in the weight calculation method at issue.
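The normalization is a division by the method-wide maximum, as the following short Python sketch shows. The function name is hypothetical, and the sketch assumes non-negative weights with at least one positive value.

```python
def normalize_weights(weight_sets):
    """Scale all weight sets of one weighting method by the method's highest weight."""
    highest = max(max(ws) for ws in weight_sets)
    if highest <= 0:
        raise ValueError("at least one positive weight is required")
    return [[w / highest for w in ws] for ws in weight_sets]
```

For example, `normalize_weights([[2.0, 4.0], [1.0, 8.0]])` returns `[[0.25, 0.5], [0.125, 1.0]]`. Dividing by one common maximum keeps the relative importance of attributes across the class-wise weight sets of a method.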
3.2. Evaluation Methods
3.2.1. Nearest Pattern Method of ONE
The first method used within the genetic algorithm to evaluate the performance of the individuals in the population was the inference mechanism of the otoneurological decision support system ONE. Its inference mechanism resembles the nearest neighbour methods of pattern recognition. Instead of searching for the nearest case from the training set, it searches for the most fitting class for a new case from its knowledge base.
In the knowledge base of ONE, a pattern is given to each class that corresponds to one vertigo disease. The pattern can be considered a profile of a disease as it describes its related symptoms and signs. Each class in the knowledge base is described with a set of attributes with weight values expressing their significance for the class. In addition, a fitness value for each attribute value is given to describe how it fits the class. The fitness values for attribute values were computed on the basis of the 60% part of training data. Fitness values can have values between 0 and 100. The fitness value 0 means that the attribute value does not fit the class, whereas the fitness value 100 shows that the value fits the class perfectly. The weight values for attributes were given in the population in the GA; thus, the weight values varied from 0 to 1. The greater the weight value is, the more important the attribute is for the class.
The inference mechanism calculates scores for the classes from the weight and fitness values of the attributes. The score for a class $c$ is calculated in the following way:
$$\mathrm{score}(c) = \frac{\sum_{a=1}^{A(c)} x(a)\, w(c, a)\, f(c, a)}{\sum_{a=1}^{A(c)} x(a)\, w(c, a)}, \tag{3}$$
where $A(c)$ is the number of the attributes associated with class $c$, $x(a)$ is 1 if the value of attribute $a$ is known and otherwise 0, $w(c, a)$ is the weight of the attribute $a$ for class $c$, and $f(c, a)$ is the fitness value for the value of the attribute $a$ for class $c$. In the case of quantitative attributes, the fitness values are interpolated by using the attribute values in the knowledge base as interpolation points. The fitness values are scaled to the range of 0 to 1 during the inference process. In addition to the score, the minimum and maximum scores are calculated for the classes using the lowest and the highest fitness values for the attributes having missing values.
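A minimal Python sketch of the class score follows. It assumes that the score is the weighted mean of the fitness values over the attributes with known values, which matches the symbol definitions above; the function name, the argument layout, and the zero score returned when no weighted attribute is known are hypothetical choices.

```python
def one_score(values_known, weights, fitness):
    """Weighted mean of fitness values over attributes whose value is known.

    values_known: booleans, True if the case has a value for the attribute
    weights:      attribute weights of the class, each in [0, 1]
    fitness:      fitness of the case's attribute value for the class, in [0, 1]
    """
    numerator = sum(w * f for known, w, f in zip(values_known, weights, fitness) if known)
    denominator = sum(w for known, w in zip(values_known, weights) if known)
    return numerator / denominator if denominator > 0 else 0.0
```

Because missing attributes are excluded from both sums, a case with few known values can still obtain a high score; the minimum and maximum scores mentioned above indicate how much the score could change if the missing values were known.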
The classes are ordered primarily by the score and secondarily by the difference of the minimum and maximum score. If the classes have the same score but one class has a smaller difference between the minimum and maximum scores than the others, the class having the smallest difference is placed higher in order. If the classes have the same score and the minimum and maximum score difference, their order is selected randomly. The class having the highest score is referred to as the best diagnosis suggestion.
Some vertigo diseases resemble each other by having similar symptoms during some phase of the disease and, in addition, some patients can actually have two (or more) vertigo diseases present concurrently. Therefore, it is good to check the classification results of ONE with more than one disease suggestion. In the end, the final diagnostic choice must be made by the physician based on the information given on all alternative diseases.
3.2.2. Attribute Weighted k-Nearest Neighbour Method Using Neighbour’s Class-Based Attribute Weighting
The other method used in the population evaluation was the attribute weighted k-nearest neighbour method using neighbour’s class-based attribute weighting (cwk-NN). The distance measure of the basic k-nearest neighbour method was expanded to take the attribute weighting into account. Lee et al. used a similar class-dependent attribute weighting with their modified k-nearest neighbour method where different attribute weight sets for different classes were determined with the adaptive-3FW feature weighting method. With our cwk-NN the attribute weighting depends on the disease class of the neighbour case. Thus, there ought to be as many attribute weight sets available as there are classes.
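The distance measure used with the cwk-NN was the Heterogeneous Value Difference Metric (HVDM) expanded with the attribute weighting. HVDM was used because it can handle both qualitative and quantitative attributes in the data set. The attribute weighted HVDM is defined as
$$\text{weighted\_HVDM}(x, y) = \sqrt{\sum_{a=1}^{m} \left( w_{ca}\, d_a(x_a, y_a) \right)^2}, \tag{4}$$
where $m$ is the number of attributes, $c$ is the disease class of the case $y$, $w_{ca}$ is the weight of the attribute $a$ in class $c$, and $d_a(x_a, y_a)$ is the distance between the values $x_a$ and $y_a$ for attribute $a$. The distance function $d_a$ is defined as
$$d_a(x, y) = \begin{cases} 1, & \text{if } x \text{ or } y \text{ is unknown,} \\ \text{normalized\_vdm}_a(x, y), & \text{if } a \text{ is nominal,} \\ \text{normalized\_diff}_a(x, y), & \text{if } a \text{ is quantitative.} \end{cases} \tag{5}$$
Because HVDM computes distances to qualitative and other attributes with different measurement ranges, it is necessary to scale their results into approximately the same range in order to give each attribute a similar influence on the overall distance. The normalized distance to a quantitative attribute is calculated with (6):
$$\text{normalized\_diff}_a(x, y) = \frac{|x - y|}{4\sigma_a}, \tag{6}$$
where $\sigma_a$ is the standard deviation of the numeric values of attribute $a$ in the training set of the current classifier, and to a nominal attribute with (7):
$$\text{normalized\_vdm}_a(x, y) = \sqrt{\sum_{c=1}^{C} \left| \frac{N_{a,x,c}}{N_{a,x}} - \frac{N_{a,y,c}}{N_{a,y}} \right|^2}, \tag{7}$$
where $C$ is the number of output classes in the problem domain (in this case $C = 7$), $N_{a,x,c}$ (or $N_{a,y,c}$) is the number of cases in the training set $T$ that have a value $x$ (or a value $y$) for attribute $a$ and the output class $c$, and $N_{a,x}$ (or $N_{a,y}$) is the number of cases in $T$ that have a value $x$ (or a value $y$) for attribute $a$. In other words, we are calculating the conditional probabilities of having the output class $c$ when attribute $a$ has the value $x$ (or the value $y$).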
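An attribute weighted HVDM can be sketched in Python as below. The sketch follows the standard HVDM definitions (difference scaled by four standard deviations for quantitative attributes, class-conditional value frequencies for the others) and rests on several assumptions that the text does not confirm: the weight is applied to each attribute distance before squaring, the population standard deviation is used, every non-quantitative attribute is treated with the value difference part, and a value not seen in the training set is treated like an unknown value. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from statistics import pstdev

class WeightedHVDM:
    """Attribute weighted HVDM fitted on a training set (None marks a missing value)."""

    def __init__(self, rows, labels, quantitative):
        self.quantitative = set(quantitative)
        self.classes = sorted(set(labels))
        self.sigma = {}    # column -> standard deviation of known values
        self.counts = {}   # column -> value -> Counter(class -> count)
        for col in range(len(rows[0])):
            known = [(r[col], l) for r, l in zip(rows, labels) if r[col] is not None]
            if col in self.quantitative:
                values = [v for v, _ in known]
                self.sigma[col] = pstdev(values) if values else 0.0
            else:
                table = defaultdict(Counter)
                for value, label in known:
                    table[value][label] += 1
                self.counts[col] = table

    def attribute_distance(self, col, x, y):
        if x is None or y is None:
            return 1.0
        if col in self.quantitative:
            sigma = self.sigma[col]
            return abs(x - y) / (4 * sigma) if sigma > 0 else 0.0
        cx = self.counts[col].get(x, Counter())
        cy = self.counts[col].get(y, Counter())
        nx, ny = sum(cx.values()), sum(cy.values())
        if nx == 0 or ny == 0:
            return 1.0  # assumption: unseen values are treated as unknown
        return math.sqrt(sum((cx[c] / nx - cy[c] / ny) ** 2 for c in self.classes))

    def distance(self, x, y, class_weights):
        """class_weights: the weight set of the class used for this comparison."""
        return math.sqrt(sum(
            (w * self.attribute_distance(col, a, b)) ** 2
            for col, (a, b, w) in enumerate(zip(x, y, class_weights))
        ))
```

With the cwk-NN, `class_weights` would be the weight set of the disease class of the neighbour case, so the same test case is compared to neighbours of different classes with different weight sets.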
This approach allowed modifications of all the weights at the same time.
3.2.3. Attribute Weighted k-Nearest Neighbour Method Using One-versus-All Classifiers
In addition to the neighbour’s class-based attribute weighting, the attribute weighted k-nearest neighbour method was tested with one-versus-all classifiers (wk-NN OVA). Within this method, the multiclass classification problem was converted into multiple binary classifiers; that is, the seven-class problem was divided into seven binary problems. Each binary OVA classifier was trained to separate a class from all the other classes by marking the cases of this one class as member cases and the cases of the other classes as nonmember cases in the training set.
The attribute weighted k-NN OVA is an instance-based learning method that searches for the k most similar cases (neighbours) of a new case from each classifier separately. There is one classifier per each class and each classifier gives a vote for the case being a member or nonmember of the class based on the majority class of the k neighbours. The final class of the new case is assigned from a classifier suggesting the case being a member of a class. There can occur a situation in which the new case gets more than one member of a class vote (a tie situation) or all of the classifiers vote for the other class (the case to be a nonmember of all the classes). In a tie situation the class of the new case is determined by searching for the most similar member case from the member voting classifiers. The case gets the class of the member case with the shortest distance to it. When all the classifiers vote for the case to be a nonmember, the basic 1-nearest neighbour classifier using the whole training data containing the original disease classes is employed to find the most similar case (and its class) for the new case.
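The decision rule of the wk-NN OVA, including the tie and the all-nonmember situations, can be summarized with a short Python sketch. The function and argument names are hypothetical; the votes, the distances to the nearest member cases, and the result of the fallback 1-NN classifier are assumed to be computed elsewhere.

```python
def ova_decide(member_votes, nearest_member_distance, fallback_1nn_class):
    """Combine the votes of the binary one-versus-all classifiers.

    member_votes:            dict class -> True if that classifier votes "member"
    nearest_member_distance: dict class -> distance to the nearest member case
    fallback_1nn_class:      class given by the basic 1-NN on the whole training data
    """
    voters = [c for c, is_member in member_votes.items() if is_member]
    if len(voters) == 1:
        return voters[0]
    if len(voters) > 1:
        # Tie: choose the class whose nearest member case is closest.
        return min(voters, key=lambda c: nearest_member_distance[c])
    # Every classifier voted "nonmember".
    return fallback_1nn_class
```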
The distance measure used in the wk-NN OVA was also the HVDM measure. The difference from the HVDM description in (4) is that the class $c$ that selects the attribute weights is the class of the classifier at issue, not the class of the neighbour case $y$. In addition, in (7) the wk-NN OVA has two output classes ($C = 2$). The data in the learning set of the classifier is divided into the member and nonmember classes.
The results of the GA runs with ONE and cwk-NN as an evaluation method were the averages of the 10 times repeated 10-fold cross-validation, whereas the results with the wk-NN OVA were the averages of the 5 times repeated 10-fold cross-validation. The 10-fold cross-validation was repeated only five times with the GA wk-NN OVA due to its long computation time. For example, the evaluation of a population (21 individuals in one generation in a GA run) in one cross-validation set lasted 3 minutes and 25 seconds with the GA ONE, 48 minutes and 54 seconds with the GA cwk-NN, and 4 hours, 57 minutes, and 8 seconds with the GA wk-NN OVA when running the GA with the computer C1 (Table 3). With the other computers, the computation was even slower. Thus, at worst, the computation time of one cross-validation set lasting 20 generations with the computer C1 and GA wk-NN OVA was over four days (over 12 days with C4), assuming that within each generation all individuals were evaluated. In practice, the number of evaluated individuals varied within generations due to the crossover and the mutation. Notice that computers C4 and C5 were servers having several other users simultaneously and, thus, we had only a minor part of their CPU in use. During GA cwk-NN and GA wk-NN OVA runs, the GA was run in parallel on five computers, thus having at best 11 parallel GA runs in process. GA ONE was run only with the computer C1.
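The worst-case computation time of one cross-validation set on the computer C1 follows from multiplying the reported evaluation time of one population by the maximum of 20 generations. The short Python calculation below reproduces this arithmetic; the variable names are hypothetical and the calculation assumes, as the text does, that all 21 individuals are evaluated in every generation.

```python
from datetime import timedelta

GENERATIONS = 20  # maximum number of generations in a GA run (from the text)

# Time to evaluate one population on computer C1, as reported in the text.
per_population = {
    "GA ONE": timedelta(minutes=3, seconds=25),
    "GA cwk-NN": timedelta(minutes=48, seconds=54),
    "GA wk-NN OVA": timedelta(hours=4, minutes=57, seconds=8),
}

worst_case = {name: t * GENERATIONS for name, t in per_population.items()}
for name, total in worst_case.items():
    print(f"{name}: {total}")
```

The bound for the GA wk-NN OVA is 4 days, 3 hours, 2 minutes, and 40 seconds, in line with the "over four days" stated above, compared with roughly 1 hour and 8 minutes for the GA ONE and 16 hours and 18 minutes for the GA cwk-NN.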
The number of generations in the GA runs with all used evaluation methods varied from 10 to 20. In total, 75.0%, 18.0%, and 82.9% of the GA runs with the ONE method, cwk-NN, and wk-NN OVA, respectively, ended before the 20th generation because the best accuracy (GA ONE and GA cwk-NN) or TPR (GA wk-NN OVA) had stayed the same for 10 consecutive generations (Table 4). With the GA wk-NN OVA, all the GA runs with the disease classes sudden deafness, traumatic vertigo, and benign recurrent vertigo ended before the 20th generation, and with the other classes from 58.0% to 88.0% of the runs did. If the ending generation was 10, the best ACC or TPR in the population did not change at all during the GA run and, therefore, the run was ended. GA cwk-NN ended after 10 generations in only 6.0% of the GA runs, whereas GA ONE and GA wk-NN OVA ended after 10 generations in around half of the runs (in 48.0% and 54.9% of runs, resp.). In the GA wk-NN OVA runs, this happened especially with the disease class traumatic vertigo, where all CV runs ended after 10 generations, and with sudden deafness (96.0%) and benign recurrent vertigo (94.0%). With the other disease classes, from 12.0% (acoustic neurinoma) to 34.0% (vestibular neuritis) of the GA wk-NN OVA runs ended after 10 generations. Most of the GA cwk-NN runs lasted 20 generations (82.0%), whereas only a fourth of the GA ONE runs and 17.1% of the GA wk-NN OVA runs went through 20 generations.
Within the GA wk-NN OVA runs of the disease class benign recurrent vertigo, there were situations where the TPRs in the starting population were zero even though the TPR of 3-NN was used in the population evaluation. The TPR of 3-NN was used with BRV instead of 7-NN because of the small size of the disease class. The TPRs of the starting individuals were zero in 30 out of 50 cross-validation sets within the GA wk-NN OVA run concentrating on the BRV. In this case, new starting individuals were created randomly. Depending on the cross-validation set, the random individual creation was repeated from one to five times or nine times. The GA wk-NN OVA run ended if the TPR of the starting population stayed zero ten times. This happened in 14 (28.0%) cross-validation sets, and only with the disease class benign recurrent vertigo.
In order to see the effect of genetic algorithm on the population, we examined the worst and the best total classification accuracies of individuals (the attribute weight vectors) in the beginning and in the end of the genetic algorithm run. The mean worst and the mean best total accuracies and their standard deviations with GA runs using ONE and cwk-NN as an evaluation method were calculated from 10 times repeated 10-fold cross-validation and with GA runs using wk-NN OVA from 5 times repeated 10-fold cross-validation (Table 5). The mean best accuracies stayed approximately the same with the GA ONE, whereas the mean best accuracy increased 4.7% with the GA cwk-NN and decreased 0.6% with the GA wk-NN OVA. The improvement can be seen from the mean worst classification accuracies: the worst accuracy occurring in the population increased during GA runs, especially with the GA cwk-NN (28.3%). With the GA ONE, the mean worst accuracy improved 11.6% when using at most 20 generations and 16.7% when using at most 100 generations. With the GA wk-NN OVA, the improvement was moderate (3.4%) but one must notice that its mean worst classification accuracy was already over 75% in the starting population, which was better than the mean best accuracies of the other methods.
The more detailed results of the GA ONE, the GA cwk-NN, and the GA wk-NN OVA runs in the beginning and in the end with the best individual occurring in the population are given in Tables 6 and 7. The true positive rates of the disease classes are shown with GA ONE for the first (ONE1), the first and second (ONE12), and the first, second, and third (ONE123) diagnosis suggestions of ONE and with GA cwk-NN and GA wk-NN OVA for one, three, five, seven, and nine nearest neighbours (1-NN–9-NN). During cross-validation runs in GA, the individuals were evaluated by the total classification accuracy of the ONE1 with the GA ONE and of the 7-NN with the GA cwk-NN and by the true positive rate of the 7-NN with the GA wk-NN OVA (except with disease class BRV that used the TPR of 3-NN). The true positive rate was used as a fitness rate with the GA wk-NN OVA instead of the total accuracy because it concentrated on classifying one disease class at a time whereas GA ONE and GA cwk-NN classified all seven disease classes at the same time.
In the GA runs lasting at most 20 generations, the best improvement between the start population and the end population was achieved with the GA cwk-NN, which improved the total classification accuracies and the mean true positive rates when using one to nine nearest neighbours in the classification. The total classification accuracy of the GA cwk-NN rose at best by 5.1% (in 9-NN) and the median TPR by 10.7% (in 9-NN). The GA had a smaller effect on the results of the GA ONE and the GA wk-NN OVA: the results in the start population and in the end population stayed quite near each other. A small improvement in the mean total classification accuracy and the mean TPR can be seen with the GA wk-NN OVA using one or three nearest neighbours in the classification. Otherwise, the total classification accuracies decreased slightly with the GA ONE and with the GA wk-NN OVA using five or seven nearest neighbours in the classification.
Changes within the true positive rates of disease classes compared to the start and end results varied between methods. The GA cwk-NN mainly increased the TPRs. During GA runs, it increased the most the TPR of acoustic neurinoma (22.6% in 1-NN) and traumatic vertigo (16.3% in 9-NN). Menière’s disease was the only class where the TPR decreased (at worst −7.3% in 1-NN) during GA cwk-NN runs. With the GA ONE, the TPRs of classes mainly decreased. It decreased the most the TPR of benign recurrent vertigo (−7.5% in ONE12) and sudden deafness (−3.4% in ONE12). However, small increase in TPR can be seen with acoustic neurinoma (1.0% in ONE12) and with vestibular neuritis (0.8% with ONE12). With the GA wk-NN OVA, some TPRs increased and some decreased. The TPR increased the most with traumatic vertigo (5.8% in 3-NN) and sudden deafness (3.8% in 7-NN) and decreased the most with benign recurrent vertigo (−3.0% in 9-NN) and benign positional vertigo (−2.9% in 7-NN).
Because computation with the ONE method was so much faster than with the k-nearest neighbour methods, the evolution of the population in the GA ONE runs was also tested with 100 generations in addition to the 20 generations. The ending condition was changed accordingly: the GA run ended if the maximum accuracy stayed the same for 50 successive generations or when 100 generations had been run. In total, 39.0% of the GA ONE100 runs ended before the 100th generation, and in 12.0% of the runs there was no change in the best total classification accuracy during 50 generations (Table 4). The classification results of the GA ONE100 runs are given in Table 8. Increasing the number of generations from 20 to 100 had little effect on the mean total classification accuracy or the mean median TPR. Within the disease classes, benign recurrent vertigo suffered the most from the increase in generations: its true positive rate changed at worst by −16.0% (ONE12) compared to the starting population and by −9.5% (ONE123) compared to the 20th generation. The best TPR increase was achieved with acoustic neurinoma: 3.9% from the starting population and 3.6% from the 20th generation.
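The ending condition can be sketched as a small Python function. This is a hedged illustration, not the authors' code: the name `generations_run`, its parameters, and the toy accuracy histories are hypothetical, and the sketch assumes that the first generation already counts towards the streak of unchanged best accuracies, so that a run without any change ends after exactly `patience` generations.

```python
def generations_run(best_per_generation, max_generations=20, patience=10):
    """Count the generations run before the GA stops.

    The run stops when the best accuracy has stayed the same for `patience`
    successive generations or when `max_generations` have been run.
    Assumption: the first generation starts the streak at 1.
    """
    streak = 0
    previous = None
    count = 0
    for best in best_per_generation[:max_generations]:
        count += 1
        # Extend the streak if the best accuracy is unchanged, else restart it.
        streak = streak + 1 if best == previous else 1
        previous = best
        if streak >= patience:
            break
    return count


# Toy histories of the best accuracy per generation (invented values).
no_change = [0.70] * 200
always_improving = [i / 100 for i in range(50, 80)]
one_early_gain = [0.70] * 3 + [0.72] * 30

stop_no_change = generations_run(no_change)            # stops by patience
stop_improving = generations_run(always_improving)     # stops at the maximum
stop_early_gain = generations_run(one_early_gain)      # streak restarts once
stop_long_run = generations_run(no_change, max_generations=100, patience=50)
```

With the default arguments the sketch corresponds to the 20-generation setting, and the last call to the 100-generation setting of the GA ONE100 runs.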
5. Discussion and Conclusion
Genetic algorithm runs were done with three different population evaluation methods in order to see whether the classification performance of the attribute weighted methods based on the nearest neighbour search can be improved when using the genetic algorithm in the evolution of attribute weighting. The attribute weighting in the starting population was based on the weights described by the application area experts and machine learning methods instead of random weight setting. The genetic algorithm runs were done separately with the nearest pattern method of ONE (GA ONE), with the attribute weighted k-nearest neighbour method using neighbour’s class-based attribute weighting (GA cwk-NN), and with the attribute weighted k-nearest neighbour method using one-versus-all classifiers (GA wk-NN OVA). The 10-fold cross-validation was repeated 10 times with GA ONE and GA cwk-NN and 5 times with GA wk-NN OVA because of its very long computation time.
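As a rough illustration of the repeated 10-fold cross-validation scheme, the following Python sketch generates the train and test index sets. The function name, the seed, and the case count of 95 are assumptions made for this example, and the sketch uses a plain random partition; whether the folds in the study were stratified by disease class is not stated in this section.

```python
import random


def repeated_k_fold(n_cases, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_cases))
        rng.shuffle(indices)  # a new random partition for every repetition
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, test in enumerate(folds):
            # All cases outside the current test fold form the training set.
            train = [i for other in folds if other is not test for i in other]
            yield repeat, fold, train, sorted(test)


# Hypothetical data set size, chosen only for the illustration.
splits = list(repeated_k_fold(n_cases=95))
```

Each of the 10 repetitions contributes 10 train/test splits, so one GA configuration is evaluated on 100 splits in this sketch; setting `n_repeats=5` would mirror the smaller number of repetitions used with the GA wk-NN OVA.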
A GA run lasted at most 20 generations and ended after 10 generations if there was no change in the best classification accuracy. Most of the GA runs with GA ONE and GA wk-NN OVA ended before the 20th generation (75.0% and 82.9%, resp.), and around half of the GA runs ended without a change in the best classification accuracy (ended after 10 generations; 48.0% and 54.9%, resp.). Only 18.0% of the GA cwk-NN runs ended before the 20th generation and 6.0% after 10 generations.
The total classification accuracies and the mean true positive rates improved in the GA cwk-NN runs, whereas with GA ONE and GA wk-NN OVA the results in the beginning and in the end population stayed quite close to each other. One reason why the GA did not much improve the total classification accuracies with the GA ONE and the GA wk-NN OVA might be that the attribute weights used in the starting population were already optimized for separate disease classes. In addition, the fitness values for the ONE method can be said to be the best occurring fitness values because they were computed from the otoneurological data with the machine learning method.
Hussein et al.  noticed that in some applications a strict cost-benefit analysis may rule out the use of genetic algorithm optimization because of its increase in processing time (e.g., a 100–150% increase in computation time compared to the basic classifier with 200 train and test cases and over 400% when using 3824 train cases and 1797 test cases with k-NN leave-one-out). Also, Kelly and Davis  admit that it can take a tremendous amount of time to find high-performance weight vectors for variably weighted machine learning methods. The results in  showed that the extensions of the k-NN yielded generally better results at the cost of speed since all extensions required a training phase. In this research, the GA wk-NN OVA was very time-consuming compared to GA cwk-NN and GA ONE. However, if the weight calculation needs to be done only once or quite seldom, the time issue is not that crucial, especially if the calculation improves the performance of the method.
In this study the weights set by the experts and learnt by machine learning methods were used as a starting point. This helped the search for appropriate weights considerably, but there might be different attribute weight combinations with equally good or even better classification results. Therefore, it would be good to test the genetic algorithm also with a totally random starting population and with several different parameters in offspring creation and mutation.
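The difference between the two kinds of starting population can be sketched in Python as follows. This is only an assumption-laden illustration: how the study actually built its starting population from the expert and machine-learnt weights is not described in this section, and the function names, the weight range [0, 1], and the example weight vectors are hypothetical.

```python
import random


def seeded_population(reference_weight_sets, population_size):
    """Fill the population by cycling through copies of given weight vectors."""
    return [list(reference_weight_sets[i % len(reference_weight_sets)])
            for i in range(population_size)]


def random_population(n_attributes, population_size, rng):
    """Fill the population with weights drawn uniformly from [0, 1)."""
    return [[rng.random() for _ in range(n_attributes)]
            for _ in range(population_size)]


# Invented weight vectors standing in for expert-set and machine-learnt weights.
expert_weights = [1.0, 0.5, 0.0, 0.8]
learnt_weights = [0.9, 0.4, 0.2, 0.7]

rng = random.Random(42)
seeded = seeded_population([expert_weights, learnt_weights], population_size=6)
randomised = random_population(n_attributes=4, population_size=6, rng=rng)
```

A seeded population starts the search near weight sets that are already known to work, whereas a random one covers the search space more widely, which is the trade-off behind the suggested follow-up experiment.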
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
The first author acknowledges the support of Onni and Hilja Tuovinen Foundation, Oskar Öflund’s Foundation, and Finnish Cultural Foundation, Päijät-Häme Regional fund who granted scholarships for her postgraduate studies. The authors are grateful to Docent E. Kentala, M.D., and Professor I. Pyykkö, M.D., for their help in collecting the otoneurological data and medical advice.
T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
T. Mitchell, Machine Learning, McGraw-Hill, New York, NY, USA, 1997.
Z. Voulgaris and G. D. Magoulas, “Extensions of the k nearest neighbour methods for classification problems,” in Proceedings of the 26th IASTED International Conference on Artificial Intelligence and Applications (AIA '08), pp. 23–28, ACTA Press, Anaheim, Calif, USA, February 2008.
J. M. Sotoca, J. S. Sánchez, and F. Pla, “Estimating feature weights for distance-based classification,” in Proceedings of the 3rd International Workshop on Pattern Recognition in Information Systems (PRIS '03), Angers, France, 2003.
E. Marchiori, A. Ngom, E. Formenti, J.-K. Hao, X.-M. Zhao, and T. van Laarhoven, “Class dependent feature weighting and k-nearest neighbor classification,” in Proceedings of the 8th IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB '13), vol. 7986 of LNBI, pp. 69–78, Springer, Berlin, Germany, 2013.
J. D. Kelly and L. Davis, “A hybrid genetic algorithm for classification,” in Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI '91), vol. 2, pp. 645–650, Morgan Kaufmann, San Francisco, Calif, USA, 1991.
W. F. Punch, E. D. Goodman, M. Pei, L. Chia-Shun, P. Hovland, and R. Enbody, “Further research on feature selection and classification using genetic algorithms,” in Proceedings of the 5th International Conference on Genetic Algorithms (ICGA '93), pp. 557–564, University of Illinois, Champaign, Ill, USA, 1993.
F. Hussein, N. Kharma, and R. Ward, “Genetic algorithms for feature selection and weighting, a review and study,” in Proceedings of the 6th International Conference on Document Analysis and Recognition (ICDAR '01), pp. 1240–1244, Seattle, Wash, USA, 2001.
H. Lee, E. Kim, and M. Park, “A genetic feature weighting scheme for pattern recognition,” Integrated Computer-Aided Engineering, vol. 14, no. 2, pp. 161–171, 2007.
D. Mateos-García, J. García-Gutiérrez, and J. C. Riquelme-Santos, “Label dependent evolutionary feature weighting for remote sensing data,” in Proceedings of the 5th International Conference on Hybrid Artificial Intelligence Systems, pp. 272–279, Springer, 2010.
D. Mateos-García, J. García-Gutiérrez, and J. C. Riquelme-Santos, “On the evolutionary optimization of k-NN by label-dependent feature weighting,” Pattern Recognition Letters, vol. 33, no. 16, pp. 2232–2238, 2012.
D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, Mass, USA, 1989.
M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, Cambridge, Mass, USA, 1996.
K. Varpa, K. Iltanen, M. Juhola et al., “Refinement of the otoneurological decision support system and its knowledge acquisition process,” in Proceedings of the 20th International Congress of the European Federation for Medical Informatics (MIE '06), pp. 197–202, Maastricht, The Netherlands, 2006.
K. Varpa, K. Iltanen, M. Siermala, and M. Juhola, “Attribute weighting with scatter and instance-based learning methods evaluated with otoneurological data,” International Journal of Computational Medicine and Healthcare, 2013.
K. Viikki, Machine learning on otoneurological data: decision trees for vertigo diseases [Ph.D. thesis], Department of Computer Sciences, University of Tampere, Tampere, Finland, 2002, http://urn.fi/urn:isbn:951-44-5390-5.
K. A. De Jong, Analysis of the behaviour of a class of genetic adaptive systems [Ph.D. thesis], Computer and Communication Sciences Department, The University of Michigan, Ann Arbor, Mich, USA, 1975, http://hdl.handle.net/2027.42/4507.
M. Juhola and M. Siermala, “Scatter Counter program and its instructions,” 2014, http://www.uta.fi/sis/cis/research_groups/darg/publications/scatterCounter_2_7_eng.pdf.
E. Kentala, Y. Auramo, M. Juhola, and I. Pyykkö, “Comparison between diagnoses of human experts and a neurotologic expert system,” Annals of Otology, Rhinology and Laryngology, vol. 107, no. 2, pp. 135–140, 1998.
E. Kentala, Y. Auramo, I. Pyykkö, and M. Juhola, “Otoneurological expert system,” Annals of Otology, Rhinology and Laryngology, vol. 105, no. 8, pp. 654–658, 1996.
M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, “An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes,” Pattern Recognition, vol. 44, no. 8, pp. 1761–1776, 2011.