Abstract

Background. Guillain-Barré Syndrome (GBS) is a potentially fatal autoimmune neurological disorder. Severity varies among the four main subtypes, namely, Acute Inflammatory Demyelinating Polyneuropathy (AIDP), Acute Motor Axonal Neuropathy (AMAN), Acute Motor Sensory Axonal Neuropathy (AMSAN), and Miller-Fisher Syndrome (MF). Proper subtype identification may help to promptly administer adequate treatment to patients. Method. We performed experiments with 15 single classifiers in two scenarios: four GBS subtypes' classification and One versus All (OvA) classification. We used a dataset with the 16 relevant features identified in a previous phase. Performance was evaluated by 10-fold cross validation (10-FCV) using typical classification performance measures. A statistical test was conducted to identify the top five classifiers for each case. Results. In four GBS subtypes' classification, half of the classifiers investigated in this study obtained an average accuracy above 0.90. In OvA classification, the two subtypes with the largest number of instances yielded the best classification results. Conclusions. This study represents a comprehensive effort toward creating a predictive model for Guillain-Barré Syndrome subtypes. The analysis performed in this work also provides insight into the best single classifiers for each classification case.

1. Introduction

Guillain-Barré Syndrome (GBS) is an autoimmune neurological disorder characterized by fast evolution, usually progressing over a few days and up to four weeks. Complications of GBS vary among its subtypes, mainly Acute Inflammatory Demyelinating Polyneuropathy (AIDP), Acute Motor Axonal Neuropathy (AMAN), Acute Motor Sensory Axonal Neuropathy (AMSAN), and Miller-Fisher Syndrome (MF) [1, 2].

The current GBS subtype classification method consists of a clinical inspection by physicians, guided by criteria established by specialists. This initial diagnosis is reinforced by nerve conduction tests, which help to differentiate among subtypes [1]. This method implies performing long, expensive, and uncomfortable tests. Previous efforts in GBS have focused on predicting outcome at 6 months in the acute phase using clinical characteristics [3], early recognition of poor prognosis [4], and prediction of respiratory insufficiency [5–7]. To date, no published studies using machine learning methods for GBS subtype classification have been found.

In this study, we investigate the predictive power of a reduced set of only 16 features selected from an original dataset of 365 features. This dataset holds data from 129 Mexican patients and contains the four aforementioned GBS subtypes. We apply 15 representative single classifiers from diverse approaches: decision trees (C4.5), instance-based learners (NN: nearest neighbor), kernel-based learners (SVM: Support Vector Machines), neural networks (SLP, MLP, and RBF-DDA), and rule induction learners (OneR, JRip), among others.

We performed experiments in three classification scenarios: four GBS subtypes' classification, OvA (One versus All), and OvO (One versus One). For clarity and due to space limitations, we present detailed results of the first two scenarios; details of the OvO scenario are available from the authors upon request.

This study represents a comprehensive effort toward creating a predictive model for GBS subtypes. The analysis performed in this work also provides insight into the best single classifiers for each classification case. Further experiments with other algorithms will follow.

This paper is organized as follows. In Section 2, we describe the dataset, the metrics used in the study, the classifiers, and the experimental design. In Section 3, we describe the parameter optimization of the classifiers. In Section 4, we present the experimental results, which we discuss in Section 5. Finally, in Section 6, we summarize the conclusions of the study and suggest future work.

2. Materials and Methods

2.1. Data

The dataset used in this work comprises 129 cases of patients who received treatment at the Instituto Nacional de Neurología y Neurocirugía in Mexico City. There are 20 AIDP cases, 37 AMAN, 59 AMSAN, and 13 Miller-Fisher cases; hence, all four GBS subtypes are represented in this dataset.

In a previous work [8], we identified a set of 16 relevant features out of an original 365-feature dataset. The features are listed in Table 1. Features V22, V29, V30, and V31 are all clinical and the remaining features come from a nerve conduction test. The method used to identify these 16 features is briefly described below.

First, we preselected variables from the original dataset based on diagnostic criteria for GBS established in the literature. After preselection, 156 variables remained: 121 variables from the nerve conduction test, 4 variables from the CSF analysis, and 31 clinical variables. We then used a novel method combining Quenching Simulated Annealing (QSA) and Partitioning Around Medoids (PAM), named the QSA-PAM method. We chose a clustering technique because it is useful for studying the internal structure of the data and disclosing groups of homogeneous instances. Since we knew in advance that four GBS subtypes (classes) exist in our dataset, we took advantage of this information to identify relevant features that allow building four clusters, each corresponding to a GBS subtype. The purity metric was used to determine the quality of each cluster: the highest purity is reached when clusters contain the largest number of elements of the same type and the fewest elements of a different type.
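
As an illustration, the purity computation can be sketched as follows (a minimal NumPy implementation of the usual definition; the exact formulation used in [8] may differ):

```python
import numpy as np

def purity(cluster_labels, class_labels):
    """Purity: assign each cluster its majority class and return the
    fraction of instances that match their cluster's majority class."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    majority_total = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        majority_total += counts.max()  # size of the majority class
    return majority_total / len(class_labels)

# e.g., purity([0, 0, 1, 1], ["AIDP", "AIDP", "AMAN", "MF"]) == 0.75
```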

QSA [9] is a version of Simulated Annealing (SA), a general purpose randomized metaheuristic that finds good approximations to the optimal solution of large combinatorial problems. QSA was used to select different random feature subsets from the dataset. New datasets created from these feature subsets were used as input to PAM to build four clusters, and the purity of the resulting clusters was measured. Sixteen features from the original dataset were found relevant for identifying GBS subtypes, with the highest purity of 0.8992.
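
A rough sketch of such a search loop is given below, under the assumption that a k-medoids implementation (here, scikit-learn-extra's KMedoids) stands in for PAM; the actual moves, cooling schedule, and stopping criterion of QSA [9] may differ:

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumed stand-in for PAM

def qsa_pam_search(X, y, n_clusters=4, subset_size=16,
                   n_iter=500, t0=1.0, cooling=0.95, seed=0):
    """Annealed search over feature subsets scored by cluster purity
    (a sketch of the QSA-PAM idea, not the exact method of [8, 9])."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]

    def score(subset):
        labels = KMedoids(n_clusters=n_clusters,
                          random_state=seed).fit_predict(X[:, subset])
        return purity(labels, y)  # purity() from the sketch above

    current = rng.choice(n_features, subset_size, replace=False)
    best, best_score = current.copy(), score(current)
    current_score, t = best_score, t0
    for _ in range(n_iter):
        # neighbour move: swap one selected feature for an unselected one
        candidate = current.copy()
        pool = np.setdiff1d(np.arange(n_features), candidate)
        candidate[rng.integers(subset_size)] = rng.choice(pool)
        cand_score = score(candidate)
        # accept improvements, or worse subsets with Boltzmann probability
        if (cand_score > current_score
                or rng.random() < np.exp((cand_score - current_score) / t)):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current.copy(), current_score
        t *= cooling  # fast geometric ("quenched") cooling
    return best, best_score
```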

2.2. Single Classifiers

In this study, we include results from 15 representative single classifiers from diverse approaches: decision trees (C4.5), instance-based learners (NN: nearest neighbor), kernel-based learners (SVM: Support Vector Machines), neural networks (SLP, MLP, and RBF-DDA), rule induction learners (OneR, JRip), and logistic regression, among others. The complete list is given in Table 2, where the tuning parameters are also shown. A detailed description of these classifiers can be found in [10–18]. Results from NN, SVM, and C4.5 were previously published [19, 20]; in this work, we used them in a comparative analysis among all 15 classifiers. From each approach, we selected the best classifiers based on their performance. The idea is to initially explore different single classifiers and analyze their performance in GBS subtype classification. From the machine learning perspective, it is always useful to analyze the classification power of different classifiers in diverse tasks.
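
To give a concrete picture, several of these classifiers could be instantiated with scikit-learn as below. Note that C4.5, OneR, JRip, SLP, and RBF-DDA have no direct scikit-learn equivalents (they are typically found in Weka or R), and a Laplacian-kernel SVM would require a custom kernel, so these are approximate stand-ins rather than the exact implementations of Table 2:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# approximate scikit-learn stand-ins for part of the 15 classifiers;
# hyperparameters here are defaults, not the tuned values of Table 2
classifiers = {
    "NN": KNeighborsClassifier(n_neighbors=1),            # 1-nearest neighbor
    "C4.5": DecisionTreeClassifier(criterion="entropy"),  # C4.5-like tree
    "SVMLin": SVC(kernel="linear"),
    "SVMPoly": SVC(kernel="poly"),
    "SVMGaus": SVC(kernel="rbf"),                         # Gaussian (RBF) kernel
    "MLP": MLPClassifier(max_iter=1000),
    "NaiveBayes": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "LogReg": LogisticRegression(max_iter=1000),
}
```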

2.3. Performance Measures

We used typical machine learning performance measures: AUC (Area under the Curve), average accuracy, and balanced accuracy. Average accuracy is used in four GBS subtypes' classification, since it is a more suitable measure for multiclass classification problems. Balanced accuracy is used in OvA classification, because it is a better performance estimate for imbalanced datasets.
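
For reference, balanced accuracy is the mean of sensitivity and specificity, and one common definition of multiclass average accuracy is the per-class (one-vs-rest) accuracy averaged over classes. A minimal sketch of both (the exact definitions used in the study may differ):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def balanced_accuracy(y_true, y_pred):
    """(sensitivity + specificity) / 2 for a binary problem."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

def average_accuracy(y_true, y_pred):
    """Per-class (one-vs-rest) accuracy, averaged over the classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean((y_true == c) == (y_pred == c))
                 for c in np.unique(y_true)]
    return float(np.mean(per_class))
```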

2.4. Experimental Design

We used the 16-feature subset described in Section 2.1 for the experiments and added the GBS subtype as the class variable, creating a dataset with 129 instances and 17 features. We used 10-fold cross validation (10-FCV) in all cases; we chose this validation scheme since it is more suitable for our limited dataset. We performed 30 10-FCV runs for each method listed in Section 2.2. For each fold, we computed average accuracy (balanced accuracy for OvA) and AUC (multiclass AUC for four GBS subtypes' classification). After the 10 folds, we calculated the average of each measure, and finally we averaged each of these quantities across the 30 runs. In each 10-FCV run, we set a different seed to ensure different train/test splits across runs, and all classifiers used the same seed in the same run. These seeds were generated using the Mersenne-Twister pseudorandom number generator [21].
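
A sketch of this protocol follows, reusing the hypothetical `classifiers` dictionary and `average_accuracy` function from the sketches above and assuming data arrays `X` and `y`; the seed value is illustrative (NumPy's legacy RandomState is a Mersenne-Twister generator):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

# 30 repetitions of 10-FCV; the same Mersenne-Twister seed is shared by
# all classifiers within a run, so every method sees identical splits
seeds = np.random.RandomState(2017).randint(0, 2**31 - 1, size=30)

results = {name: [] for name in classifiers}
for seed in seeds:
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=int(seed))
    for name, model in classifiers.items():
        fold_scores = []
        for train_idx, test_idx in cv.split(X, y):
            fitted = clone(model).fit(X[train_idx], y[train_idx])
            fold_scores.append(
                average_accuracy(y[test_idx], fitted.predict(X[test_idx])))
        results[name].append(np.mean(fold_scores))   # per-run average

# final report: mean and standard deviation across the 30 runs
summary = {name: (np.mean(r), np.std(r)) for name, r in results.items()}
```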

We performed experiments in three classification scenarios: four GBS subtypes' classification, OvA, and OvO. As stated in the Introduction, we present detailed results of the first two scenarios; details of the OvO scenario are available from the authors upon request.

In the first scenario, the four GBS subtypes (AIDP, AMAN, AMSAN, and MF) were included in the dataset at the same time. The OvA strategy consists of building binary classifiers. In this work, we made four different OvA classifications, one per GBS subtype in the dataset; hence, we created four new datasets. In each dataset, the instances of one class were marked as the positive cases and the instances of the remaining classes as the negative cases. The OvO strategy also consists of building binary classifiers. In this work, we made six different OvO classifications, one per pair of GBS subtypes; therefore, we created six new datasets. We aimed to investigate how well classifiers distinguish each subtype from the others. Each OvO dataset contained instances of only two GBS subtypes, with one class marked as the positive case and the other as the negative case.
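
Constructing the binary datasets is mechanical; a short sketch, assuming `y` is an array of subtype names:

```python
import numpy as np
from itertools import combinations

subtypes = ["AIDP", "AMAN", "AMSAN", "MF"]

# OvA: four label vectors over the full dataset; one subtype is the
# positive class and the three remaining subtypes are the negative class
ova_labels = {s: np.where(np.asarray(y) == s, "positive", "negative")
              for s in subtypes}

# OvO: six row masks, one per pair of subtypes; each binary dataset
# keeps only the instances belonging to the two subtypes compared
ovo_masks = {(a, b): np.isin(y, [a, b])
             for a, b in combinations(subtypes, 2)}
```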

3. Parameter Optimization

MLP, SLP, RBF-DDA, and JRip each require particular parameter optimization, as mentioned in Section 2.2. These parameters were automatically optimized by each method in each of the 30 runs; therefore, the best parameters for each run were used for classification.
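
One way such per-run optimization could look, sketched as a grid search run inside each repetition (the parameter grid below is purely illustrative, not the grid of Table 2):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# hypothetical inner search, run once per 10-FCV repetition; the best
# parameters found on the training data are then used for classification
mlp_search = GridSearchCV(
    MLPClassifier(max_iter=1000),
    param_grid={"hidden_layer_sizes": [(4,), (8,), (16,)],
                "learning_rate_init": [0.01, 0.1]},
    cv=5,
    scoring="accuracy",
)
mlp_search.fit(X_train, y_train)   # assumes a training split exists
best_mlp = mlp_search.best_estimator_
```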

4. Results

4.1. Four GBS Subtypes’ Classification

In this section, we show the classification results for the four GBS subtypes. All tables show the average results of each classifier across 30 runs, and all figures show the average accuracy of each classifier across 30 runs. In both cases, the standard deviation of each metric is shown.

Table 3 shows the four GBS subtypes' classification results. Six classifiers, almost half of all the classifiers, obtained an average accuracy above 0.90; the best were NN, SVMLap, SVMPoly, SVMGaus, C4.5, and SVMLin. Five of the remaining classifiers obtained an average accuracy around 0.89, two were around 0.88, and OneR showed the worst performance, with an average accuracy under 0.80 and overall poor results in all metrics.

As Figure 1 shows, most of the classifiers obtained an average accuracy around 0.90. Six of them were above this value, and on average, the standard deviation was around 0.01.

4.2. OvA Classification Results

In this section, we describe the results of OvA classification, that is, AIDP versus ALL, AMAN versus ALL, AMSAN versus ALL, and MF versus ALL. As mentioned before, we used balanced accuracy as our base metric in the OvA classification scenario. All tables and figures show the average results of each classifier across 30 runs. In all cases, the standard deviation of each metric is shown.

Table 4 shows the average results across 30 runs in OvA classification. In AIDP versus ALL, four classifiers obtained a balanced accuracy above 0.80: NN, C4.5, MLP, and SVMLap. In AMAN versus ALL, nine classifiers obtained a balanced accuracy above 0.90; four of them were SVM with all different kernels. In AMSAN versus ALL, the best five classifiers were NN, C4.5, SVMLap, SLP, and RBF-DDA. They obtained a balanced accuracy above 0.86. MF versus ALL obtained the worst classification performance.

As shown in Figure 2, AMSAN versus ALL showed the most stable performance both in balanced accuracy and in standard deviation across 30 runs. The opposite case was MF versus ALL. AMAN versus ALL obtained the highest classification performance; AMSAN versus ALL was the second best.

4.3. Statistical Analysis

We investigated whether there was any statistically significant difference among the top five classifiers in average accuracy (balanced accuracy in the OvO and OvA scenarios) across 30 runs, in all classification scenarios. For this analysis, we used the Friedman test. An additional post hoc analysis with Holm's correction was performed in cases where the null hypothesis was rejected. We selected the Friedman test since it is suitable for this type of analysis and also because it is a nonparametric test; that is, no assumption about the data distribution is needed. Holm's correction is used to control the family-wise error in multiple hypothesis testing. Among the available correction procedures, we selected Holm's because it is a powerful method and makes no additional assumptions about the hypotheses tested. More details about these tests can be found in [22].

The post hoc analysis uses a modified alpha value equal to α/(k − i), where α is the significance level, k is the number of classifiers, and i is the rank of the hypothesis under test. In all tests, we used α = 0.05.
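
As an illustration, the Friedman test and a Holm step-down comparison against the top-ranked classifier could be run as follows (the score matrix is synthetic stand-in data; real values come from the 30 runs):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata, norm

# rows = 30 runs, columns = 5 classifiers; synthetic stand-in values
scores = np.random.default_rng(1).normal(0.90, 0.01, size=(30, 5))

stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")

if p < 0.05:  # post hoc only when the null hypothesis is rejected
    n_runs, k = scores.shape
    avg_ranks = rankdata(-scores, axis=1).mean(axis=0)  # rank 1 = best
    control = int(np.argmin(avg_ranks))                 # top-ranked classifier
    se = np.sqrt(k * (k + 1) / (6.0 * n_runs))
    # p-value of each classifier vs. the control, sorted ascending
    pvals = sorted(
        (2 * norm.sf(abs(avg_ranks[j] - avg_ranks[control]) / se), j)
        for j in range(k) if j != control)
    for i, (pv, j) in enumerate(pvals):
        # Holm step-down thresholds: alpha/4, alpha/3, alpha/2, alpha (k = 5)
        threshold = 0.05 / (k - 1 - i)
        print(f"classifier {j}: p = {pv:.4f} vs. threshold {threshold:.4f}")
```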

4.3.1. Four GBS Subtypes’ Classification

In Table 5, we show the Friedman test results of the comparison among the top five classifiers in average accuracy across 30 runs in four GBS subtypes' classification. The complete list of the top five classifiers for each case is shown in Table 6. No statistically significant difference among the top five classifiers in average accuracy across 30 runs was found.

In all cases, our null hypothesis was H0: there is no statistically significant difference in average accuracy among the top five classifiers across 30 runs; the alternative hypothesis was H1: there is a statistically significant difference in average accuracy among the top five classifiers across 30 runs.

In Table 6 we show the average ranks for the top five classifiers in four GBS subtypes’ classification. As mentioned before, no statistically significant difference among the top five classifiers was found.

In Table 7, the results of the post hoc test with Holm’s correction of the top five classifiers in four GBS subtypes’ classification are shown. No statistically significant difference between SVMPoly and the other four classifiers was found.

4.3.2. OvA Classification

In Table 8 we show the Friedman test results of the comparison among the top five classifiers in balanced accuracy across 30 runs, in OvA classification. The complete list of the top five classifiers for each case is shown in Table 9. A statistically significant difference among the top five classifiers in balanced accuracy across 30 runs was found in all OvA classifications.

In all cases, our null hypothesis was H0: there is no statistically significant difference in balanced accuracy among the top five classifiers across 30 runs; the alternative hypothesis was H1: there is a statistically significant difference in balanced accuracy among the top five classifiers across 30 runs.

In Table 9, we show the average ranks for the top five classifiers in OvA classification. We highlight the first-ranked classifiers only in cases where a statistically significant difference was found. The first-ranked classifiers were MLP (average rank 2.17) for AIDP versus ALL, SVMPoly (2.37) for AMAN versus ALL, NN (1.40) for AMSAN versus ALL, and Naive Bayes (1.20) for MF versus ALL.

In Table 10, the results of the post hoc test with Holm’s correction of the top five classifiers in AIDP versus ALL classification are shown. A statistically significant difference between MLP and LDA was found, as well as between MLP and NN.

In Table 11, the results of the post hoc test with Holm’s correction of the top five classifiers in AMAN versus ALL classification are shown. A statistically significant difference between SVMPoly and MLP was found.

In Table 12, the results of the post hoc test with Holm's correction of the top five classifiers in AMSAN versus ALL classification are shown. A statistically significant difference between NN and the rest of the classifiers was found.

In Table 13, the results of the post hoc test with Holm's correction of the top five classifiers in MF versus ALL classification are shown. A statistically significant difference between Naive Bayes and the rest of the classifiers was found.

5. Discussion

Our objective in this work was to create the most accurate predictive model for GBS possible, using the 16 relevant features identified with the QSA-PAM method. This work constitutes the first effort on this topic using machine learning methods. For this first approach, we used single classifiers. We selected 15 single classifiers of diverse types: decision trees (C4.5), instance-based learners (NN), kernel-based learners (SVM), neural networks (SLP, MLP, and RBF-DDA), and rule induction learners (OneR, JRip), among others; the complete list is in Section 2.2. We compared their performance in three types of experiments: four GBS subtypes' classification, OvA classification, and OvO classification.

5.1. Four GBS Subtypes’ Classification

The best classifiers were NN, SVM with all four kernels, and C4.5; this result confirms them as good single classifiers. The standard deviation of the average accuracy was low, which could be a consequence of cross validation reducing variance by averaging over different partitions.

OneR obtained the worst performance. A possible explanation is that OneR generates a single rule to make the classification, and that single rule may not be enough to discriminate the four classes in this particular problem.

After the statistical analysis, no statistically significant difference in average accuracy among the top five classifiers across 30 runs was found. A possible explanation for this result is the stability in average accuracy achieved by the classifiers under 10-FCV.

5.2. OvA Classification

AMAN versus ALL showed the highest performance in balanced accuracy across 30 runs; the opposite case was MF versus ALL, and AMSAN versus ALL was the second best. The two classes with the largest number of instances, AMAN versus ALL and AMSAN versus ALL, yielded the best classification results. NN, C4.5, and SVMLap appear among the top five classifiers in most cases, and Naive Bayes appears as the top classifier for MF versus ALL. In most cases, OneR obtained the worst performance. Overall, the highest and most stable average results across 30 runs were obtained in the 10-FCV scenario.

After the statistical analysis, two classifiers stood out from the rest. Naive Bayes resulted as the best classifier for the minority class versus ALL, that is, MF versus ALL. NN was the best classifier for AMSAN versus ALL.

6. Conclusions

In this work, we aimed at creating the most accurate predictive model for GBS possible, using the 16 relevant features identified with the QSA-PAM method. This work constitutes the first effort on this topic using machine learning methods. Using a reduced set of predictors for GBS subtypes could allow simpler and faster medical tests.

For this first approach, we used single classifiers. We selected 15 single classifiers of diverse types: decision trees (C4.5), instance-based learners (NN), kernel-based learners (SVM), neural networks (SLP, MLP, and RBF-DDA), and rule induction learners (OneR, JRip), among others; the complete list is in Section 2.2. We compared their performance in three types of experiments: four GBS subtypes' classification, OvA classification, and OvO classification; however, in this work we only presented results from the first two scenarios.

In four GBS subtypes' classification, we obtained an average accuracy above 0.90 with half of the classifiers investigated. In OvA classification, the two classes with the largest number of instances, AMAN versus ALL and AMSAN versus ALL, yielded the best classification results. Although some classifiers stand out from the rest, as mentioned in the Discussion, each classification scenario had its own best classification method. The analysis performed in this work provides insight into the best classifiers for each classification case. Furthermore, from the machine learning perspective, it is always useful to analyze the classification power of different classifiers in diverse tasks.

This study is limited by the number of instances in the dataset and by the absence of other GBS datasets against which to compare our results.

As future work, we will investigate the performance of ensemble methods. Also, we will further tackle the imbalanced data problem.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was partially supported by Universidad Juárez Autónoma de Tabasco and Consejo Nacional de Ciencia y Tecnología (CONACYT). The authors would like to thank Dr. Juan José Méndez Castillo for providing the dataset used in this study.