Abstract

Neurodegenerative diseases that cause serious gait abnormalities include Parkinson's disease (PD), amyotrophic lateral sclerosis (ALS), and Huntington disease (HD). These diseases distort the gait rhythm, which can be measured through the stride time interval derived from footfall contact times. In this paper, we present a new method for gait classification of neurodegenerative diseases. In particular, we utilize the symbolic aggregate approximation algorithm to convert the left-foot stride-to-stride interval into a sequence of symbols. We then find string prototypes of each class using the newly proposed string grammar unsupervised possibilistic fuzzy C-medians. In the testing process, the fuzzy k-nearest neighbor is used as the classifier. We implement the system on three 2-class problems, i.e., the classification of ALS against healthy subjects, that of HD against healthy subjects, and that of PD against healthy subjects. The system is also implemented on one 4-class problem (the classification of ALS, HD, PD, and healthy subjects altogether) called NDDs versus healthy. The system yields good detection results: the average correct classification rate for ALS versus healthy is 96.88%, that for HD versus healthy is 97.22%, and that for PD versus healthy is 96.43%. When the system is implemented on the 4-class problem, the average accuracy is approximately 98.44%. The system can also provide prototypes of gait signals that are more understandable to humans.

1. Introduction

Neurodegenerative diseases (NDDs) are diseases of neuronal destruction in the central nervous system. NDDs cause the volume of the brain and the number of nerve cells to deteriorate over time. They reduce the abilities of the patient and destroy brain tissue and nerves, because neurons in the brain normally cannot reproduce themselves. Some neurodegenerative disorders such as Parkinson’s disease (PD), Huntington disease (HD), and amyotrophic lateral sclerosis (ALS) usually occur at an older age and can lead to serious gait abnormalities [1]. Since the balancing and sequencing of movement are controlled by the central nervous system, the gait of a patient with a neurodegenerative disorder becomes abnormal. The main symptoms of PD are trembling legs, slowed movement, and impaired posture and balance, which may grow worse over time [2]. The main symptoms of HD are mood changes, muscle coordination problems, uncontrolled movement, and difficulty in walking. Patients with HD may lose their intellectual and behavioural abilities and may also experience psychiatric symptoms [3]. In ALS patients, part of the nerve cells that control muscle function is destroyed. The disease is characterized by continuous muscle atrophy, which causes muscle weakness and tenderness. The general symptoms of ALS are difficulty in walking, swallowing, breathing, and speaking [4]. The authors of [5] found that patients with neurodegenerative diseases had decreased stride length compared with healthy control subjects. For these reasons, stride-to-stride gait information is utilized for gait pattern classification in patients with neurodegenerative diseases, since the gait patterns of healthy and NDD subjects differ.

In recent related studies, information from the time series of stride intervals, swing intervals, and stance intervals is utilized to classify the gait patterns of patients with NDDs and healthy control subjects. Some research works involved detecting either PD or ALS only [9, 13, 14]. Some of them involved HD, ALS, and PD classification [8, 10–12]; however, information from both the left and right feet is used in those systems. A few of them utilized only right-foot information to classify HD, ALS, and PD [7]; however, this method only distinguished patients with one given disease from healthy subjects and did not address finding a patient with any one of the diseases against healthy subjects. All previous studies utilized a regular numeric classifier, e.g., the support vector machine. Hence, these methods cannot provide a prototype signal for each disease.

In this paper, we propose a syntactic method for gait pattern classification from time series information. In particular, we introduce the string grammar unsupervised possibilistic fuzzy C-medians (sgUPFCMed) to recognize PD, ALS, and HD from the left-foot stride interval. It is worth noting that the sgUPFCMed is a new algorithm proposed by our research group. It is part of the recent doctoral thesis of one of our group members [6] and has not been published elsewhere. In the thesis, it was implemented on some standard data sets that are syntactic by nature, e.g., the Copenhagen chromosomes data set [15–17], the MNIST handwritten digit data set downloaded from http://algoval.essex.ac.uk/data/sequence/ as described in [18–21] and collected by Professor Simon M. Lucas, and the USPS handwritten digit data set, also collected by Professor Simon M. Lucas and downloaded from http://algoval.essex.ac.uk/data/sequence/ [18–21]. An example from each data set is shown in Figure 1. The histogram of each image in the Copenhagen chromosomes data set was encoded into a string. It should be noted that, for all three data sets, we downloaded the encoded data sets, not the images. The experimental results on both 10-fold cross validation and the blind test data sets from all three data sets are shown in Table 1. They show that the algorithm is capable of classifying syntactic data sets and provides good classification results.

Since our algorithm is not a numeric classifier but a syntactic classifier, we transform the gait time series into a string using the symbolic aggregate approximation (SAX) [22]. The sgUPFCMed is utilized to find one or more string prototypes for each disease. Then the fuzzy k-nearest neighbor [23] is utilized to find the best match for a test sample. The paper is structured as follows. The NDD detection system is described in Section 2. The results of gait classification are shown in Section 3. Finally, we draw conclusions in Section 4.

2. System Description

In this section, we introduce the details of our system for gait pattern classification of patients with neurodegenerative diseases (NDDs). We take the gait data set from the gait dynamics in neurodegenerative disease database (http://www.physionet.org/physiobank/database/gaitndd/). This data set consists of 64 subjects: 15 subjects with PD, 20 subjects with HD, 13 subjects with ALS, and 16 healthy control subjects [24]. Subjects were requested to walk along a 77-meter-long hallway for 5 minutes without stopping. The signals from force-sensitive switches underneath each subject’s feet were recorded at a 300 Hz sampling rate. The raw data are obtained using force-sensitive resistors, with the output roughly proportional to the force under the foot. From the recorded force, the time series of the stride time, stance time, and swing time were derived; these stride-to-stride measures of footfall contact times are shown in Figure 2. To eliminate the startup effects, we follow the same method as in [25]: the first 20 values of each sample are removed. The 3-SD median filter is then utilized to eliminate the outliers that are far away from the median value [25]. In the experiment, we only use the left-foot stride-to-stride interval data set. The proposed scheme of the detection system is shown in Figure 3.

We transform each time series into a string using the symbolic aggregate approximation (SAX) representation [22], which converts any time series into a sequence of symbols. The gait time series $X = x_1, \ldots, x_n$ of length $n$ is converted into its Piecewise Aggregate Approximation (PAA) (a vector in $w$-dimensional space, $\bar{X} = \bar{x}_1, \ldots, \bar{x}_w$) using

$$\bar{x}_i = \frac{w}{n}\sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}i} x_j. \tag{1}$$

The time series data ($X$) is first normalized to have zero mean and a standard deviation of one. Then it is divided into $w$ frames of equal size, and each frame is converted to a PAA value ($\bar{x}_i$). Then each $\bar{x}_i$ (for $i = 1, \ldots, w$) is mapped into a symbol. In our experiment, $w$ is set to be equal to the length of the time series, so each frame contains a single stride interval and no dimensionality reduction takes place. There are 8 symbols used in the experiment. An example of the string generation is shown in Figure 4. In this figure the gait time series is transformed to “fbfdbcaddfgh……dffhdd”.
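The SAX conversion described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments: the function name `sax`, the use of the population standard deviation, the requirement that $w$ divides $n$, and the choice of breakpoints that split the standard normal distribution into equiprobable regions (the usual SAX convention) are all assumptions.

```python
from statistics import NormalDist, mean, pstdev

def sax(series, w, alphabet="abcdefgh"):
    """Convert a numeric time series into a SAX string (illustrative sketch)."""
    n = len(series)
    mu, sigma = mean(series), pstdev(series)  # assumes sigma > 0
    # Normalize to zero mean and unit standard deviation.
    z = [(x - mu) / sigma for x in series]
    # PAA: average each of the w equal-width frames (assumes w divides n).
    frame = n // w
    paa = [mean(z[i * frame:(i + 1) * frame]) for i in range(w)]
    # Breakpoints cutting the standard normal into len(alphabet) equiprobable regions.
    k = len(alphabet)
    cuts = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    # Each PAA value maps to the symbol of the region it falls into.
    return "".join(alphabet[sum(v > c for c in cuts)] for v in paa)

# With w equal to the series length, every value is mapped to its own symbol.
print(sax([1, 2, 3, 4, 5, 6, 7, 8], w=8))
```

Setting `w` to the length of the series, as in the experiments, skips the averaging step in effect, so the string has one symbol per stride interval.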

Now, we are ready to create prototypes with the string grammar unsupervised possibilistic fuzzy C-medians (sgUPFCMed). The sgUPFCMed is a modified version of the unsupervised possibilistic fuzzy C-means (UPFCM) [26], a combination of the possibilistic fuzzy C-means (PFCM) [27] and the unsupervised possibilistic clustering (UPCM) [28]. The UPFCM was proposed to solve the problem of the UPCM generating coincident clusters. It is developed based on the characteristics of both fuzzy and possibilistic C-means. Hence, the UPFCM should be able to deal more effectively with noise, overlapping, and outliers. Since the sgUPFCMed is modified from the UPFCM, it should have the same properties as the UPFCM. A brief description of the algorithm is as follows. Let $S = \{s_1, s_2, \ldots, s_N\}$ be a set of $N$ strings. Each string $s_k$ is a sequence of symbols (primitives). For example, $s_k = (x_1 x_2 \ldots x_l)$ is a string with length $l$, where each $x_j$ is a member of a set of defined symbols or primitives. Suppose $V = (sc_1, sc_2, \ldots, sc_C)$ represents a $C$-tuple of string prototypes, each of which characterizes one of the $C$ clusters. $Lev(s_k, sc_i)$ is the Levenshtein distance [29–32] between string $s_k$ and string prototype $sc_i$. $U = [u_{ik}]_{C \times N}$ is a membership matrix and $T = [t_{ik}]_{C \times N}$ is a possibilistic matrix. The objective function of the sgUPFCMed is

$$J_{sgUPFCMed}(U, T, V) = \sum_{i=1}^{C}\sum_{k=1}^{N}\left(a\,u_{ik}^{m} + b\,t_{ik}^{\eta}\right) Lev(s_k, sc_i) + \frac{\beta}{\eta^{2}\sqrt{C}}\sum_{i=1}^{C}\sum_{k=1}^{N}\left(t_{ik}^{\eta}\log t_{ik}^{\eta} - t_{ik}^{\eta}\right), \tag{2}$$

where $u_{ik}$ is the membership value of string $s_k$ in cluster $i$, $t_{ik}$ is the possibilistic value of string $s_k$ in cluster $i$, $m$ is the fuzzifier (normally $m > 1$), $\eta > 1$, $a > 0$, $b > 0$, $0 \le u_{ik}, t_{ik} \le 1$, $\sum_{i=1}^{C} u_{ik} = 1$ for $k = 1, \ldots, N$, and $\beta > 0$. In the numeric case, $\beta$ is defined as the sample covariance [23] based on the Euclidean distance. Since our data set is a string data set, the calculation of $\beta$ will be

$$\beta = \frac{\sum_{k=1}^{N} Lev(s_k, Med)}{N}, \tag{3}$$

where $Med$ is the median string of the data set; i.e.,

$$Med = \arg\min_{s_j \in S}\sum_{k=1}^{N} Lev(s_j, s_k). \tag{4}$$

The theorem for the sgUPFCMed and its corresponding proof are given in Theorem 1. This theorem shows that the update equation of the membership value of string $s_k$ in cluster $i$ ($u_{ik}$) (5) and the update equation of the possibilistic value of string $s_k$ in cluster $i$ ($t_{ik}$) (6) give the minimum value of the objective function ($J_{sgUPFCMed}$).
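The two string-specific ingredients of this description, the Levenshtein distance and the median-string-based scale parameter, can be sketched as follows. The function names are illustrative, unit costs for insertion, deletion, and substitution are assumed, and the scale is assumed to be the mean distance of the strings to their set median.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete cs
                            curr[j - 1] + 1,           # insert ct
                            prev[j - 1] + (cs != ct))) # substitute (free if equal)
        prev = curr
    return prev[-1]

def set_median(strings):
    """Member of the set with the smallest total distance to all strings."""
    return min(strings, key=lambda c: sum(levenshtein(c, s) for s in strings))

def beta_from_strings(strings):
    """Assumed string analogue of the sample covariance: mean distance to the median."""
    med = set_median(strings)
    return sum(levenshtein(s, med) for s in strings) / len(strings)

print(levenshtein("kitten", "sitting"))
print(set_median(["abc", "abd", "bbc"]), beta_from_strings(["abc", "abd", "bbc"]))
```

The set median costs a quadratic number of distance computations in the number of strings, which is acceptable for a data set of a few dozen gait strings.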

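Theorem 1 (sgUPFCMed). If $Lev(s_k, sc_i) > 0$ for all $i$ and $k$, when $m > 1$, $\eta > 1$, $a, b > 0$, and $S$ contains $C < N$ distinct string data, then $J_{sgUPFCMed}$ is minimized only if the update equation of $u_{ik}$ is

$$u_{ik} = \left[\sum_{j=1}^{C}\left(\frac{Lev(s_k, sc_i)}{Lev(s_k, sc_j)}\right)^{1/(m-1)}\right]^{-1} \tag{5}$$

and the update equation of $t_{ik}$ is

$$t_{ik} = \exp\left(-\frac{b\,\eta\,\sqrt{C}\,Lev(s_k, sc_i)}{\beta}\right). \tag{6}$$

Proof. From the Lagrange multiplier theorem, (5) is obtained by solving, for the $k$-th column of $U$ with $T$ and $V$ fixed, the reduced problem $\min \sum_{i=1}^{C} a\,u_{ik}^{m}\,Lev(s_k, sc_i)$ subject to $\sum_{i=1}^{C} u_{ik} = 1$. The proof of this equation is similar to that in [23]; hence, (5) follows directly.
Similarly, when $U$ and $V$ are fixed, (6) is proved for the $i$-th row of $T$ by solving the problem $\min_{t_{ik}} \left\{ b\,t_{ik}^{\eta}\,Lev(s_k, sc_i) + \frac{\beta}{\eta^{2}\sqrt{C}}\left(t_{ik}^{\eta}\log t_{ik}^{\eta} - t_{ik}^{\eta}\right) \right\}$. Taking the derivative of this expression with respect to $t_{ik}$ and setting it to zero leads to

$$b\,\eta\,t_{ik}^{\eta-1}\,Lev(s_k, sc_i) + \frac{\beta}{\eta\sqrt{C}}\,t_{ik}^{\eta-1}\log t_{ik}^{\eta} = 0,$$

so that $\log t_{ik}^{\eta} = -b\,\eta^{2}\sqrt{C}\,Lev(s_k, sc_i)/\beta$, which gives (6).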
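As a small numerical illustration of the update equations (5) and (6), the sketch below assumes that they take the UPFCM-style form with the Levenshtein distance in place of the squared Euclidean distance: an inverse-distance-ratio rule for the memberships and an exponential decay for the possibilistic values. The function names, the argument layout (`dist[i][k]` is the distance from string k to prototype i), and the default parameter values are illustrative.

```python
import math

def update_memberships(dist, m=2.0):
    """Fuzzy memberships; every distance is assumed to be strictly positive."""
    C, N = len(dist), len(dist[0])
    power = 1.0 / (m - 1.0)
    return [[1.0 / sum((dist[i][k] / dist[j][k]) ** power for j in range(C))
             for k in range(N)] for i in range(C)]

def update_typicalities(dist, beta, b=6.0, eta=2.0):
    """Possibilistic values: decay exponentially with distance, scaled by beta."""
    C = len(dist)
    return [[math.exp(-b * eta * math.sqrt(C) * d / beta) for d in row]
            for row in dist]

# Two prototypes, two strings: string 0 is near prototype 0, string 1 near prototype 1.
dist = [[1.0, 3.0],
        [3.0, 1.0]]
u = update_memberships(dist)
t = update_typicalities(dist, beta=10.0)
print(u)
print(t)
```

The memberships of each string sum to one across clusters, whereas each possibilistic value depends only on the distance to its own prototype, which is what makes the possibilistic part less sensitive to outliers.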

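To update a cluster center, we utilize the fuzzy median string [23, 33–36] as follows:

$$sc_i = \arg\min_{s_j \in S}\sum_{k=1}^{N}\left(a\,u_{ik}^{m} + b\,t_{ik}^{\eta}\right) Lev(s_j, s_k). \tag{10}$$

However, it has been shown in [35, 36] that the modified median string provides better classification than the regular median string. Hence, as in [23, 33–36], the modified fuzzy median string is used. Let $\Sigma^{*}$ be the free monoid over the alphabet set $\Sigma$ and $S \subseteq \Sigma^{*}$ a set of strings. Then, the modified fuzzy median, i.e., an approximation of the fuzzy median using edit operations (insertion, deletion, and substitution) over each symbol of the string, will be

$$sc_i = \arg\min_{s \in \Sigma^{*}}\sum_{k=1}^{N}\left(a\,u_{ik}^{m} + b\,t_{ik}^{\eta}\right) Lev(s, s_k). \tag{11}$$

The cluster center update procedure of the sgUPFCMed is shown in Algorithm 1.

Start with the initial string $s$. Let $J_i(v) = \sum_{k=1}^{N}\left(a\,u_{ik}^{m} + b\,t_{ik}^{\eta}\right) Lev(v, s_k)$ denote the weighted distance sum of a candidate string $v$ for cluster $i$.
For each position $p$ in the string $s$
(1) Build alternative
Substitution: Set $z = s$. For each symbol $e \in \Sigma$
  (a) Set $z'$ to be the result of substituting the $p$-th symbol of $s$ with symbol $e$.
  (b) If $J_i(z') < J_i(z)$,
    then set $z = z'$.
Deletion: Set $y$ to be the result of deleting the $p$-th symbol of $s$.
Insertion: Set $x = s$. For each symbol $e \in \Sigma$
  (a) Set $x'$ to be the result of adding $e$ at position $p$ of $s$.
  (b) If $J_i(x') < J_i(x)$,
    then set $x = x'$.
(2) Choose an alternative
 Select string $s'$ from the set of strings $\{s, x, y, z\}$ from step (1) using
     $s' = \arg\min_{v \in \{s, x, y, z\}} J_i(v)$.   Then set $s = s'$.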
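The position-by-position search of Algorithm 1 can be sketched as below. This is an illustration under stated assumptions, not the authors' code: `weights` stands for the per-string weights of one cluster (built from its membership and possibilistic values), the function names are hypothetical, and the loop visits the positions of the starting string once, stopping early if deletions make the current string shorter than the position index.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def modified_fuzzy_median(start, strings, weights, alphabet):
    """Greedy refinement of a prototype by single-symbol edits (sketch)."""
    def cost(v):
        # Weighted sum of distances from candidate v to every string.
        return sum(w * levenshtein(v, s) for w, s in zip(weights, strings))

    s = start
    for p in range(len(start)):
        if p >= len(s):                      # string shrank below this position
            break
        # Best substitution at position p (includes keeping the current symbol).
        z = min((s[:p] + e + s[p + 1:] for e in alphabet), key=cost)
        # Deletion of the symbol at position p.
        y = s[:p] + s[p + 1:]
        # Best insertion at position p.
        x = min((s[:p] + e + s[p:] for e in alphabet), key=cost)
        # Keep the current string unless an alternative is strictly better.
        s = min((s, z, y, x), key=cost)
    return s

strings = ["abc", "abc", "abd"]
print(modified_fuzzy_median("xbc", strings, [1.0, 1.0, 1.0], "abcdx"))
```

Because the current string is listed first in the final `min`, ties leave the prototype unchanged, so the cost never increases from one position to the next.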

The sgUPFCMed algorithm is summarized in Algorithm 2.

Store unlabeled finite strings $S = \{s_1, s_2, \ldots, s_N\}$
Initialize string prototypes $sc_i$ for all $C$ classes
Set $m$, $\eta$, $a$, $b$
Compute $\beta$ using the median string, equation (3)
Do
  Compute the Levenshtein distance $Lev(s_k, sc_i)$ between each input string $s_k$ and each cluster prototype $sc_i$
  Update the membership value $u_{ik}$ using equation (5)
  Update the possibilistic value $t_{ik}$ using equation (6)
  Update the center string of each cluster ($sc_i$) using equations (10) and (11)
Until (the solution stabilizes)
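The loop of Algorithm 2 can be put together as in the following sketch. Several simplifications are assumptions made to keep the example short: prototypes are updated with the set median (the best string among the data) instead of the modified fuzzy median, zero distances are replaced by a small `eps` because the update rules require positive distances, the loop stops when the prototypes no longer change, and the membership and possibilistic updates use the assumed UPFCM-style forms. All names are illustrative.

```python
import math

def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def weighted_cost(cand, strings, weights):
    return sum(w * levenshtein(cand, s) for w, s in zip(weights, strings))

def sg_upfcmed(strings, protos, m=2.0, eta=2.0, a=1.0, b=6.0, max_iter=100, eps=1e-6):
    C, N = len(protos), len(strings)
    ones = [1.0] * N
    # Scale parameter: mean distance of all strings to their set median.
    med = min(strings, key=lambda c: weighted_cost(c, strings, ones))
    beta = max(weighted_cost(med, strings, ones) / N, eps)
    for _ in range(max_iter):
        # Distances from every string to every prototype (eps guards zeros).
        d = [[max(levenshtein(s, p), eps) for s in strings] for p in protos]
        # Membership update: inverse-distance ratios, columns sum to one.
        u = [[1.0 / sum((d[i][k] / d[j][k]) ** (1.0 / (m - 1.0)) for j in range(C))
              for k in range(N)] for i in range(C)]
        # Possibilistic update: exponential decay with distance.
        t = [[math.exp(-b * eta * math.sqrt(C) * d[i][k] / beta) for k in range(N)]
             for i in range(C)]
        # Center update: weighted set median for each cluster.
        new_protos = []
        for i in range(C):
            w = [a * u[i][k] ** m + b * t[i][k] ** eta for k in range(N)]
            new_protos.append(min(strings, key=lambda c: weighted_cost(c, strings, w)))
        if new_protos == protos:
            break
        protos = new_protos
    return protos, u, t

strings = ["aaaa", "aaab", "aaba", "cccc", "cccd", "ccdc"]
protos, u, t = sg_upfcmed(strings, ["aaab", "cccd"])
print(protos)
```

With the set median, a prototype that coincides with a data string receives a possibilistic value close to one for that string, which tends to anchor the prototype; the modified fuzzy median of Algorithm 1 searches outside the data set and is less prone to this.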

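Afterwards, the multiprototype set, i.e., $\{sc_j^i\}$, where $sc_j^i$ is the $j$-th string prototype of class $i$, is created. The fuzzy k-nearest neighbor (FKNN) [23, 37] is used as a classifier. The membership value of a test string $s$ in class $i$ is

$$u_i(s) = \frac{\sum_{j=1}^{K} u_{ij}\left(1 / Lev(s, sc_j)^{1/(m-1)}\right)}{\sum_{j=1}^{K}\left(1 / Lev(s, sc_j)^{1/(m-1)}\right)}, \quad i = 1, \ldots, C,$$

where $sc_j$ is the $j$-th nearest prototype to $s$, $u_{ij}$ is the membership value of that prototype in class $i$, $C$ is the number of classes, and $K$ is the number of nearest neighbors. The decision rule for the test string is

$$s \text{ is assigned to class } i \text{ if } u_i(s) > u_j(s) \text{ for all } j \ne i.$$

Because the class of each prototype is known, we set the membership value $u_{ij}$ to 1 if the prototype $sc_j$ belongs to class $i$ and to zero for all other classes.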
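A sketch of the FKNN decision over labeled string prototypes follows. It assumes crisp prototype labels (membership 1 in the prototype's own class, 0 elsewhere), inverse-distance weights with exponent 1/(m-1), and a small `eps` to avoid division by zero when the test string equals a prototype. The function names, the prototype strings, and the class labels are illustrative only.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def fknn_classify(s, prototypes, K=1, m=2.0, eps=1e-6):
    """prototypes: list of (string, class_label) pairs with crisp labels."""
    # K nearest prototypes by Levenshtein distance.
    scored = sorted(((levenshtein(s, p), label) for p, label in prototypes),
                    key=lambda pair: pair[0])[:K]
    # Inverse-distance weight of each neighbor.
    weights = [(1.0 / max(d, eps) ** (1.0 / (m - 1.0)), label) for d, label in scored]
    total = sum(w for w, _ in weights)
    classes = sorted({label for _, label in prototypes})
    # Class membership: normalized weight of the neighbors carrying that label.
    member = {c: sum(w for w, lab in weights if lab == c) / total for c in classes}
    return max(member, key=member.get), member

prototypes = [("aabb", "ALS"), ("abab", "ALS"), ("ccdd", "healthy"), ("cdcd", "healthy")]
label, member = fknn_classify("aabc", prototypes, K=1)
print(label, member)
```

With `K=1` and crisp labels the rule reduces to nearest-prototype classification, which is the setting that gave the best validation results in the experiments.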

3. Experiment Results

We implement three 2-class problems, i.e., the classification of ALS against healthy subjects, HD against healthy subjects, and PD against healthy subjects. We also implement one 4-class classification, i.e., the classification of all three NDDs (ALS, HD, and PD) and healthy subjects. In all of the experiments, we use 4-fold cross validation to evaluate our proposed algorithm. The parameters $m$ and $\eta$ are set to 2, and the parameters $a$ and $b$ are set to 1 and 6, respectively. These parameters are chosen by trial and error from an extensive experiment. The stopping criterion of the sgUPFCMed is set to 0.01 with a maximum of 100 iterations. To create multiple prototypes for each class, the sgUPFCMed is used to cluster each class with 2, 3, 4, and 5 clusters. In the testing process, the FKNN is utilized with $K$ = 1, 3, and 5. Tables 2–5 show the average and the standard deviation of the classification rate on the validation set for ALS versus healthy, HD versus healthy, PD versus healthy, and NDDs versus healthy. The best validation result for ALS is 96.875±6.250% with 3 prototypes for each class and 1 nearest neighbor, while that for HD is 97.222±5.556% with 2 prototypes for each class and 1 nearest neighbor. The best result for PD is 96.429±7.143% with 2 prototypes and 1 nearest neighbor. For all three NDD classes versus healthy subjects, the best result is again obtained with 2 prototypes and 1 nearest neighbor, with a classification rate of 98.437±3.125%. The sensitivity and specificity of the best model for ALS, HD, PD, and NDDs are shown in Table 6. Figures 5–8 show the time series that are closest to the prototypes of the best model in the ALS, HD, PD, and NDDs classification experiments, respectively. We can see that the shape of each prototype is not exactly similar to the others.
Although there is some overlap between the prototypes of the disease gait signals and those of the healthy gait signals, the detection system can provide a good classification rate. For example, in Figure 6, the prototypes of the HD gait signals overlap with the healthy control prototypes. However, the shapes are different, so the string sequences are different as well. Hence, the classification result is close to 100%. We also compare our results indirectly with the existing methods as shown in Table 7. We can see that our results are better than those of the numeric algorithms in all cases except the PD and HD classification in the 2-class problems and the NDDs classification in the 4-class problem. However, the algorithm in [8] was evaluated using all-train-all-test, whereas our result is based on the validation set only. The algorithm in [12] used several features, while our system only uses the left-foot stride-to-stride interval. Moreover, our system can provide the shapes of prototypes, which might be more understandable to the user than the output of the numeric algorithms.
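As an arithmetic aside on how a mean and standard deviation over 4 folds are formed, the snippet below uses hypothetical per-fold accuracies; the paper does not report per-fold values. The assumption is that three folds are classified perfectly and one fold contains a single misclassified subject, with assumed validation-fold sizes of 8, 9, 7, and 16 subjects. Under that assumption, and using the sample standard deviation, the summaries agree with the reported figures up to rounding.

```python
from statistics import mean, stdev

# Hypothetical per-fold accuracies in percent (NOT reported in the paper):
# three perfect folds and one fold with a single error.
folds = {
    "ALS vs healthy": [100.0, 100.0, 100.0, 100.0 * 7 / 8],
    "HD vs healthy": [100.0, 100.0, 100.0, 100.0 * 8 / 9],
    "PD vs healthy": [100.0, 100.0, 100.0, 100.0 * 6 / 7],
    "NDDs vs healthy": [100.0, 100.0, 100.0, 100.0 * 15 / 16],
}

# Mean and sample standard deviation (n - 1 in the denominator) per problem.
summary = {name: (mean(acc), stdev(acc)) for name, acc in folds.items()}
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.3f} ± {sd:.3f}%")
```

This also shows why the standard deviations are large relative to the error rates: with only four folds and a few dozen subjects, one misclassification moves a fold's accuracy by 6 to 14 percentage points.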

4. Conclusions

In this paper, a detection system for NDDs, i.e., Parkinson's disease (PD), amyotrophic lateral sclerosis (ALS), and Huntington disease (HD), is introduced. In particular, the left-foot gait time series (left-foot stride-to-stride interval) is transformed into a string, i.e., a sequence of symbols. The string grammar unsupervised possibilistic fuzzy C-medians (sgUPFCMed), first introduced in this paper, is utilized to generate prototypes of each disease. Then the fuzzy k-nearest neighbor is used as a classifier in the testing process. We found that the best validation results of the 2-class problems, i.e., ALS versus healthy, HD versus healthy, and PD versus healthy, are 96.88±6.25%, 97.22±5.56%, and 96.43±7.14%, respectively. For the 4-class problem (three NDDs versus healthy), the best classification rate is 98.44±3.13%. From the indirect comparison, we found that our algorithm performs better than the existing algorithms on average. In addition, our system can provide prototype signals that are more understandable to humans than the outputs of previous methods based on numeric algorithms.

Data Availability

The data set is downloaded from http://www.physionet.org/physiobank/database/gaitndd/. It is a public data set provided by physionet.org.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank the Thailand Research Fund and Chiang Mai University under the Royal Golden Jubilee Ph.D. Program (Grant no. PHD/0044/2555) for financial support.