#### Abstract

A reliable fault diagnostic system for the gas turbine generator system (GTGS), which is complicated and subject to many types of component faults, is essential to avoid interruption of the electricity supply. However, GTGS diagnosis faces two challenges: faults may occur simultaneously, and acquiring the exponentially growing number of simultaneous-fault vibration signals needed to construct the diagnostic system is costly. This research proposes a new diagnostic framework combining feature extraction, a pairwise-coupled probabilistic classifier, and decision threshold optimization. The feature extraction module adopts wavelet packet transform and time-domain statistical features to extract vibration signal features. Kernel principal component analysis is then applied to further reduce the redundant features. The features of single faults in a simultaneous-fault pattern are extracted and then detected using a probabilistic classifier, namely, the pairwise-coupled relevance vector machine, which is trained with single-fault patterns only. Therefore, a training dataset of simultaneous-fault patterns is unnecessary. To optimize the decision threshold, this research proposes to use a grid search method which, unlike traditional computational intelligence techniques, can ensure a global solution. Experimental results show that the proposed framework performs well for both single-fault and simultaneous-fault diagnosis and is superior to the frameworks without feature extraction and pairwise coupling.

#### 1. Introduction

The gas turbine generator system (GTGS) is commonly used in many power plants. The main components of a GTGS are power turbine, gearbox, flywheel, and asynchronous generator. In the first phase of the GTGS, the power turbine is driven by the exhaust gas; the output of the turbine then drives a gearbox that connects to the flywheel which keeps constant moment of inertia to protect the generator from a sudden stop. Finally, the rotating flywheel drives the asynchronous generator to generate the electric power. The system is designed to run 24 hours per day. Any abnormal situation of the GTGS will interrupt the electricity supply to cause enormous economic loss. The traditional manual inspection on the GTGS can hardly accomplish the fault-monitoring task because the GTGS is complicated and also more than one fault may appear at a time. This kind of problem refers to simultaneous-fault diagnosis. To prevent the interruption of electricity supply, development of an intelligent fault diagnostic system for single and simultaneous faults of the GTGS is a promising research topic.

According to the existing literature, no physical diagnostic model for simultaneous-fault diagnosis about power generator system like GTGS is available. One possible solution for intelligent fault detection is to use machine learning methods, like neural networks (NNs) and support vector machines, to learn the normal and abnormal vibration signal patterns. A classifier is then built for fault detection of unseen signal patterns.

In recent years, research on neural network-based monitoring systems for the GTGS and rotating machinery has been reported [1–4]. However, neural network classifiers have many drawbacks, such as local minima, the time-consuming determination of the optimal network structure, and the risk of overfitting. To date, a number of researchers have applied support vector machines (SVMs) to diagnose rotating machine faults and other engineering diagnosis problems [5–13] and have shown that SVM is superior to traditional NNs [8, 10, 11]. The major advantages of SVM are its global optimum and higher generalization capability [8, 11]. Nevertheless, those classification systems can detect single faults only. Simultaneous faults can also be detected by single-label classification, in which both single and simultaneous faults are treated as independent labels, but this transformation suffers from two drawbacks. Firstly, it significantly increases the required amount of training data because a set of new classes is artificially generated from combinations of single faults, which makes it impractical for industrial problems with a medium to large number of faults. For example, with *d* single faults (labels), there can be up to 2^{d} − *d* − 1 artificial simultaneous-fault labels, one for each combination of two or more single faults. Each label requires a certain amount of training data, so the approach is prohibitive because collecting all the possible simultaneous-fault patterns is costly and difficult. Secondly, it is very difficult to include new faults in the system later, because adding one single fault exponentially increases the number of combinations of single faults that form the artificial simultaneous-fault labels and, hence, the amount of corresponding training patterns required.
Another way to solve the problem of simultaneous-fault diagnosis is the framework called hierarchical artificial neural network (HANN) which is constructed by several stages of NNs. Each of the HANN stage is constructed with a set of NNs as described in [2, 14–16]. HANN also suffers from the shortcoming that its architecture will become too complex to handle when extending to medium or large scale. With the increase of the number of single-faults, a large amount of NNs must be constructed in every HANN stage.
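
The growth in label count described for the single-label transformation can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it simply counts every combination of two or more single faults, assuming that all combinations can physically occur.

```python
from math import comb


def simultaneous_label_count(d: int) -> int:
    """Count the fault combinations that contain two or more of d single faults."""
    return sum(comb(d, k) for k in range(2, d + 1))


# Adding single faults quickly inflates the number of artificial labels.
for d in (3, 5, 8, 10):
    print(d, simultaneous_label_count(d))
```

Under this counting, 8 single faults already imply 247 possible simultaneous-fault labels, each of which would need its own training patterns.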

To overcome the aforesaid limitations, one possible way is to develop a classifier that can diagnose both single and simultaneous faults while being trained with single-fault data only. This idea is known as multilabel classification. It is believed in the area of GTGS that the features of single-fault signal patterns can be found in the simultaneous-fault patterns if a proper feature extraction technique is selected. Should this hypothesis be true, it is possible to develop such a classifier by using proper feature extraction techniques. Therefore, one of the originalities of this research is to explore the feasibility of detecting simultaneous-fault vibration patterns of rotary machinery with a classifier that is trained with single-fault patterns only. To examine the feasibility, an experimental evaluation is conducted in this study.

As far as feature extraction is concerned, several techniques are currently available for fault detection of rotary machinery, including wavelet packet transform (WPT) [4, 8, 17], independent component analysis [6], and time-domain statistical features (TDSF) [13, 18]. According to the literature, WPT and TDSF are commonly used for rotating machinery, so both techniques are considered in this research. After feature extraction, there may still be some irrelevant and redundant information in the extracted features. It is well known that if a fault classifier has too many inputs, its accuracy degrades unless a very large amount of training data is available, which is impractical. To resolve this problem, a feature selection method should be employed to remove irrelevant and redundant information so that the number of classifier inputs is reduced and the diagnostic accuracy improves. Currently available feature selection approaches include compensation distance evaluation technique (CDET) [18], kernel principal component analysis (KPCA) [19], and genetic algorithm (GA)-based methods [10, 20]. Although CDET and GA-based methods provide a good solution, the optimal threshold in CDET is difficult to set, and the result of GA is not repeatable; in other words, two runs of a GA may give two different results. Therefore, KPCA is adopted in this study to reduce the dimensionality of the features. Hereafter, the term feature extraction refers to both feature extraction and feature selection.

From the practical point of view, a proper classifier has to offer the probabilities of all possible faults. Then the user can at least trace the other possible faults according to the rank of their probabilities when the predicted fault(s) from the classifier is (are) incorrect in the problem. Moreover, it is our belief that comparing the similarity of a simultaneous-fault input pattern with every single fault pattern is the main mechanism for simultaneous-fault detection without learning any simultaneous-fault training data. One possible way to represent the degree of similarity of two patterns is the probability of likeness. Therefore, it is logical to employ probabilistic classifier for simultaneous-fault diagnosis because they can determine the probability of each label. Typically, probabilistic neural network (PNN) [21, 22] was employed as a probabilistic classifier. It was shown in [21] that the performance of PNN is superior to SVM-based method for multilabel classification. However, the main drawback of PNN lies in the limited number of inputs because the complexity of the network and the training time are heavily related to the number of inputs. Recently, Widodo et al. [23] proposed to apply an advance classifier, namely, relevance vector machine (RVM), to the fault diagnosis of low speed bearings. They showed that RVM is superior to SVM in terms of diagnostic accuracy. RVM is a statistical learning method proposed by Tipping [24], which trains a probabilistic classifier with sparser model using Bayesian framework. RVM can be extended to multiclass version using one-versus-all (1vA) strategy. However, this strategy was verified to produce a large region of indecision [25, 26]. In view of this drawback, this research proposes to incorporate pairwise coupling, that is, one-versus-one (1v1) strategy, into RVM, namely, pairwise-coupled relevance vector machine (PCRVM). 
As the pairwise coupling strategy considers the correlation between every pair of fault labels, a more accurate estimate of label probabilities for simultaneous-fault signals can be achieved. The detailed explanation of the advantage of 1v1 over 1vA is discussed in Section 2.3.

If a probabilistic classifier is applied to fault detection, the predicted fault is usually inferred as the one with the largest probability. Alternatively, the probabilistic classifier ranks all the possible faults according to their probabilities and lets the engineer make a decision. These inference approaches work well for single-fault detection but fail to determine which faults occur simultaneously in the simultaneous-fault problem, because the engineer cannot identify the number of simultaneous faults from the output probability of each label. For instance, given an output probability vector for five labels, it is difficult for the engineer to judge whether the simultaneous faults are labels 2, 3, and 5. To identify the number of simultaneous faults, a decision threshold must be introduced, and thus a step of decision threshold optimization is included in the current framework in addition to feature extraction and probabilistic classification.

As a summary, a framework combining feature extraction (WPT + TDSF + KPCA), pairwise-coupled relevance vector machine (PCRVM), and effective decision threshold optimization is proposed for simultaneous-fault diagnosis of the GTGS. Even though the authors also proposed a similar framework for simultaneous fault diagnosis of automotive engine ignition systems [27], the framework was designed for simple waveforms, whereas the signal patterns of this application are high frequency oscillating signals. Moreover, one of the classification criteria and decision threshold optimization method in [27], respectively, rely on domain-specific knowledge and computational intelligence techniques, such as genetic algorithm (GA) and particle swarm optimization (PSO), which may produce a local optimal solution. Therefore, the framework in [27] cannot be directly applied; it should be modified significantly, particularly in the phases of feature extraction and the threshold optimization.

This paper is organized as follows. Section 2 presents the proposed framework and the related techniques. Experimental setup and sample data acquisition are discussed in Section 3. Section 4 discusses the experimental results of PCRVM and its comparison with typical PNN [21, 22], an existing single-label SVM classifier [28], and pairwise-coupled probabilistic neural network (PCPNN) in order to show the superiority of RVM and effectiveness of pairwise coupling strategy. Finally, a conclusion is given in Section 5.

#### 2. Proposed Simultaneous-Fault Diagnosis Framework

The proposed simultaneous-fault diagnostic framework for the GTGS and its evaluation approach are shown in Figure 1. The framework consists of four submodules: feature extraction, pairwise-coupled probabilistic classifier, model parameter and decision threshold optimization, and performance evaluation. The proposed framework is applicable to both PCPNN and PCRVM.

In the feature extraction submodule, the sample dataset is divided into three independent groups: a training dataset, a validation dataset, and a test dataset. The validation and test datasets contain both single-fault and simultaneous-fault patterns, while the training dataset contains single-fault patterns only. WPT and TDSF are used for feature extraction, which generates the training feature set, the validation feature set, and the features of the unseen vibration signal for testing. Because the extracted features contain irrelevant and redundant information, KPCA is then applied to remove useless information and further reduce the dimensions of these three feature sets. To ensure that all features contribute evenly, every feature in the reduced sets is normalized to the same interval. The results are the processed training dataset, the processed validation dataset, and the processed unseen signal for testing.

The diagnostic model, using a pairwise-coupled probabilistic classifier (PCPNN or PCRVM), is trained using the processed training dataset. The output of the trained classifier, together with the processed validation dataset, is used to optimize the parameter of the probabilistic classifier (i.e., the spread of PCPNN or the width of PCRVM) and the decision threshold. The optimized parameter is used to construct the final PCRVM- or PCPNN-based diagnostic model, and the optimal threshold is utilized to identify simultaneous faults in the unseen test dataset. The performance evaluation submodule adopts the F-measure to assess the diagnostic results on the test dataset. The details of the four submodules in the framework are discussed in the following subsections.

##### 2.1. Feature Extraction

###### 2.1.1. Wavelet Packet Transform

In the last two decades, wavelet transform (WT) has been widely applied in random signal processing. The transform of a signal is just another form of representing the signal. It does not change the information content presented in the signal. In WT, multiresolution technique is used, and different frequencies are analyzed with different resolutions in order to provide a time-frequency representation of the signal. Wavelet packet transform (WPT) derives from the WT family. WPT is a generalization of wavelet decomposition that offers a richer signal analysis. In the decomposition of a signal by DWT, only the lower frequency band is decomposed, giving a right recursive binary tree structure whose right lobe represents the lower frequency band, and its left lobe is the higher frequency band. In the corresponding WPT decomposition, the lower as well as the higher frequency bands are decomposed giving a balanced binary tree structure. Therefore, the WPT has the same frequency bandwidths in each resolution while the traditional discrete wavelet transform does not have this property. Therefore, WPT is suitable for processing of nonstationary signals, like the vibration signal, because the same frequency bandwidths can provide good resolution regardless of high and low frequencies.

###### 2.1.2. Kernel Principal Component Analysis

Principal component analysis (PCA) is a popular statistical method for principal component feature extraction. PCA always performs well in dimensionality reduction when the input variables are linearly correlated. However, for nonlinear cases, PCA cannot give a good performance. Hence, PCA is extended to nonlinear version under SVM formulation and is called kernel PCA (KPCA), which has been used to solve many application problems [29, 30]. KPCA involves solving in the following set of equations: where for , is the number of data for KPCA, , and is the dimension of the input data. The vector is the eigenvector of , and is the corresponding eigenvalue. The transformed variables (score variables) for vector become where is the th element in the eigenvector corresponding to the th largest eigenvalue, to , and is the largest number such that eigenvalue of the eigenvector is nonzero. Therefore, based on the pairs of , the input vector can be transformed to a nonlinearly uncorrelated variable where . One more point to note is that the eigenvectors should satisfy the normalization condition of unit length: where . To produce a further reduced feature vector, a postpruning procedure can be done. That is, after acquiring the pairs of , all are normalized to which satisfy the constraint . Based on these normalized eigenvalues , the smallest are deleted until. With the index , the eigenvectors ( to ) are selected to produce a reduced feature vector which retains 95% of the information content in the transformed features. Usually a 5% of information loss is a rule of thumb for dimensionality reduction.
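
The post-pruning rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `select_components` and the example eigenvalues are hypothetical, and the eigenvalues are assumed to come from a KPCA eigenproblem that has already been solved.

```python
def select_components(eigenvalues, retain=0.95):
    """Return how many leading components keep at least `retain` of the eigenvalue mass."""
    # Keep only nonzero eigenvalues, largest first, as in the post-pruning description.
    ordered = sorted((ev for ev in eigenvalues if ev > 0), reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, ev in enumerate(ordered, start=1):
        cumulative += ev / total  # normalized eigenvalues sum to 1
        if cumulative >= retain:
            return count
    return len(ordered)


# Hypothetical eigenvalues: the first three carry 97% of the total.
example_eigenvalues = [6.0, 3.0, 0.7, 0.2, 0.1]
kept = select_components(example_eigenvalues)
print(kept)
```

Only the eigenvectors belonging to the retained eigenvalues would then be used to compute the reduced score variables.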

##### 2.2. Relevance Vector Machine

Relevance vector machine (RVM) is a recently available machine learning method. Theoretically, RVM is a statistical learning method utilizing Bayesian learning framework and popular kernels. In this research, predicting the posterior probability of each fault for unseen symptoms is conducted by RVM based on experimental data. Given a set of training data , to , , and is the number of training data. It follows the statistical convention and generalizes the linear model by applying the logistic sigmoid function to the predicted decision and adopting the Bernoulli distribution for , and the likelihood of the data is written as [24] and are the adjustable parameters. is a radial basis function (RBF) since RBF kernel is usually adopted for classification problems.

The optimal weight vector for the given dataset needs to be computed so as to maximize the probability , with , a vector of hyperparameters. However, the weights cannot be determined analytically. Thus, the following approximation procedure is chosen, which is based on Laplace’s method.(a) For the current fixed values of , the most probable weights are found, which is the location of the posterior mode. Since , this step is equivalent to the following maximization: (b) Laplace’s method is simply a Gaussian approximation to the log posterior around the mode of the weights . Equation (5) is differentiated twice to give where is a diagonal matrix with and is an design matrix with and = 1, to and to . By inverting (6), the covariance matrix can be obtained.(c) The hyperparameter vector is updated using an iterative reestimation equation. Firstly, randomly guess , and calculate where is the th diagonal element of the covariance matrix . Then reestimate as follows: where . Set , and reestimate and again until convergence. Then is estimated so that the classification model is obtained [27].

##### 2.3. Pairwise-Coupled RVM

The traditional RVM formulation is designed only for the issue of binary classification, in which the output is either positive (+1) or negative (−1). In order to resolve the current simultaneous-fault problem, multiclass strategies of one-versus-all and one-versus-one (pairwise coupling) can be employed. Traditionally, one-versus-all strategy constructs a group of classifiers in a -label classification problem. For any undefined input , the classification vector , where if or if . The one-versus-all strategy is simple and easy to implement. However, it generally gives a poor result [25, 26, 31] since one-versus-all does not consider the pairwise correlation and, hence, induces a much larger indecisive region than one-versus-one as shown in Figure 2. On the other hand, pairwise coupling (one-versus-one) also constructs a group of classifiers in a -label classification problem. However, each is composed of a set of different pairwise classifiers , . Since and are complementary, there are totally classifiers in as shown in Figure 3.

There are several available methods for pairwise coupling strategy [25], which are, however, unsuitable for simultaneous-fault diagnosis because of the constraint . Note that the nature of simultaneous-fault diagnosis is that is unnecessarily equal to 1. Therefore, the following simple pairwise coupling strategy for simultaneous-fault diagnosis is proposed. Every is calculated as where is the number of training feature vectors with either th or th labels. Hence, the probability can be more accurately estimated from because the pairwise correlation between the labels is taken into account. With the previous pairwise coupling strategy, the proposed framework, PCRVM, could estimate the probability vector in high accuracy level and, hence, generates a higher classification accuracy for simultaneous-fault diagnosis.
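
The coupling equation itself is not reproduced in this text. As a hedged illustration, the sketch below assumes that the probability of each label is a weighted average of its pairwise classifier outputs, with weights equal to the number of training vectors carrying either label of the pair, as the description suggests. The function name, argument layout, and numbers are hypothetical.

```python
def couple_pairwise(pair_prob, pair_count):
    """Combine pairwise outputs into one probability per label (assumed weighted average).

    pair_prob[i][j]  : output of the pairwise classifier for label i against label j
    pair_count[i][j] : number of training vectors carrying label i or label j
    """
    d = len(pair_prob)
    coupled = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        weight = sum(pair_count[i][j] for j in others)
        score = sum(pair_count[i][j] * pair_prob[i][j] for j in others)
        coupled.append(score / weight)
    return coupled


# Three labels with 200 training vectors each, so every pair has 400 vectors.
pair_prob = [
    [None, 0.9, 0.8],
    [0.1, None, 0.3],
    [0.2, 0.7, None],
]
pair_count = [[0, 400, 400], [400, 0, 400], [400, 400, 0]]
coupled = couple_pairwise(pair_prob, pair_count)
print(coupled, sum(coupled))
```

In this example the coupled probabilities do not sum to 1, which is the property that lets more than one label be reported for a simultaneous fault.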

##### 2.4. Decision Threshold

The pairwise-coupled probabilistic classifier produces a probability vector whose length equals the number of single-fault labels and whose elements indicate the probability of occurrence of every single fault. This probability vector can be provided to the user as a quantitative measure for reference and further processing. In practice, it is desirable to express the multilabel decision as a binary decision vector, which is not directly available from the probability vector. By applying a decision threshold, the decision vector can be derived from the probability vector: each element indicates whether the input belongs to the corresponding label or not, depending on whether the corresponding probability passes the threshold. For example, if the probabilities of two labels pass the threshold, the input is diagnosed as a simultaneous fault of those two labels. Note that the decision threshold is a major factor affecting the classification accuracy, and its best value is domain specific. The diagnosis of the GTGS therefore requires an efficient search algorithm for the optimal setting of the decision threshold. In this study, a simple and effective method, grid search (GS), is adopted. The search region for the decision threshold was set to the range of 0 to 1. By applying a reasonably small interval, say 0.01, a series of candidate thresholds was generated. After evaluating the performance index, the F-measure, of the candidate thresholds using the validation dataset, the best threshold can be obtained. Even though the GS method is time-consuming, it can ensure a global solution if the grid covers the whole search space. As a result, GS will not be trapped in a local optimum, unlike computational intelligence techniques such as GA and PSO in [27, 32]. Because the threshold lies between 0 and 1, the search time is not very long.
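
The thresholding and grid search steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: a label is switched on when its probability is at least the threshold (the exact comparison is not recoverable from this text), the F-measure is taken in its common micro-averaged form, and the validation probabilities and all function names are made up for the example.

```python
def apply_threshold(probabilities, threshold):
    """Turn a probability vector into a 0/1 decision vector (assumes p >= threshold)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def f_measure(predicted, actual):
    """Micro-averaged F-measure over all patterns and labels (assumed form)."""
    overlap = sum(y * t for ys, ts in zip(predicted, actual) for y, t in zip(ys, ts))
    total = sum(map(sum, predicted)) + sum(map(sum, actual))
    return 2.0 * overlap / total if total else 0.0


def grid_search_threshold(prob_vectors, true_vectors, step=0.01):
    """Try every threshold on the grid 0, step, ..., 1 and keep the best F-measure."""
    best_threshold, best_score = 0.0, -1.0
    for k in range(int(round(1.0 / step)) + 1):
        threshold = k * step
        decisions = [apply_threshold(p, threshold) for p in prob_vectors]
        score = f_measure(decisions, true_vectors)
        if score > best_score:  # ties keep the smallest threshold
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up validation outputs: two single faults and two simultaneous faults.
val_probs = [
    [0.91, 0.12, 0.08],
    [0.153, 0.88, 0.10],
    [0.72, 0.64, 0.05],
    [0.07, 0.58, 0.81],
]
val_truth = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]]
best_threshold, best_score = grid_search_threshold(val_probs, val_truth)
print(round(best_threshold, 2), best_score)
```

With a step of 0.01 only 101 candidates are evaluated, so an exhaustive scan stays cheap for a single threshold. The cost would grow quickly if the classifier parameter were searched on a fine grid at the same time.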

##### 2.5. Evaluation

The traditional statistical measure of classification accuracy only considers exact matching of the predicted decision vector against the true decision vector. This evaluation is, however, unsuitable for simultaneous-fault diagnosis, where partial matching is preferred. Therefore, a well-known and common evaluation method called the F-measure [32–34] is employed. The F-measure is mostly used for performance evaluation of information retrieval systems, where a document may carry a single tag or multiple tags simultaneously, which is very similar to the current study. By using the F-measure, both single-fault and simultaneous-fault test cases can be fairly evaluated. The definition of the F-measure is given in (10); the larger the F-measure value, the higher the diagnostic accuracy. In (10), the predicted and true decision vectors are compared element by element over all single-fault labels and all test patterns, which include both single-fault and simultaneous-fault patterns.
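
Equation (10) is not reproduced in this text. For orientation, a common micro-averaged form of the F-measure for multilabel decisions is given below. The symbols are assumptions made for this illustration: *y* and *t* are the predicted and true 0/1 decisions for label *j* of test pattern *i*, *d* is the number of single-fault labels, and *N* is the number of test patterns.

```latex
F = \frac{2\sum_{i=1}^{N}\sum_{j=1}^{d} y_{j}^{(i)}\, t_{j}^{(i)}}
         {\sum_{i=1}^{N}\sum_{j=1}^{d} y_{j}^{(i)} + \sum_{i=1}^{N}\sum_{j=1}^{d} t_{j}^{(i)}}
```

As a worked case, a single pattern with predicted vector [1, 1, 0] and true vector [1, 0, 0] gives F = 2 · 1 / (2 + 1) ≈ 0.67, whereas exact matching would score the same prediction as entirely wrong.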

##### 2.6. Summary of the Proposed Framework

The proposed framework and techniques are summarized in Figure 4. Figure 4(a) shows the workflow that combines WPT, TDSF, and KPCA into the feature extraction submodule. Every dataset for training, validation, and testing is required to go through the feature extraction process. Figure 4(b) shows the architecture of the classifier, in which pairwise coupling is deployed as depicted in Figure 3. The classifier is then passed to a grid search optimizer to search for the optimal classifier parameter (i.e., the width for PCRVM/RVM or the spread for PNN/PCPNN) and the decision threshold based on the validation set and the F-measure (Figure 4(c)). The classifier produces the probability vector for each case in the validation set. To optimize the classifier parameter and the decision threshold, the fitness of each candidate parameter is evaluated by the F-measure over the validation set under a 5-fold cross-validation strategy. Finally, the test dataset is used to evaluate the performance of the proposed framework with the optimal classifier parameter and decision threshold obtained.

**Figure 4:** (a) Feature extraction; (b) Model training; (c) Parameter optimization; (d) Evaluation.

#### 3. Experimental Setup and Data Preparation for a Case Study

To obtain representative sample data for model construction and verify the effectiveness of the proposed framework, experiments were carried out. The details of the experiments are discussed in the following subsections, followed by the corresponding results and comparisons. All the proposed methods mentioned were implemented by using Matlab R2008a and executed on a PC with a Core 2 Duo E6750 @ 2.13 GHz with 4 GB RAM onboard.

##### 3.1. Test Rig and Sample Data Acquisition

The experiments were performed on a test rig as shown in Figure 5, which can simulate the GTGS of the Macau power plant. The test rig includes a computer for data acquisition, an electric load simulator, a prime mover, a gearbox, a flywheel, and an asynchronous generator. As it is not realistic to implement the diagnostic system to monitor all the components in the GTGS in one study, this research selected the fault detection of the gearbox as a case study. The test rig can simulate many common faults in the gearbox of the GTGS, such as unbalance, misalignment, and gear crack. A total of 13 cases, including one normal case, 8 single-faults, and 4 simultaneous-faults in the gearbox, were simulated in the test rig in order to generate sample training and test dataset. Some samples for single-fault and simultaneous-fault patterns are shown in Figures 6 and 7, respectively. Figures 6 and 7 show that the signal profiles between single-faults and simultaneous-faults are very similar, which make them difficult to be distinguished manually, but their degrees of similarities can be detected using the proposed framework.

Table 1 shows the detailed descriptions of the thirteen simulated cases, in which the mechanical misalignment of the gearbox was simulated by adjusting the height of the gearbox with shims, and the mechanical unbalance was simulated by adding an eccentric mass on the output shaft. The sample vibration data were acquired by two triaxial accelerometers located on the outer case of the gearbox, as shown in Figure 5. The accelerometers record the gearbox vibration signals along the horizontal and vertical directions. The vibration signal in the axial direction is ignored since the test rig uses spur gears, for which the vibration along the axial direction is not obvious. To construct and test the diagnostic framework, each simulated single fault was repeated 200 times and each simultaneous fault 100 times under various random electric loads. Each time, 2 seconds of vibration data were recorded at a sampling rate of 2048 Hz. The sampling rate was set higher than the gear meshing frequency so that no signal content is missed. In other words, every sample for each case has 16384 data points (2 accelerometers × 2 measurement directions × 2 seconds × 2048 Hz). Finally, there were 1800 single-fault sample data (i.e., (1 normal case + 8 kinds of single faults) × 200 samples) and 400 simultaneous-fault sample data (i.e., 4 kinds of simultaneous faults × 100 samples). To test the diagnostic performance for both single faults and simultaneous faults, the sample data were divided into different subsets as shown in Table 2, which include, for example, a validation set of 360 single-fault patterns without feature extraction and a validation set of the extracted features of 120 simultaneous-fault patterns.

##### 3.2. Feature Extraction by WPT and TDSF

Feature extraction is the determination of a feature vector from a signal with minimal loss of important information. A feature vector could usually be a reduced-dimensional representation of that signal so as to reduce the modeling complexity and computational cost. Through WPT, a set of 2^{L} subbands of a signal can be obtained, and *L* is the level of WPT decomposition.

With reference to the literature [8], statistical characteristics can be used for representing signal subbands effectively so that the dimensionality of a feature vector extracted from vibration signals can be reduced. For each signal subband, the statistical characteristics include the following:

1. maximum of the wavelet coefficients,
2. minimum of the wavelet coefficients,
3. mean of the wavelet coefficients,
4. standard deviation of the wavelet coefficients.

In this case study, a vibration signal of 4096 sample points gives 4096 wavelet coefficients in 2^{L} subbands after *L* levels of WPT decomposition. Using the four statistics above, however, only 4 × 2^{L} features are needed; for example, *L* = 4 gives 16 subbands and 64 features, which greatly reduces the input complexity (from 4096 inputs to 64 inputs) for the next stage. For brevity, the term WPT hereafter refers to the process of wavelet packet decomposition together with the calculation of these statistics.
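
The reduction from 4096 points to 64 features can be illustrated with a small pure-Python sketch. It makes several assumptions: four decomposition levels (which is what 64 features at four statistics per subband implies), a Haar filter in place of the Daubechies wavelets used in the study, and a synthetic sine wave as a stand-in for a vibration signal. All function names are hypothetical.

```python
from math import pi, sin, sqrt
from statistics import mean, pstdev


def haar_step(signal):
    """Split a signal into Haar low-pass and high-pass halves (Haar swapped in for Db3-Db5)."""
    pairs = range(0, len(signal) - 1, 2)
    low = [(signal[i] + signal[i + 1]) / sqrt(2) for i in pairs]
    high = [(signal[i] - signal[i + 1]) / sqrt(2) for i in pairs]
    return low, high


def wavelet_packet(signal, levels):
    """Decompose both branches at every level, giving 2**levels subbands."""
    bands = [list(signal)]
    for _ in range(levels):
        # Bands come out in natural filter order, not sorted by frequency.
        bands = [half for band in bands for half in haar_step(band)]
    return bands


def subband_features(signal, levels=4):
    """Four statistics per subband: maximum, minimum, mean, standard deviation."""
    features = []
    for band in wavelet_packet(signal, levels):
        features += [max(band), min(band), mean(band), pstdev(band)]
    return features


# Synthetic 2-second signal at 2048 Hz (4096 points), used only as a placeholder.
signal = [sin(2 * pi * 50 * n / 2048) for n in range(4096)]
bands = wavelet_packet(signal, 4)
features = subband_features(signal)
print(len(bands), len(bands[0]), len(features))
```

Each of the 16 subbands holds 256 coefficients, so the coefficient count stays at 4096 while the feature vector shrinks to 64 entries.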

After decomposition by WPT, a time-domain statistical method is usually further employed to extract time-domain features of the raw signals, which reflect the physical characteristics of time series data. For instance, [12, 13] applied time-domain statistical features, such as mean, standard deviation, skewness, crest factor, and kurtosis, for fault detection on gear trains and low speed bearings, respectively. In this study, 10 statistical time-domain features are employed to analyze the vibration signal. Table 3 presents the statistical time-domain features. After feature extraction by WPT and TDSF, the number of extracted features is shown in Table 4.
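
As a hedged illustration of time-domain statistical features, the sketch below computes the five quantities named in the text (mean, standard deviation, skewness, kurtosis, and crest factor) using their usual population definitions. The full set of 10 features in Table 3 is not reproduced here, the function name is hypothetical, and the signal is assumed to be non-constant so that the divisions are defined.

```python
from math import sqrt


def time_domain_features(x):
    """Common time-domain statistics of a raw signal (population definitions)."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    std = sqrt(sum(c * c for c in centered) / n)
    rms = sqrt(sum(v * v for v in x) / n)
    return {
        "mean": mean,
        "std": std,
        "skewness": sum(c ** 3 for c in centered) / (n * std ** 3),
        "kurtosis": sum(c ** 4 for c in centered) / (n * std ** 4),
        "crest_factor": max(abs(v) for v in x) / rms,  # peak relative to RMS
    }


# A square wave is a convenient check: zero mean, unit spread, no peaks above RMS.
square_wave = [1.0, -1.0, 1.0, -1.0]
stats = time_domain_features(square_wave)
print(stats)
```

Impulsive faults such as a cracked tooth tend to raise kurtosis and crest factor, which is one reason these statistics are popular for rotating machinery.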

##### 3.3. Dimension Reduction by KPCA

Although the useful features can be extracted by WPT and TDSF, the dimension of these extracted features is still high. Such high dimension can degrade the diagnostic performance. To tackle this issue, KPCA is applied to obtain a small set of principal components of the extracted features. With the eigenvalues obtained from KPCA, the unimportant transformed features could be deleted. Therefore, only a limited number of the principal components are necessary, and 95% of the information in the features can be retained.

##### 3.4. Normalization

To ensure that all features contribute evenly, all reduced features go through normalization, which maps each output feature of KPCA into a fixed interval. After normalization, a processed dataset is obtained. The pairwise-coupled probabilistic classification algorithm can then be employed to construct the fault classifier based on this dataset.
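A minimal sketch of one possible normalization is given below. Both the min-max form and the target interval [0, 1] are assumptions made for illustration, since the interval and the formula are not recoverable from the text; the function name and its parameters are hypothetical.

```python
def min_max_normalize(column, low=0.0, high=1.0):
    """Rescale one feature column linearly into [low, high].

    Assumed form: low + (high - low) * (v - v_min) / (v_max - v_min).
    """
    v_min, v_max = min(column), max(column)
    if v_max == v_min:
        # A constant feature carries no information; map it to `low`.
        return [low for _ in column]
    scale = (high - low) / (v_max - v_min)
    return [low + scale * (v - v_min) for v in column]

normalized = min_max_normalize([2.0, 4.0, 6.0])
print(normalized)
```

The limits should be taken from the training data only and then reused for the validation and test data, so that all datasets are scaled consistently.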

#### 4. Experimental Results and Discussion

To verify the effectiveness of the proposed framework, a set of experiments using different combinations of methods was carried out on the validation and test datasets. The performance of all experiments was evaluated based on the F-measure.
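For illustration, a multilabel F-measure with partial matching can be sketched as below. The micro-averaged form used here (twice the number of correctly predicted labels divided by the total number of predicted and true labels) is an assumption; the function name and the toy label vectors are hypothetical.

```python
def f_measure(predicted, actual):
    """Micro-averaged F-measure over binary label vectors.

    `predicted` and `actual` are lists of 0/1 label vectors of equal shape.
    """
    overlap = sum(
        p * t
        for pred_row, true_row in zip(predicted, actual)
        for p, t in zip(pred_row, true_row)
    )
    total = sum(map(sum, predicted)) + sum(map(sum, actual))
    # With no predicted and no true labels, treat the match as perfect.
    return 2 * overlap / total if total else 1.0

# The first case predicts one extra fault; the second case is exact.
score = f_measure([[1, 0, 1], [0, 1, 0]], [[1, 0, 0], [0, 1, 0]])
print(score)
```

A partially correct prediction of a simultaneous fault therefore still earns credit for the faults it identifies correctly.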

##### 4.1. Selection of Kernel Parameters, Kernel Functions, and Decision Thresholds

Since there are many combinations of mother wavelets and kernels for the probabilistic classifier, a set of experiments was carried out to determine the best system configuration. In the WPT phase, the mother wavelet and the level of decomposition were selected by trial and error. In the family of mother wavelets, the Daubechies wavelet (Db) is the most popular one and was, hence, employed in the experiments. In this case study, three Daubechies wavelets Db3, Db4, and Db5 were tried, and, hence, the range of was set from 3 to 5. Moreover, three different kernel functions for KPCA, namely, *linear*, *radial basis function (RBF)*, and *polynomial*, were tested. Different kernel functions have different hyperparameters to adjust. However, it is very time-consuming to try all the different values of the hyperparameters. To reduce the number of trials, the hyperparameter of RBF based on was tried for ranged from −3 to +3, and the hyperparameter of the polynomial kernel was taken from 2 to 5. Moreover, the common width for both RVM and PCRVM, the common spread for both PNN and PCPNN, and the decision threshold were set to 1, 1, and 0.5, respectively, while determining the best configuration of the feature extraction module. Under this configuration, the best combination of kernel function and its parameter for feature extraction (WPT + TDSF + KPCA) is presented in Table 5.

According to experimental results not listed here, the classifiers using KPCA under the *polynomial* kernel with and the mother wavelet of Db4 with level 4 reach the highest diagnostic accuracies, and, hence, this combination of feature extraction techniques was finally selected. Table 5 also indicates that 77 principal components are obtained by using this combination of feature extraction techniques. In other words, a raw signal of 16384 data points can be transformed to 77 features as the input variables of the classifiers.

After determining the configuration of the feature extraction module, the next step is to determine the optimal parameters of the classifiers by using grid search. As mentioned previously, different probabilistic classifiers have their own hyperparameters for tuning: PNN/PCPNN uses a spread and RVM/PCRVM employs a width. In this case study, the spread was examined from 1 to 3 at an interval of 0.5, and the width was selected from 1 to 8 at an interval of 0.5. To find the optimal decision threshold, the search region was set within 0 to 1 at an interval of 0.01. Then 5-fold cross-validation was applied to determine the best combination of the parameters. Finally, the best hyperparameters and thresholds for PCRVM (width, threshold) and PCPNN (spread, threshold) were found to be (6.5, 0.67) and (1, 0.76), respectively.
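The exhaustive search over these two grids can be sketched as follows. The scoring function below is a hypothetical stand-in that peaks at the reported PCRVM optimum; in the actual procedure it would be the cross-validated F-measure of a trained classifier, which is not reproduced here.

```python
def grid_search(score, widths, thresholds):
    """Evaluate every (width, threshold) pair and return the best pair."""
    best = None
    for w in widths:
        for t in thresholds:
            s = score(w, t)
            if best is None or s > best[0]:
                best = (s, w, t)
    return best[1], best[2]

# Grids built from integer indices to avoid accumulated rounding error.
widths = [1 + 0.5 * i for i in range(15)]      # 1.0, 1.5, ..., 8.0
thresholds = [i / 100 for i in range(101)]     # 0.00, 0.01, ..., 1.00

def toy_score(w, t):
    # Hypothetical score surface with its maximum at (6.5, 0.67).
    return -((w - 6.5) ** 2) - (t - 0.67) ** 2

best_width, best_threshold = grid_search(toy_score, widths, thresholds)
print(best_width, best_threshold)
```

Because every grid point is evaluated, the returned pair is the best one on the grid, at the cost of 15 × 101 evaluations in this example.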

##### 4.2. Evaluation of the Proposed Framework

To evaluate the effectiveness of the proposed probabilistic classifier, pairwise coupling, and feature extraction strategies, eight experiments were conducted, and their results are presented in Table 6. The objectives are to compare the diagnostic framework with versus without feature extraction, the classification method using a single-label SVM classifier versus multilabel PNN and RVM classifiers, and the framework with versus without the pairwise coupling strategy. Since different aspects of the proposed framework are investigated for single faults and simultaneous faults, various combinations of training and test datasets are employed. For example, experiments 1, 3, 5, and 7 have no feature extraction, so the training dataset only uses the raw dataset, whereas the feature-extracted dataset was selected as training data for experiments 2, 4, 6, and 8. As described in Section 4.1, the best decision threshold for both PNN and PCPNN is 0.76, while that for both RVM and PCRVM is 0.67. Table 6 reveals that the experiments with feature extraction (2, 4, 6, and 8) show an increase in diagnostic accuracy of 4%–8% as compared with the experiments without feature extraction (1, 3, 5, and 7). These results indicate that the proposed feature extraction method (WPT + TDSF + KPCA) is effective.

To verify the effectiveness of the pairwise coupling strategy, a set of experiments without pairwise coupling was carried out using the one-versus-all strategy, and the experimental results are shown in Table 6 (experiments 1 to 4). Compared with the results of experiments 5 to 8, where the pairwise coupling strategy is employed, the accuracies in the experiments without pairwise coupling are generally 2% to 5% worse. The main reason is that only one binary classifier per label was constructed in the one-versus-all strategy, so that there are many indecision regions between pairs of classes. Therefore, when a test case lies in these regions, the classifiers mostly fail to classify the faults correctly. However, the classifiers with the pairwise coupling strategy (i.e., PCPNN and PCRVM) can minimize those indecision regions. Hence, a probabilistic classifier with the pairwise coupling strategy is an effective approach to improve the diagnostic accuracy. In this case study, the proposed framework in experiment 8 achieves the best F-measure of 0.9129. Note that the proposed framework only employs a training set of single-fault patterns to construct the classifier, while the overall performance is evaluated over both single-fault and simultaneous-fault test patterns. Therefore, the proposed method can successfully detect simultaneous faults without costly simultaneous-fault training patterns.

The effectiveness of the proposed framework is further evaluated by looking at the size of the training dataset. In the training of the proposed diagnostic model, only single-fault patterns are processed. From Table 2, there are 160 simultaneous-fault patterns, while there are 1240 patterns of single faults and simultaneous faults in total. When the 160 simultaneous-fault patterns are not required, a significant reduction in training patterns of 12.9% (i.e., (160/1240) × 100%) is achieved. In fact, the time reduction is even more significant because simultaneous-fault patterns are more costly and difficult to acquire.
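The way a classifier trained on single-fault patterns can report simultaneous faults may be sketched as follows: every label whose probability reaches the decision threshold is reported, so more than one fault can be flagged for the same test case. The function name, the toy probability vector, and the use of "greater than or equal to" at the threshold are assumptions.

```python
def decide_faults(probabilities, threshold):
    """Turn a probability vector into a 0/1 label vector.

    Each entry is the estimated probability that the corresponding
    single fault is present; all entries at or above the threshold
    are reported as present.
    """
    return [1 if p >= threshold else 0 for p in probabilities]

# Toy output for four fault labels, using the threshold 0.67 reported for PCRVM.
labels = decide_faults([0.91, 0.12, 0.70, 0.30], 0.67)
print(labels)
```

Here the first and third faults are reported together, which corresponds to a simultaneous-fault diagnosis even though no simultaneous-fault pattern was used for training.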

To further verify the superiority of the proposed framework, the latest single-label binary classification framework based on SVM [28] was also applied to the same datasets for comparison. The evaluation results are shown in Table 7, in which the overall diagnostic accuracy of the single-label method is lower than that of the proposed PCRVM framework. Therefore, it can be concluded that the proposed framework outperforms the compared methods for single-fault and simultaneous-fault diagnosis of the GTGS on these datasets.

#### 5. Conclusions

In this paper, simultaneous-fault diagnosis for the GTGS is studied. A systematic framework combining feature extraction, pairwise coupling probabilistic classification, and parameter optimization based on a partial-match assessment has been developed to overcome the challenges of simultaneous-fault diagnosis for complicated mechanical systems. In the proposed framework, the feature extraction module is designed by combining the techniques of WPT + TDSF + KPCA to effectively capture the single-fault components in the simultaneous-fault vibration patterns. A pairwise coupling strategy is also employed to deal with the interaction between the independent labels, which outperforms the approaches without pairwise coupling by 2% to 5% in simultaneous-fault diagnosis. In short, this research work makes contributions in four aspects: (1) it is the first research to find that the features of single-fault signal patterns of the GTGS can be found in their simultaneous-fault patterns by the proposed feature extraction method; (2) a high diagnostic accuracy for both unseen single-fault and simultaneous-fault patterns is achieved by the proposed framework; (3) the proposed framework can achieve a high diagnostic efficiency while a large amount of expensive simultaneous-fault training data is no longer necessary; and (4) it is an original application of the proposed framework to the problem of GTGS diagnosis. Since the proposed framework for simultaneous-fault diagnosis is general, it could be applied to other similar industrial problems.

#### Nomenclature

Diagonal matrix of hyperparameters | |

: | Diagonal matrix in RVM |

: | th probabilistic classifier |

: | Probability of belonging to the th label |

: | Pairwise classifier |

: | Probability of belonging to the th label against the th label |

: | Dataset of single-faults |

: | Dataset of simultaneous-faults |

: | Feature matrix after feature extraction |

: | Training feature matrix after feature extraction |

: | Validation feature matrix after feature extraction |

: | Test feature matrix after feature extraction |

: | th raw signal data |

: | Upper limit of feature |

: | Lower limit of feature |

: | Feature set extracted by WPT and TDSF |

Number of single-fault labels | |

: | Hyperparameter of polynomial kernel function of KPCA |

: | Feature matrix of single-faults |

: | Feature matrix of simultaneous-faults |

: | -measure |

Feature vector | |

: | th feature vector |

Set of feature vectors | |

: | Kernel function of KPCA |

: | Decomposition level of Db mother wavelet |

: | Probabilistic classifier |

: | Number of training data |

: | Probability |

: | Spread of PNN |

: | Optimal spread of PNN |

Set of fault labels in training dataset | |

: | Fault label of the th training case |

Width of RVM | |

: | Optimal width of RVM |

: | Most probable weight vector in RVM |

Raw signal data | |

: | Predicted label vector |

: | th predicted label |

Normalized feature | |

: | Predicted decision |

: | Transformed variables for vector |

: | Hyperparameter vector of RVM |

: | th hyperparameter in vector |

: | th element in eigenvector corresponding to the th largest eigenvalue |

: | th eigenvalue of KPCA |

Decision threshold | |

: | Optimal decision threshold |

: | Probability vector |

: | Probability of the th vector |

: | Covariance matrix in RVM |

: | th diagonal element of covariance matrix Σ |

: | Logistic sigmoid function |

Design matrix in RVM. |

#### Acknowledgment

The authors would like to thank the University of Macau for its funding support under Grant nos. MYRG153(Y1-L2)-FST11-YZX and MYRG075(Y2-L2)-FST12-VCM.