Abstract

Faulty samples are much harder to acquire than normal samples, especially in complicated systems. As a result, the training set is often incomplete in terms of sample types, which in turn decreases the diagnostic accuracy. In this paper, the relationship between sample-type incompleteness and classifier-based diagnostic accuracy is discussed first. Then, a support vector data description-based approach, which takes the effects of sample-type incompleteness into consideration, is proposed to refine the construction of fault regions and increase the diagnostic accuracy when sample types are incomplete. The effectiveness of the proposed method was validated on both a Gaussian distributed dataset and a practical dataset, with satisfactory results.

1. Introduction

Fault diagnosis has been the subject of great interest in the control research community in response to the increasing requirements for operating reliability and product quality [1–4]. During the last decades, fault diagnosis has been well studied and many useful approaches have been proposed, such as parameter estimation [5] and state estimation [6]. Generally, these approaches diagnose faults by analyzing the residual between the real output and the model output and are therefore called model-based fault diagnosis. The core of the model-based diagnostic approach is to build a process model that runs in parallel with the process [1].

Due to the ever-growing complexity of industrial systems, the modeling of a complex industrial process becomes very difficult and time-consuming. In some newly developed processes, such models are even unavailable. Consequently, data-driven approaches have been introduced to the field of fault diagnosis. This kind of approach does not need to build precise models for industrial processes. Instead, it utilizes the process data, including offline and online data, to approximate the relationship between system inputs and the operating statuses. The data-driven approach is especially suitable for fault detection (FD) and isolation in complex systems and hence has aroused much interest recently [4, 7–9].

One classical data-driven diagnostic approach is to divide the data space into several fault regions by employing particular classifiers [10]. Then, in each fault region, all data samples belong to the same operating status [11]. This classifier-based approach makes the diagnosis by locating the fault region that the testing sample falls into and avoids the need for system models and expert knowledge. During the last 10 years, numerous studies have addressed its diagnostic accuracy and decision speed, and the reliability and feasibility of the classifier-based fault diagnosis approach have been significantly improved [12–14].

As has been pointed out by Russell et al. [7], the main drawback of data-driven approaches is that the diagnostic proficiency is highly dependent on the quantity and quality of the process data. When using the classifier-based approaches for fault diagnosis, fine data samples are required to ensure the performance of the diagnosis. More specifically, according to the research of Hakkila et al. [15], the classification performance is especially sensitive to the completeness of the training samples. A complete set of training samples is essential for making the correct region divisions. This implies that when using classifier-based approaches, training samples for all types of faults should first be prepared.

However, in real applications the difficulty of acquiring samples differs greatly from one fault type to another. In most cases, samples for common faults and for normal operating statuses can easily be obtained, whereas samples for rare faults and multifaults are seldom acquired. It is very hard to collect a complete training set which contains samples for all possible faults [14]. The problem of sample-type incompleteness has greatly reduced the feasibility of classifier-based approaches.

Consequently, poor diagnostic performance under incomplete faulty samples greatly hampers the practical application of data-driven fault diagnosis. However, this issue has received little attention in the literature. In this paper, a support vector data description (SVDD)-based approach is proposed in an attempt to improve the diagnostic performance of the classifier-based approach when sample types are incomplete. It reduces the sensitivity of diagnostic accuracy to sample completeness.

This paper is organized as follows. Section 2 illustrates how sample-type incompleteness decreases the diagnostic accuracy. Section 3 introduces the classification mechanism of the SVDD. Section 4 presents the refined diagnostic algorithm using SVDD. In Section 5 the effectiveness of the proposed method is validated by experimental comparisons with conventional methods. Finally, conclusions and a discussion are given in Section 6.

2. Effects of the Sample-Type Incompleteness on the Classifier-Based Fault Diagnosis

Fault diagnosis includes FD and fault isolation (FI). FD attempts to judge whether the system is faulty or normal, and FI is then employed to identify which fault has occurred in the system. In contrast to conventional fault diagnosis, where FD and FI are two different steps, the classifier-based diagnostic approaches complete both FD and FI in one step. Therefore, they have been widely used owing to their high efficiency. The classifier-based approach uses particular classifiers to map the linearly inseparable faulty samples into a high-dimensional space where different types of samples are linearly separable. A hyperplane is then generated to divide the high-dimensional space into several regions, called fault regions, each of which contains samples of only one type, as shown in Figure 1. When testing samples are imported, they are first mapped into the same high-dimensional space. Then, by locating which fault region they have fallen into, the operating status can be identified, and the aim of FD and FI can be achieved [10, 11].

Given a training set, $X$, which consists of process data for a normal status, $X_0$, and $n$ types of faulty samples,
$$X = \{X_0, X_1, \ldots, X_n\}, \tag{1}$$
where $X_i$ is the sample set for the $i$th fault, classifiers like artificial neural networks (ANNs) and multiclass SVMs divide the sample space into one “Normal” region and $n$ faulty regions:
$$H = \phi(X), \tag{2}$$
where $\phi$ is the mapping function for the given classifier and $H$ is the hyperplane. The fault regions should be $\{\Omega_0, \Omega_1, \ldots, \Omega_n\}$.

When the testing samples, $Z$, are acquired, the fault can be isolated by locating which fault region they have fallen into:
$$D = f(H, Z) = f(\phi(X), Z), \tag{3}$$
where $f$ is the decision function and $D$ is its decision value. Because the testing samples, $Z$, have already been obtained at diagnosis time, they can be considered constants in the above formula. Thus, the diagnostic performance mainly depends on three factors: $\phi$, $f$, and $X$. Here, the function $\phi$ is decided by the classifier, including algorithm selection and parameter setting, and the function $f$ is decided by the fault diagnostic flowchart. Thus the training set, $X$, can be regarded as the only variable in formula 3 that decides the diagnostic accuracy. In the condition of fault type incompleteness, that is, when no process data has been acquired for some operating statuses, the variable $X$ becomes
$$X' = \{X_0, X_1, \ldots, X_{n-m}\}, \tag{4}$$
where $m$ types of faulty samples are assumed to be missing. As the variable changes, a different hyperplane, $H'$, is generated, and the diagnostic result, $D$, is likely to change as well. This implies that the incompleteness of fault types leads to misclassification and a decrease of diagnostic accuracy.

Figure 2 illustrates how sample-type incompleteness affects the diagnosis. In this figure, the complete training set is composed of four types of samples. Figure 2(a) shows the fault regions obtained from the complete training set. After the fourth sample type is removed, the newly generated fault regions are those shown in Figure 2(b), where the region divisions are significantly different.

The shaded areas in Figure 3 denote the misclassified fault regions caused by the incompleteness of the sample types. Testing samples that fall into these shaded areas will be diagnosed incorrectly.
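
The effect described in this section can be reproduced with a very small stand-in classifier. The sketch below is only illustrative: it uses a nearest-centroid rule instead of an ANN or a multiclass SVM, and the 2-D samples, the label names, and the testing sample `z` are all hypothetical. It shows that a multiclass classifier trained without one sample type has no choice but to assign a sample of that type to one of the remaining regions.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def train(training_set):
    """Map each status label to the centroid of its samples."""
    return {label: centroid(samples) for label, samples in training_set.items()}

def diagnose(model, z):
    """Return the trained label whose centroid is closest to z."""
    return min(model, key=lambda label: math.dist(model[label], z))

# Hypothetical 2-D samples for four operating statuses.
complete = {
    "Normal":  [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2)],
    "Fault 1": [(4.0, 0.0), (4.2, 0.1), (3.9, -0.2)],
    "Fault 2": [(0.0, 4.0), (0.1, 4.2), (-0.2, 3.9)],
    "Fault 3": [(4.0, 4.0), (4.1, 3.8), (3.9, 4.2)],
}
# Same training set with the "Fault 3" samples missing.
incomplete = {k: v for k, v in complete.items() if k != "Fault 3"}

z = (3.8, 4.4)  # testing sample taken near the "Fault 3" cluster
full_result = diagnose(train(complete), z)
partial_result = diagnose(train(incomplete), z)
print(full_result, partial_result)
```

With the incomplete set, the returned label is necessarily one of the three trained statuses, which is exactly the kind of misclassification the shaded areas represent.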

3. Support Vector Data Description

SVDD is a one-class classifier that judges whether a testing sample belongs to the target sample type or not and requires only one type of sample for training [16]. It attempts to find the hypersphere with minimum volume that contains all target samples, as shown in Figure 4. When a testing sample falls into the hypersphere, it is regarded as belonging to the same type as the target samples.

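Given the target data, $\{x_i\}_{i=1}^{N}$, where $N$ is the number of training samples, the SVDD searches for the hypersphere, $S(a, R)$, by minimizing the radius:
$$\min_{R, a, \xi}\; R^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\quad \|x_i - a\|^2 \le R^2 + \xi_i,\; \xi_i \ge 0,\; i = 1, \ldots, N, \tag{5}$$
where $a$ denotes the centre of the sphere and $R$ is its radius. $C$ is the regularization factor that gives the trade-off between the radius and the number of errors, and $\xi_i$ is the slack variable. The following Lagrangian is constructed to solve this problem:
$$L(R, a, \xi, \alpha, \gamma) = R^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left(R^2 + \xi_i - \|x_i - a\|^2\right) - \sum_{i=1}^{N}\gamma_i\xi_i, \tag{6}$$
where $\alpha_i \ge 0$ and $\gamma_i \ge 0$ are the Lagrange multipliers. After taking the partial derivatives and setting them to zero, the dual problem for formula 6 can be written as
$$\max_{\alpha}\; \sum_{i=1}^{N}\alpha_i (x_i \cdot x_i) - \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j (x_i \cdot x_j) \quad \text{s.t.}\quad \sum_{i=1}^{N}\alpha_i = 1,\; 0 \le \alpha_i \le C. \tag{7}$$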

Formula 7 is a standard quadratic programming problem. There are many well-studied methods to solve such problems, for example, the active set method [17]. Once the multipliers $\alpha_i$ are obtained from formula 7, the centre $a$ and the radius $R$ in formula 5 can also be deduced.
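
For readers who want the intermediate step, the stationarity conditions behind the dual can be written out as follows. This is a sketch in the usual SVDD notation, which is assumed here: $R$ and $a$ are the radius and centre, $\xi_i$ the slack variables, $C$ the regularization factor, and $\alpha_i \ge 0$, $\gamma_i \ge 0$ the Lagrange multipliers.

```latex
\begin{aligned}
L(R, a, \xi, \alpha, \gamma)
  &= R^2 + C\sum_i \xi_i
     - \sum_i \alpha_i\left(R^2 + \xi_i - \|x_i - a\|^2\right)
     - \sum_i \gamma_i \xi_i, \\
\frac{\partial L}{\partial R} = 0
  &\;\Rightarrow\; \sum_i \alpha_i = 1, \\
\frac{\partial L}{\partial a} = 0
  &\;\Rightarrow\; a = \sum_i \alpha_i x_i, \\
\frac{\partial L}{\partial \xi_i} = 0
  &\;\Rightarrow\; C - \alpha_i - \gamma_i = 0
   \;\Rightarrow\; 0 \le \alpha_i \le C.
\end{aligned}
```

Substituting these conditions back into $L$ cancels the terms in $R^2$ and $\xi_i$ and leaves $\sum_i \alpha_i (x_i \cdot x_i) - \sum_i\sum_j \alpha_i\alpha_j (x_i \cdot x_j)$, which is the quantity maximized in the dual problem.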

Given a testing sample, $z$, if
$$\|z - a\|^2 \le R^2, \tag{8}$$
then $z$ belongs to the same type as the target data.
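
The dual problem and the membership test can be sketched in a few lines of plain Python. Several choices below are assumptions of the sketch and not the method used in the paper: the quadratic program is not solved with an active set method but approximated with a simple Frank-Wolfe type iteration (the Bădoiu-Clarkson minimum-enclosing-ball update), only the hard-margin case without slack is handled, the inner products are replaced by an RBF kernel with a hypothetical width `sigma`, and the sample coordinates are invented.

```python
import math

def rbf(x, y, sigma=0.5):
    """Gaussian (RBF) kernel; sigma is a hypothetical width, not a value from the text."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))

def sq_dist_to_centre(z, X, alpha, kernel=rbf):
    """Squared feature-space distance from z to the centre a = sum_i alpha_i * phi(x_i)."""
    cross = sum(a * kernel(x, z) for a, x in zip(alpha, X))
    centre = sum(ai * aj * kernel(xi, xj)
                 for ai, xi in zip(alpha, X)
                 for aj, xj in zip(alpha, X))
    return kernel(z, z) - 2.0 * cross + centre

def fit_svdd(X, kernel=rbf, iters=200):
    """Approximate hard-margin SVDD; returns (alpha, R2) with alpha >= 0 and sum(alpha) == 1."""
    alpha = [1.0] + [0.0] * (len(X) - 1)   # start with the centre on the first sample
    for t in range(1, iters + 1):
        # The sample farthest from the current centre is the ascent direction of the dual.
        far = max(range(len(X)),
                  key=lambda i: sq_dist_to_centre(X[i], X, alpha, kernel))
        step = 1.0 / (t + 1)
        alpha = [(1.0 - step) * a for a in alpha]
        alpha[far] += step                  # shift the centre toward that sample
    # Radius chosen so that every training sample lies inside or on the sphere.
    R2 = max(sq_dist_to_centre(x, X, alpha, kernel) for x in X)
    return alpha, R2

def is_target(z, X, alpha, R2, kernel=rbf):
    """Membership test: True when z lies inside or on the hypersphere."""
    return sq_dist_to_centre(z, X, alpha, kernel) <= R2 + 1e-12

# Hypothetical 2-D target samples.
target = [(0.0, 0.0), (0.3, 0.1), (-0.2, 0.4), (0.1, -0.3), (0.4, 0.4), (-0.3, -0.2)]
alpha, R2 = fit_svdd(target)
print(is_target((0.1, 0.1), target, alpha, R2))
print(is_target((5.0, 5.0), target, alpha, R2))
```

The iteration keeps the multipliers non-negative and summing to one at every step, so the constraint of the dual problem is satisfied throughout. A soft-margin variant with the upper bound $C$ would need a proper QP solver.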

Remark 1. Conventional classifiers such as Support Vector Machines (SVMs) require at least two types of samples while making the hyperplane. The output of these classifiers implies the type to which the testing sample is the most likely to belong. However, SVDD, which is specifically designed for one-class classification, requires only one type of training sample, that is, the target data. The aim of the SVDD classification is to judge whether the testing sample belongs to the type of target data.

4. Fault Diagnosis Using SVDD for the Condition of Sample-Type Incompleteness

As shown in formula 2, conventional classifiers take the whole training set, $X$, as the input variable for constructing the hyperplane. As $X$ varies, the hyperplane varies, and the diagnostic result changes accordingly, as in formula 3. This paper proposes a new framework for classifier-based fault diagnosis that can use a type-incomplete training set to construct a reasonable hyperplane and can gradually refine its diagnostic performance as more types of samples are added to the training set.

4.1. The Basic Idea of the Proposed Diagnosing Approach

Unlike traditional methods, the proposed approach does not construct all regions in one go. It constructs the regions step by step.

As shown in Figure 5, the SVDD-based approach first judges whether the testing sample belongs to “Known” types or “Unknown” types. Then, if the testing sample belongs to the “Known” types, the approach locates the region into which the testing sample falls.

Remark 2. “Known” types refer to types whose operating data samples have been acquired. “Unknown” types refer to types whose operating data samples are missing.

Figure 6 illustrates how the SVDD-based approach constructs the fault regions.

The gray area in Figure 6 is the “Known” region, that is, the space in which a diagnosis can be made. From (a) to (d), we see the following.

(1) When new types of samples are imported, the SVDD-based approach just adds the corresponding regions and does not reconstruct all regions. This means that once a fault region is constructed, it never changes, no matter how many more sample types are imported for training.

(2) More sample types lead to a larger proportion of gray area, which indicates that the diagnostic ability improves as more sample types are acquired.

Suppose a “Fault 2” sample is used as a testing sample. In (a) and (b), as the sample types available for training are few, the testing sample is classified into the “Unknown” region. We know that it belongs to neither “Normal” nor “Fault 1,” but we do not know which type it actually belongs to. However, in (c) and (d), as more sample types are imported for training and the fault regions are significantly refined, one can easily make an accurate diagnosis.
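
The step-by-step construction can be sketched as a small region library. This is a simplified illustration under stated assumptions: each region is a plain Euclidean ball (centroid plus largest distance) standing in for an SVDD hypersphere, a sample outside every ball is reported as "Unknown", and the class name `RegionLibrary`, the labels, and the coordinates are hypothetical.

```python
import math

class RegionLibrary:
    """Keeps one enclosing ball per known operating status."""

    def __init__(self):
        self.regions = {}  # label -> (centre, radius)

    def add_type(self, label, samples):
        """Add a region for a new sample type; existing regions are left untouched."""
        centre = tuple(sum(c) / len(samples) for c in zip(*samples))
        radius = max(math.dist(centre, s) for s in samples)
        self.regions[label] = (centre, radius)

    def diagnose(self, z):
        """Return the best-fitting known label, or "Unknown" if z is outside every ball."""
        margins = {}
        for label, (centre, radius) in self.regions.items():
            margin = radius - math.dist(centre, z)
            if margin >= 0:
                margins[label] = margin
        return max(margins, key=margins.get) if margins else "Unknown"

lib = RegionLibrary()
lib.add_type("Normal", [(0.0, 0.0), (0.4, 0.2), (-0.3, 0.1)])
lib.add_type("Fault 1", [(4.0, 0.0), (4.3, 0.2), (3.8, -0.3)])

z = (0.1, 4.1)                       # a "Fault 2" testing sample
before = lib.diagnose(z)             # "Fault 2" has not been learned yet
snapshot = dict(lib.regions)

lib.add_type("Fault 2", [(0.0, 4.0), (0.2, 4.3), (-0.3, 3.8)])
after = lib.diagnose(z)
print(before, after)
```

Adding the "Fault 2" samples only inserts a new entry, so the "Normal" and "Fault 1" regions stay exactly as they were, while the same testing sample moves from "Unknown" to a definite diagnosis.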

4.2. The Flowchart for the SVDD-Based Fault Diagnosis
4.2.1. The Definition of Evaluating Function

Consider a training set, $X$, which consists of a normal sample set, $X_0$, and $n$ types of faulty samples, $X_1, \ldots, X_n$. Using the SVDD approach, $n + 1$ hyperspheres, $S_i$, $i = 0, 1, \ldots, n$, can be obtained. $S_0$ denotes the hypersphere for the normal set, and the other hyperspheres represent the faulty sets. These hyperspheres represent the fault regions constructed from the existing samples.

Definition 3. is the evaluating function, which implies the likelihood that the testing sample belongs to the target set.

Here is the testing sample; and are the radius and centre of the hypersphere, respectively. The centre, , and the radius can be obtained by calculating the distance between the centre and any relevance vector, : Hence, the evaluating function can be rewritten as

A larger value of implies a higher likelihood that the testing sample belongs to the same type as the target samples in the hypersphere. And we have where implies that the testing sample has not fallen into the hypersphere.
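
The quantities that such an evaluating function compares can be written with kernels only. The expressions below assume the standard SVDD results, namely a centre $a = \sum_i \alpha_i \phi(x_i)$ in the feature space, a kernel $K(x, y) = \phi(x) \cdot \phi(y)$, and a vector $x_k$ lying on the boundary of the hypersphere; these symbols are assumptions of this sketch.

```latex
\begin{aligned}
\|\phi(z) - a\|^2 &= K(z, z) - 2\sum_i \alpha_i K(x_i, z)
                     + \sum_i\sum_j \alpha_i \alpha_j K(x_i, x_j), \\
R^2 &= K(x_k, x_k) - 2\sum_i \alpha_i K(x_i, x_k)
       + \sum_i\sum_j \alpha_i \alpha_j K(x_i, x_j).
\end{aligned}
```

One possible evaluating function with the sign behaviour described above is the difference $R^2 - \|\phi(z) - a\|^2$, which is negative exactly when the testing sample lies outside the hypersphere and grows as the sample approaches the centre. This particular form is an assumption made for illustration.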

4.2.2. The Flowchart of the Proposed Approach

For a training set consisting of incomplete types, the SVDD approach has been introduced into the diagnosis [18–21]. The proposed approach proceeds as follows.

Step 1. For the training set, , a hypersphere, , is constructed, which involves all existing samples while training. Given a testing sample, , an index is calculated:
If , the testing sample belongs to “Unknown” types and more sample types are required to isolate the fault type. Otherwise, the testing sample belongs to the “Known” types, and Steps 2–4 are applied for fault isolation.

Step 2. For the normal set, , the training set can be written as . For , the hypersphere, , is constructed by the SVDD approach.

Step 3. Given a testing sample, , its evaluation function value can be calculated for all hyperspheres, , . where , , is the Lagrange multiplier for , and is the number of samples for . We then get the decision value for diagnosis

When the testing sample has fallen into the “Known” region but its likelihood for all types is very low, it is hard to decide which region the testing sample should belong to. In this paper, such a sample is simply assigned to the “Normal” set in order to reduce false alarms. The relationship between the decision value and the fault type is shown in Table 1.

Step 4. According to the decision rules shown in Algorithm 1, the diagnostic decision can be calculated and the FI can be achieved. Here “decision = $i$” means that the testing sample belongs to the $i$th fault type.

Switch 
   Case 1
     if       
             
       else
             
       end
   Case −1
       Testing sample belongs to unknown types
end
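
The decision rules of Steps 1 to 4 can be sketched as a single function. The sketch rests on assumptions that go beyond the text: a negative evaluating-function value is taken to mean "outside the hypersphere", the type with the largest value is selected, and the function name `diagnose`, its arguments, and the label strings are hypothetical. The fallback to "Normal" when no individual type fits follows the rule stated in Step 3.

```python
def diagnose(global_value, type_values, normal_label="Normal"):
    """Return a diagnosis from evaluating-function values.

    global_value -- value of the testing sample for the hypersphere built
                    from all training samples (Step 1)
    type_values  -- dict mapping each known type to the value of the testing
                    sample for that type's hypersphere (Steps 2 and 3)
    """
    # Step 1: outside the hypersphere of all samples means an unknown type.
    if global_value < 0:
        return "Unknown"
    # Step 3: pick the known type with the largest evaluating value.
    best = max(type_values, key=type_values.get)
    # Inside the "Known" region but outside every per-type hypersphere:
    # assign to the normal set to reduce false alarms.
    if type_values[best] < 0:
        return normal_label
    # Step 4: the best-fitting known type is the diagnosis.
    return best

print(diagnose(-0.1, {"Normal": -0.5, "Fault 1": -0.2}))
print(diagnose(0.2, {"Normal": -0.3, "Fault 1": 0.4}))
print(diagnose(0.2, {"Normal": -0.3, "Fault 1": -0.1}))
```

Keeping the "Unknown" test separate from the per-type comparison is what lets the result tell the user that more sample types are needed, instead of forcing the sample into an existing region.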

5. Experimental Validation

In this section, a group of Gaussian distributed synthetic datasets is first used to validate the feasibility and effectiveness of the proposed approach. Then, a real-world dataset for transformer faults is used. The performance of the SVDD-based approach is compared with that of the popular SVM-based approach [22–24] and some other classic approaches such as the Least Squares Support Vector Machine (LS-SVM) [25], the learning vector quantization (LVQ) network [26], and random forests [27].

5.1. The Synthetic Datasets

The training set consisted of three Gaussian distributed synthetic datasets, each representing one operating status, as shown in Figure 7.

5.1.1. One Type Is Missing

The given system has three operating statuses, “Fault 1,” “Fault 2,” and “Normal,” each with its own sample set. The diagnostic performance of both the SVM-based and the SVDD-based approaches is first investigated in the condition with one sample type missing.

Suppose the sample set for the “Normal” status is missing. We constructed the fault regions using the SVM-based approach and the proposed approach. As shown in Figure 8(a), the fault regions made by the SVM can efficiently identify the two faulty types; however, there is no region for the missing type, and all samples belonging to it will be misclassified into one of the two fault regions.

In contrast to the SVM-based approach, the proposed approach constructs an updatable framework for fault diagnosis and explicitly takes the classification of unknown samples into consideration. As shown in Figure 8(b), where the fault regions for “Fault 1” and “Fault 2” are made, the SVDD-based approach also makes a region for the unknown sample type. Therefore, the samples belonging to the missing type will be classified into the “Unknown” region instead of the “Fault 1” region or the “Fault 2” region. This makes the diagnosis more reasonable.

One hundred samples were randomly selected as testing samples. We investigated the diagnostic accuracies of the two approaches for the condition of one sample type missing; the results are shown in Table 2.

As shown in Table 2, the diagnostic accuracy of the SVM-based approach drops from 100% to 64–71% when one of the sample types is missing, which implies that sample-type incompleteness greatly affects the diagnostic accuracy. For the SVDD-based approach, however, the accuracy varies only slightly when a sample type is missing, staying between 86% and 89%. This demonstrates that the diagnostic accuracy of the SVDD-based approach is far less sensitive to sample-type completeness and that the proposed approach is clearly superior to the classic SVM-based approach when dealing with incomplete sample types.

5.1.2. Fault Detection Using the SVDD-Based Approach

Suppose the system has acquired samples only for the “Normal” status; that is, the training set is composed of the normal sample set alone. Because the conventional classifier-based approaches require at least two types of samples, FD cannot be performed with them. The SVDD-based approach, however, handles this case. Using the proposed approach, the region for the normal sample set and the “Unknown” region can be obtained. The former is referred to as the “Normal” region, and the “Unknown” region is thus referred to as the “Faulty” region. By locating which region the testing sample has fallen into, FD is achieved.

As shown in Figure 9(c), the purple area is the “Normal” region and the blue area is the “Faulty” region. All samples belonging to the normal set fall into the “Normal” region, and samples from the two fault sets fall into the “Faulty” region. This demonstrates the feasibility of the proposed approach in FD.
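
Fault detection from normal samples alone can be illustrated with a short experiment. Everything specific here is an assumption: the Gaussian means and standard deviation, the sample counts, the random seed, and the use of a plain Euclidean ball in place of an SVDD hypersphere are hypothetical choices made only to keep the sketch self-contained.

```python
import math
import random

random.seed(0)  # fixed seed so the hypothetical data are reproducible

def gaussian_cloud(mean, std, n):
    """Draw n two-dimensional samples from an isotropic Gaussian."""
    return [(random.gauss(mean[0], std), random.gauss(mean[1], std))
            for _ in range(n)]

# Training uses "Normal" samples only.
normal_train = gaussian_cloud((0.0, 0.0), 0.5, 200)
centre = tuple(sum(c) / len(normal_train) for c in zip(*normal_train))
radius = max(math.dist(centre, s) for s in normal_train)

def detect(z):
    """Inside the ball is "Normal"; everything else is treated as "Faulty"."""
    return "Normal" if math.dist(centre, z) <= radius else "Faulty"

# Testing samples from a fault cluster that was never seen during training.
fault_test = gaussian_cloud((6.0, 6.0), 0.5, 50)
detected = sum(detect(z) == "Faulty" for z in fault_test)
print(detected, "of", len(fault_test), "fault samples flagged")
```

No faulty sample is needed to build the detector, which is the property that a two-class classifier cannot offer in this situation.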

5.2. The Real-World Dataset

The real-world dataset for transformer fault diagnosis is obtained from the literature [28]. This dataset records the gas content dissolved in the insulating oil of the transformers, which is closely related to the operating statuses. Here, four operating statuses are investigated, that is, the “Normal” status, the “High-energy discharge” fault, the “Low-energy discharge” fault, and the “Thermal heating” fault. The training set and testing set are prepared as shown in Table 3.

The radial basis function (RBF) kernel is selected for training the SVM and SVDD classifiers, and the parameter, , is set to 0.5. In all of the experiments, the testing set is selected as and the training set is selected, respectively; for example, when samples for the “Low-energy discharge” fault are missing, the training set should be .

In Figure 10, we investigated the diagnostic performances of the two approaches for the condition of each kind of sample-type incompleteness. The training sets selected are , , , and for cases when one sample type is missing, and , , , , and , for cases when two sample types are missing.

As shown in Figure 10, in all cases, the SVDD-based approach yields a better accuracy than the SVM-based approach. Moreover, for the SVM-based approach the accuracy is significantly decreased when more sample types are missing. The average accuracy for one sample type missing is 69% and when two sample types are missing the average accuracy falls to 47.33%. Conversely, the SVDD-based approach varies slightly while different portions of sample types are missing. The average accuracy for the proposed approach varies from 76% to 88%. This implies that the SVDD-based approach can efficiently improve the classifier-based diagnostic performance while the sample type is incomplete.

For a further comparison, classical methods such as the Least Squares Support Vector Machine (LS-SVM) [25], the learning vector quantization (LVQ) network [26], and random forests [27] are also applied to the problem. For LS-SVM, the parameter , and ; for LVQ network, the size of the hidden layer is set to 10, and the LVQ learning rate is set ; for random forests, the parameter “mtry” is set , and the parameter “ntree” is set to 500.

Training sets with different degrees of completeness are employed as in the above experiment. The diagnostic performances of SVDD and these methods are investigated and shown in Table 4.

For most missing types, the traditional methods make incorrect classifications, which lead to false alarms. The SVDD-based approach, in contrast, assigns these samples to the “Unknown” types. This result informs the users that more sample types are required for diagnosing the tested sample, so false alarms are greatly reduced.

6. Conclusion and Discussion

The performance of the data-driven approach depends on the quality of the training set. However, a high-quality training set is not easily acquired. Therefore, the remaining problem in both theoretical research and practical applications is how to make a reasonable fault diagnosis with a low-quality training set.

In this paper, we have discussed the problem of sample-type incompleteness; that is, when some types of training samples are missing, the diagnostic accuracy commonly decreases. An SVDD-based diagnosis approach has been proposed to address this issue. This approach provides a new framework for diagnosis with incomplete sample types, which uses a one-class classifier, that is, SVDD, instead of conventional binary or multiclass classifiers and attempts to improve the diagnostic accuracy in this situation.

The effectiveness of the proposed approach has been validated by comparative experiments on both synthetic and practical training sets. The proposed approach has shown a significant superiority to the popular SVM-based fault diagnosis approach. This demonstrates that the SVDD-based approach is insensitive to the sample-type completeness and can efficiently increase the diagnostic accuracy when dealing with incomplete training sets.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The work described in this paper is supported by grants from the Doctoral Fund of Ministry of Education of China (20113218110011 and 20113218120010), the National Science Foundation of China (61171191, 61203020, and 61401215), the Science Foundation of Jiangsu province (BK20140953), and the Science Foundation of Jiangsu High Schools (13KJB510013).