With the rising demand for integrated and autonomous systems in the field of engineering, efficient frameworks for the instant detection of performance anomalies are imperative for improved productivity and cost-effectiveness. This study proposes a systematic predictive maintenance framework based on the hybrid multisensor fusion technique of fuzzy rough set feature selection and a stacked ensemble for the efficient classification of fault conditions characterised by uncertainties. First, a feature vector of time-domain features was extracted from 17 multisensor signals. Then, a comparative study of six different Fuzzy Rough Set Feature Selection (FRFS) methods was employed to select combinations of optimal feature subsets for the various fault classification tasks. The determined optimal feature subsets then served as inputs for training the stacked ensemble (ESB(STK)). In the ESB(STK), Support Vector Machine (SVM), Multilayer Perceptron (MLP), k-Nearest Neighbour (k-NN), C4.5 Decision Tree (C4.5 DT), Logistic Regression (LR), and Linear Discriminant Analysis (LDA) served as the base classifiers while the LR was selected as the metaclassifier. The proposed hybrid framework (FRFS-ESB(STK)) improved the classification accuracy with the selected combinations of optimal feature subset sizes while reducing the computational cost, overfitting, training runtime, and uncertainty in modelling. Overall analyses showed that the FRFS-ESB(STK) proved to be generalisable and versatile in the classification of all conditions of four monitored hydraulic components (i.e., cooler, valve, accumulator, and internal pump leakage) when compared with the six standalone base classifiers and three existing ensemble classifiers (Stochastic Gradient Boosting (SGB), AdaBoost (ADB), and Bagging (BAG)).
The proposed FRFS-ESB(STK) showed average improvements of 11.28% and 0.88% in test accuracy when classifying accumulator and pump conditions, respectively, while 100% classification rates were obtained for both the cooler and the valve.

1. Introduction

In the field of engineering, one of the biggest challenges faced with the rising demand for integrated and autonomous systems is the imperfect aspects of the raw sensor data recorded and how they are being processed for predictive-based maintenance frameworks. These imperfections, ranging from uncertainties caused by inherent noises in sensor measurements to inconsistency in data recorded, could be attributed to diverse operational conditions under which these complicated systems embedded with numerous and varying sensor outputs operate [1]. As a result, frameworks developed from such imperfections possess tractability challenges and have the tendency of producing misleading information about the state of the system, which could subsequently lead to wrong decisions [2].

In the literature, significant interest has been dedicated to the development of methodologies capable of addressing such imperfections in data. Notable among such methodologies is the Multisensor Data Fusion (MDF) concept of integrating data from multiple sources (taking into account their diversities) into a representation for improving human and automated decision-making frameworks [3–6]. Resounding among the varieties of MDF techniques designed in the literature are the Fuzzy Set Theory (FST) [7] and the Rough Set Theory (RST) [8]. The FST efficiently deals with imprecision by introducing the concept of partial set membership which facilitates reasoning in an imprecise manner (rather than crisp) [7]. This property makes the FST particularly useful in fusing data from both homogeneous sources and highly conflicting or ill-defined data [9]. However, the FST requires prior knowledge of membership functions for various fuzzy sets. This is a significant drawback in FST as any decision made by the user may possibly be faulty or based on their subjective judgement.

These limitations resulted in the proposal of the RST which addresses the imprecision, vagueness, and uncertainty in data analysis by exploiting its internal structure (granularity) [8]. By utilising the granularity concept, the hidden knowledge in the pool of data could be identified and expressed as decision rules. In addition, RST has the ability to identify the most informative subset within a pool of attributes (feature reduction or selection) and, more importantly, does not require any prior knowledge about the data distribution or membership functions [9–12]. These concepts of RST, especially feature reduction, are perhaps the major characteristics that make it unique from the existing traditional fusion techniques. That is, in addressing the imperfections in a given dataset, rough set-based feature selection focuses on removing redundant or noninformative features. This leads to a desirable reduction in computational cost, overfitting, training runtime, and uncertainty in modelling while improving accuracy [13]. Hence, the usefulness and versatility of RST have been demonstrated successfully in various disciplines including Predictive Maintenance (PdM) [14–17].

However, as frequently observed, data recorded from monitoring the condition of systems for the purpose of developing PdM frameworks are generally real-valued and are characterised by noise. As a result, the traditional RST, originally developed for crisp or discrete data, is unable to process such data [8, 18, 19]. Hence, there is a need for fusion techniques that are capable of modelling data imperfections, possess the unique characteristic of feature reduction for both crisp and real-valued datasets, and, above all, do not require prior knowledge or input to be supplied by the user. These deficiencies could be addressed by coupling and exploiting the unique strengths of the fuzzy and rough set theories [20]. Thus, the hybrid Fuzzy Rough Set Theory (FRST) technique combines the distinct concepts of fuzziness and indiscernibility for addressing imperfections in both discrete and real-valued data without the need for user-supplied input. This ensures that the most informative knowledge from the pool of sensor data (i.e., the selected features) is fed as inputs in the development of an intelligent and autonomous predictive maintenance-based framework.

In the context of developing intelligent frameworks for PdM purposes, the utilisation of supervised machine learning algorithms such as the Artificial Neural Networks (ANNs) [21–24], Support Vector Machine (SVM) [25–28], Linear Discriminant Analysis (LDA) [29–31], and the Bayes classifiers [32–35] has been well studied. Although these algorithms have yielded somewhat satisfactory results, they are prone to entrapment in local optima when their required hyperparameters are not appropriately fine-tuned. This subsequently affects their ability to discriminate between various classes or label outputs. Also, considerable resources such as time are invested in the selection of a specific learner for a given task. As a remedy, several researchers have employed ensemble learning, which combines different sets of hypotheses from multiple learners with the aim of improving the performance of systems [36–39]. It is important to note that the superiority of ensembles as a suitable alternative over standalone learners has been well established [36, 40, 41]. For these reasons, the stacked ensemble learning which considers heterogeneous base learners is proposed in this study for the development of the intelligent PdM framework.

1.1. Contributions of the Study

The contributions of this study are to (a) propose a systematic PdM framework for the efficient classification of fault conditions characterised by uncertainties based on the hybrid multisensor Fuzzy Rough Set Feature Selection (FRFS) and the stacked ensemble and (b) evaluate and compare the performance of the proposed framework with some well-established standalone classifiers and ensembles.

The efficiency of the proposed PdM framework was validated on a benchmark dataset of multisensors obtained from the University of California, Irvine (UCI) Machine Learning (ML) repository [29]. The proposed hybrid framework advances the field of PdM by effectively and efficiently fusing recorded data from multiple and varying sensors, thus having enormous potential for enhancing the reliability of autonomous systems during an unexpected sensor failure, enhancing robustness and confidence in estimations, and expanding the sensitivity and specificity ability of systems to efficiently discriminate between class outputs.

The remaining sections are organised as follows: Section 2 briefly describes the experimental hydraulic dataset with multisensors. Section 3 discusses the methodology of the proposed hybrid technique, feature extraction, the FRFS, and stacked ensemble. The results and discussion are presented in Section 4. Section 5 concludes the findings.

2. Hydraulic System Dataset

The hydraulic system dataset utilised in this paper can be accessed publicly from the ML repository at UCI. The data comprised 2205 instances and 43680 features which were collected from 17 process sensors with varied sample rates. The 17 process sensors were made up of six pressure sensors (PS), four temperature sensors (TS), two volume flow sensors (FS), and five separate sensors for measuring motor power (EPS), vibration (VS), cooling efficiency (CE), cooling power (CP), and system efficiency (SE). With this setup, several typical system failures such as pressure leakage in the accumulator, reduced cooling efficiency, internal pump leakage, or delayed valve switching are simulated to depict variations in fault conditions of four major hydraulic components, namely, the accumulator, cooler, internal pump leakage, and valve. The details of the four major hydraulic components are shown in Table 1. As seen in Table 1, each monitored condition has multiple classes representing the various degradation states of the hydraulic system. Also, a load cycle with a 60-second duration is repeated 2205 times with the distribution of instances as indicated by the cases.

Figures 1–3 show the plots of the 17 process sensors during some selected load cycles. Each subplot represents a specific sensor (abbreviation) followed by a brief description of the degradation states of the four monitored hydraulic components (cooler, valve, pump, and accumulator). For instance, the first signal in Figure 1, Cooler Efficiency (CE), is shown for a load cycle with Close to Total Failure (CTFc) in the cooler, Optimal Switching Behaviour (OSB) in the valve, No Leakage (NL) in the pump, and Optimal Pressure (OP) in the accumulator. A similar description extends to the remaining signals.

3. Methods

The proposed hybrid technique is illustrated in Figure 4 as a schematic starting from feature extraction through to the validation phase. The subsequent sections present the details of the various phases.

3.1. Feature Extraction and Reduction

One major challenge that needs to be addressed in order to develop an efficient PdM framework is the relatively high dimension of the features (raw sensor data of 43680 features from the 17 process sensors). This is because classical ML algorithms experience challenges with tractability, scalability, and high time complexity, which negatively impact classification performance [42, 43]. That is, these classical ML algorithms are unable to detect the features that contain the most relevant fault information needed for efficient fault classification [44].

Based on the aforementioned reasons, this paper adopts a well-established strategy of extracting statistics-based features from the 43680 raw sensor features that describe the hydraulic dataset’s characteristic attributes. Hence, different statistical time-domain features were extracted based on the variance, median, mean, standard deviation, kurtosis, skewness, and position of maximum values from various time-interval partitions. Details of the time-domain feature extraction can be found in the prior works of Buabeng et al. [45, 46]. The feature extraction drastically reduced the dimension of the hydraulic dataset to 1806 features, thus to some extent reducing the computational cost.
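As a rough illustration of this step, the sketch below extracts the named statistics from equal-length partitions of a single simulated sensor cycle. The partition count and the simulated signal are assumptions for illustration only; the exact intervals and feature set follow Buabeng et al. [45, 46].

```python
import numpy as np

def time_domain_features(signal, n_parts=6):
    """Extract the statistical time-domain features named in the text
    (variance, median, mean, standard deviation, kurtosis, skewness,
    position of the maximum) from equal-length partitions of one cycle."""
    feats = []
    for part in np.array_split(np.asarray(signal, dtype=float), n_parts):
        mu, sd = part.mean(), part.std()
        z = (part - mu) / sd if sd > 0 else np.zeros_like(part)
        feats += [
            part.var(),             # variance
            np.median(part),        # median
            mu,                     # mean
            sd,                     # standard deviation
            (z ** 4).mean() - 3.0,  # excess kurtosis
            (z ** 3).mean(),        # skewness
            int(np.argmax(part)),   # position of maximum value
        ]
    return np.array(feats)

# One simulated 60 s cycle sampled at 100 Hz -> 7 features x 6 partitions
rng = np.random.default_rng(0)
cycle = np.sin(np.linspace(0, 8 * np.pi, 6000)) + 0.1 * rng.standard_normal(6000)
print(time_domain_features(cycle).shape)  # (42,)
```

Applied per sensor and per load cycle, such partitioned statistics are what collapse the 43680 raw values into a compact feature vector.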

However, due to the influence of insignificant and redundant features, training a classifier with all 1806 features can negatively impact the classifier’s ability to effectively and efficiently discriminate between class outputs [47]. This would in effect increase the computational cost, overfitting, training runtime, and the model’s uncertainty. For these reasons, the most optimal feature subsets that capture relevant fault characteristics were identified using the FRFS. The FRFS technique was adopted as opposed to the correlation-based feature selection technique used in prior works such as Helwig et al. [29]. The advantage of the FRFS is that it requires neither prior knowledge nor the user to subjectively specify the number of features to select, as a wrong judgement may influence the efficacy of the model being developed.

3.2. Handling Data Imperfection with Multisensor Data Fusion (MDF)

The MDF technique is a multidisciplinary research discipline utilising ideas from signal processing, information theory, statistical estimation and inference, and artificial intelligence [2]. Its core concept of integrating data from multiple sources (taking into account their diversities) into a representation for improving human and automated decision-making frameworks was first introduced in the 1960s but was first implemented in the fields of robotics and the military in the 1970s [4, 6, 48]. Today, MDF techniques are widely used for preprocessing and handling data imperfections and are known for their enormous potential in numerous fields of autonomous systems such as pattern recognition, system monitoring, fault identification, intrusion and malware detection, and health care, among others [49–52]. This has been realised through the dynamic integration of the comprehensive and varied pool of knowledge arriving from the multiple sensors.

In dealing with various aspects of data imperfection, varieties of MDF techniques have been designed. Traditionally, the most dominant are the Probabilistic Fusion (PF) [53], FST [7], Evidence Belief Reasoning (EBR) [54], and the RST [8]. However, these varieties differ in their implementation, with varying characteristics, strengths, and limitations. For instance, the PF, which is based on probability distribution or density functions, addresses to some extent the uncertainties in data but suffers from limitations such as complexity, inconsistency, and imprecision and requires prior knowledge of the probability distribution [7]. For these reasons, the FST, which addresses imprecision, and the EBR, which addresses both uncertainty and imprecision, were proposed as alternatives [9].

The EBR, in addressing uncertainty and imprecision, assigns belief and plausibility to possible measurement hypotheses in addition to an appropriate combination rule based on Dempster-Shafer theory in the fusion process [54]. However, the EBR is computationally expensive and has the tendency to generate counterintuitive results when fusing conflicting (heterogeneous) data [9]. The FST efficiently deals with imprecision by introducing the concept of partial set membership which facilitates reasoning in an imprecise manner (rather than crisp) [7]. This property makes the FST particularly useful in fusing data from both homogeneous sources (where conjunctive fuzzy fusion rules are utilised) as well as highly conflicting data (where disjunctive fuzzy fusion rules are utilised) [9]. Thus, while both the PF and EBR are well suited for handling the uncertainty of data with a well-defined class of objects, the FST is suited for an ill-defined class. However, just like the PF, the FST requires prior knowledge of membership functions for various fuzzy sets. This is a significant drawback in FST as any decision made by the user may possibly be faulty or based on their subjective judgement.

These limitations resulted in the proposal of the RST, which has been successful over the years [14, 16, 17, 55]. The RST addresses the imprecision, vagueness, and uncertainty in data analysis by exploiting its internal structure (granularity) [8]. The success of RST is primarily due to its capacity to identify the most informative features, its feature reduction ability, and, moreover, the fact that it does not require any user-supplied information [9–12]. However, the traditional RST was proposed for crisp or discrete data and was thus limited in its application to real-valued data [8, 18, 19]. This led to the development of a synergistic approach combining the FST as a complementary technique to RST, since the two techniques could jointly address imprecision and inconsistency. Hence, the hybrid FRST provides a comprehensive framework by combining the distinct concepts of fuzziness and indiscernibility for addressing imperfections in both discrete and real-valued data [20]. In addition, FRST requires no user-supplied input and possesses feature selection capabilities.

From the discussion, the concept of the FRST can be considered as a generalisation of the rough set constructed based on two theories: the rough set and fuzzy set theories. The theories are briefly described as follows.

3.2.1. Fuzzy Set Theory (FST)

The FST was proposed by Zadeh [7] and is an extension of the classical set theory for dealing with vagueness. Since its introduction, the theory has been developed and extended by other researchers with application in diverse disciplines of engineering and science [56–58], of which PdM is no exception [59–61].

The basic concept behind FST revolves around the membership status of an item [62]. That is, whether an item, say $x$, is a member or not a member of a set, say $A$: $x \in A$ or $x \notin A$. As opposed to the classical set theory, where an item must either belong to a set or not, an item in FST can belong to a set to a degree $\mu_A(x) \in [0, 1]$. Thus, the fuzzy membership function is expressed as

$\mu_A \colon X \to [0, 1]$, (1)

where $x \in X$ denotes the item and $A$ represents the set.

The fuzzy membership function has the following properties:

$\mu_{\bar{A}}(x) = 1 - \mu_A(x)$, (2)
$\mu_{A \cup B}(x) = \max\{\mu_A(x), \mu_B(x)\}$, (3)
$\mu_{A \cap B}(x) = \min\{\mu_A(x), \mu_B(x)\}$. (4)

Hence, this suggests that the membership of a set determines the degree to which an item belongs to the union and intersection of sets.
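These max/min set operations can be demonstrated on a small universe; the membership grades below are hypothetical, chosen only to illustrate the computation:

```python
# Zadeh's fuzzy set operations on a three-object universe.
A = {"x1": 0.2, "x2": 0.7, "x3": 1.0}
B = {"x1": 0.5, "x2": 0.4, "x3": 0.0}

union        = {x: max(A[x], B[x]) for x in A}  # membership of A union B
intersection = {x: min(A[x], B[x]) for x in A}  # membership of A intersect B
complement_A = {x: 1.0 - A[x] for x in A}       # membership of not-A

print(union)         # {'x1': 0.5, 'x2': 0.7, 'x3': 1.0}
print(intersection)  # {'x1': 0.2, 'x2': 0.4, 'x3': 0.0}
```

Note that, unlike crisp sets, an object such as x2 belongs partially (0.7) to A and partially (0.4) to B at the same time.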

3.2.2. Rough Set Theory (RST)

In the literature, the success in the implementation of RST [8] in addressing data imprecision, vagueness, and uncertainty has well been established [17, 55]. By exploiting the internal structure of the dataset, the basic RST is described based on the concept of indiscernibility.

Suppose that an information system is expressed as

$IS = (U, A)$, (5)

where $U$ and $A$ represent nonempty finite sets of objects (a universe of discourse covering all instances) and features, respectively, such that $a \colon U \to V_a$ for every $a \in A$. $V_a$ represents the set of possible values of the feature $a$. Given any $P \subseteq A$, there exists an associated equivalence relation $IND(P)$ as shown in

$IND(P) = \{(x, y) \in U^2 \mid \forall a \in P,\ a(x) = a(y)\}$. (6)

From (6), the partition of $U$ generated by $IND(P)$, denoted by $U/IND(P)$, is estimated using

$U/IND(P) = \otimes\{U/IND(\{a\}) \mid a \in P\}$, (7)

where the operator $\otimes$ is defined as shown in (8) using two imaginary sets, say $A$ and $B$:

$A \otimes B = \{X \cap Y \mid X \in A,\ Y \in B,\ X \cap Y \neq \emptyset\}$. (8)

When $(x, y) \in IND(P)$, it implies that $x$ and $y$ are indiscernible based on the features from $P$. Denoting the equivalence classes of the $P$-indiscernibility relation by $[x]_P$ and assuming $X$ as a subset of $U$, $X \subseteq U$, approximating $X$ using the information contained in $P$ is achieved by constructing the $P$-upper and $P$-lower approximations of $X$ as shown in (9) and (10), respectively:

$\bar{P}X = \{x \mid [x]_P \cap X \neq \emptyset\}$, (9)
$\underline{P}X = \{x \mid [x]_P \subseteq X\}$, (10)

where $\bar{P}X$ and $\underline{P}X$, respectively, are the upper and lower approximations of $X$ with respect to $P$. That is, $\bar{P}X$ comprises the objects possibly classified in $X$, while the objects certainly classified in $X$ are listed in $\underline{P}X$. The ordered pair $(\underline{P}X, \bar{P}X)$ is the rough set. Assuming $P$ and $Q$ are subsets of features from $A$, each inducing an equivalence relation over $U$, then the different regions (positive, negative, and boundary regions) are expressed as

$POS_P(Q) = \bigcup_{X \in U/Q} \underline{P}X$, (11)
$NEG_P(Q) = U - \bigcup_{X \in U/Q} \bar{P}X$, (12)
$BND_P(Q) = \bigcup_{X \in U/Q} \bar{P}X - \bigcup_{X \in U/Q} \underline{P}X$. (13)

The positive region, $POS_P(Q)$, comprises all objects certainly classified into the classes of $U/Q$; the boundary region, $BND_P(Q)$, comprises the objects possibly, but not definitely, classified; and the negative region, $NEG_P(Q)$, contains the objects not classified into the classes of $U/Q$.

In data analysis, discovering the dependencies within features is of paramount importance. Thus, a set of features $Q$ is said to be totally dependent on a set of features $P$ if the values of the features in $Q$ can be determined uniquely by the values of the features from $P$. In RST, the dependency of $Q$ on $P$ in a degree $k$ $(0 \le k \le 1)$, denoted by $P \Rightarrow_k Q$, is expressed in

$k = \gamma_P(Q) = \dfrac{|POS_P(Q)|}{|U|}$. (14)

Such that $Q$ totally depends on $P$ if $k = 1$, $Q$ partially depends on $P$ if $0 < k < 1$, and $Q$ does not depend on $P$ if $k = 0$.
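On a small hypothetical decision table, the indiscernibility partition, lower approximation, and dependency degree described above can be computed directly; the attribute names and values here are made up for illustration:

```python
# Toy crisp decision table: five machine states, two condition
# attributes (temp, vib) and one decision attribute (fault).
U = {
    1: {"temp": "high", "vib": "low",  "fault": "yes"},
    2: {"temp": "high", "vib": "low",  "fault": "yes"},
    3: {"temp": "low",  "vib": "low",  "fault": "no"},
    4: {"temp": "low",  "vib": "high", "fault": "yes"},
    5: {"temp": "low",  "vib": "high", "fault": "no"},
}

def partition(table, attrs):
    """U/IND(P): group objects that agree on every attribute in attrs."""
    blocks = {}
    for x, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(x)
    return list(blocks.values())

def lower_approx(table, attrs, X):
    """P-lower approximation of X: union of blocks wholly contained in X."""
    return {x for block in partition(table, attrs) if block <= X for x in block}

def dependency(table, cond, dec):
    """Dependency degree: |positive region| / |U|."""
    pos = set()
    for X in partition(table, [dec]):
        pos |= lower_approx(table, cond, X)
    return len(pos) / len(table)

print(dependency(U, ["temp", "vib"], "fault"))  # 0.6
print(dependency(U, ["temp"], "fault"))         # 0.4
</n```

Objects 4 and 5 agree on both condition attributes yet differ on the decision, so they fall in the boundary region and the dependency stays below 1; dropping "vib" loses further discriminating power.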

The RST process described here was proposed for crisp or discrete data and was thus limited in its applicability to real-valued data, which is most common [8, 18, 19]. A possible alternative around this limitation would be to create new crisp data out of the real-valued data in advance through discretisation. However, this procedure is still inadequate, as the degree of membership or similarity between feature values is not exploited [63, 64]. Nevertheless, the hybrid FRST addresses the aforementioned limitation by combining the concepts of fuzziness and indiscernibility to address imperfections in both discrete and real-valued data.

3.2.3. Fuzzy Rough Set Theory (FRST)

The FRST can be thought of as a generalisation of the RST where the approximations from fuzzy sets, the lower and upper fuzzy approximations, are derived in a crisp rough set space. That is, fuzziness is integrated into rough sets by defining the lower and upper approximations of a fuzzy set $X$ when $X$ becomes rough as a result of the fuzzy equivalence relation.

Suppose that a subset of features $P \subseteq A$ induces a fuzzy equivalence relation over $U$ with partition $U/P$; the equivalence classes can be expressed as fuzzy sets if the class to which $x$ belongs is ambiguous for all attributes. Thus, the fuzzy $P$-lower and $P$-upper approximations of $X$ are expressed as

$\mu_{\underline{P}X}(F_i) = \inf_{x} \max\{1 - \mu_{F_i}(x), \mu_X(x)\} \quad \forall i$, (15)
$\mu_{\bar{P}X}(F_i) = \sup_{x} \min\{\mu_{F_i}(x), \mu_X(x)\} \quad \forall i$, (16)

where $X$ is the fuzzy concept to be approximated, with $F_i$ being a fuzzy equivalence class belonging to $U/P$. The ordered pair $(\underline{P}X, \bar{P}X)$ is the fuzzy rough set. These definitions deviate a bit from the lower and upper approximations under the crisp rough set due to the inability to explicitly access the membership of individual objects to the approximations. As a result, the fuzzy lower and upper approximations are redefined by employing the concepts of $\inf$ and $\sup$ as shown in

$\mu_{\underline{P}X}(x) = \sup_{F \in U/P} \min\left(\mu_F(x),\ \inf_{y \in U} \max\{1 - \mu_F(y), \mu_X(y)\}\right)$, (17)
$\mu_{\bar{P}X}(x) = \sup_{F \in U/P} \min\left(\mu_F(x),\ \sup_{y \in U} \min\{\mu_F(y), \mu_X(y)\}\right)$. (18)

It can be observed from (17) and (18) that every $F \in U/P$ is taken into account, but only for those instances $x$ where the corresponding $\mu_F(x)$ is nonzero. A detailed discussion on the usage of the $\inf$ and $\sup$ operators can be found in Radzikowska and Kerre [65], where a comparative study of fuzzy rough sets represented by specific implicators and $t$-norms has been presented.
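The inf/sup computations of (17) and (18) reduce to element-wise max/min over membership vectors. A minimal sketch, with made-up memberships of four objects in two fuzzy equivalence classes and a fuzzy concept X:

```python
import numpy as np

# Hypothetical memberships (illustrative values only).
F1 = np.array([1.0, 0.8, 0.2, 0.0])  # fuzzy equivalence class 1
F2 = np.array([0.0, 0.2, 0.8, 1.0])  # fuzzy equivalence class 2
X  = np.array([0.9, 0.7, 0.3, 0.1])  # fuzzy concept to approximate

def fuzzy_lower(F, X):
    """Per-class lower approximation: inf_x max(1 - mu_F(x), mu_X(x))."""
    return float(np.min(np.maximum(1.0 - F, X)))

def fuzzy_upper(F, X):
    """Per-class upper approximation: sup_x min(mu_F(x), mu_X(x))."""
    return float(np.max(np.minimum(F, X)))

def lower_membership(i, classes, X):
    """Per-object form as in (17): sup over classes F of
    min(mu_F(x), inf_y max(1 - mu_F(y), mu_X(y)))."""
    return max(min(F[i], fuzzy_lower(F, X)) for F in classes)

print(fuzzy_lower(F1, X), fuzzy_upper(F1, X))  # 0.7 0.9
print(lower_membership(0, [F1, F2], X))        # 0.7
```

As expected, the lower approximation never exceeds the upper one for the same class and concept.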

3.2.4. Fuzzy Rough Set Feature Selection (FRFS)

The feature selection or reduction ability of RST is perhaps a significant factor owing to its successful application in diverse disciplines. This unique ability can be exploited in the fuzzy rough set via the concept of the fuzzy lower approximation for reducing datasets of real-valued features. Referring to the extension principle of Zadeh [66], the membership of an object $x$ belonging to the fuzzy positive region is expressed as

$\mu_{POS_P(Q)}(x) = \sup_{X \in U/Q} \mu_{\underline{P}X}(x)$. (19)

From (19), it can be deduced that the object fails to belong to the positive region if the equivalence class it belongs to is not a member of the positive region. Recalling (14), the fuzzy rough dependency degree function is expressed from the definition of the positive region:

$\gamma'_P(Q) = \dfrac{\sum_{x \in U} \mu_{POS_P(Q)}(x)}{|U|}$. (20)

However, for the FRFS to be useful in practice, it should be capable of handling high-dimensional datasets by estimating the dependencies of various feature subsets against the original dataset. This is relevant as objects may belong to several equivalence classes. For instance, in the crisp case in (7), $U/P$ consists of groups of objects that are indiscernible based on the features from $P$. However, in the fuzzy case, the Cartesian product of the partitions $U/IND(\{a\})$, $a \in P$, is considered in estimating $U/P$, where each set in $U/P$ is a fuzzy equivalence class. Hence, the extent to which an object belongs to such an equivalence class $F = F_1 \cap F_2 \cap \cdots \cap F_n$ could be estimated using the combination of the constituent fuzzy equivalence classes, as shown in

$\mu_F(x) = \min(\mu_{F_1}(x), \mu_{F_2}(x), \ldots, \mu_{F_n}(x))$. (21)

Although the usefulness of FRFS in addressing uncertainty in data and also as a dimension reduction technique is widely known in the literature, the technique possesses some deficiencies [12, 67, 68]. First, the complexity of estimating the Cartesian product of the equivalence classes renders the technique highly prohibitive and computationally expensive, especially when dealing with high-dimensional datasets. Moreover, the Cartesian product of the fuzzy equivalence classes may not result in a family of fuzzy equivalence classes [12, 69]. Also, the scenario may occur of obtaining a lower approximation that is not a subset of the upper approximation for some instances [69]. This outcome is undesirable since it contradicts the theoretical requirement that the lower approximation be contained in the upper approximation, as it implies less certainty in the lower than in the upper approximation. Moreover, several research works have established that the classical FRS model is sensitive to misclassification (errors or missing values) and perturbation (noisy information), which are noted to be the primary sources of uncertainty in real-life applications [70, 71]. This limits the applicability of fuzzy rough sets in practice.

For these reasons, this study employs a comparative study of six different variants of the lower and upper approximations that define the extent to which a set of elements can be classified into a certain class as strongly or weakly. The variants considered are the Fuzzy Lower Approximation (FLA) and Fuzzy Boundary Region (FBR) proposed by Jensen and Shen [12], Vaguely Quantified Rough Sets (VQRSs) [72], the Ordered Weighted Average (OWA) [73], Fuzzy Variable Precision Rough Sets (FVPRSs) [70], and the β-Precision Fuzzy Rough Sets (βPFRSs) [74]. These variants were considered as they have been uniquely designed to address some limitations of the classical FRS model through enhanced estimation capabilities for the lower and upper approximations as well as their degree of dependency parameters for selecting relevant features.

(1) Fuzzy Lower Approximation (FLA)-Based Feature Selection. Unlike the previously discussed FRFS, which employs a fuzzy partitioning of the input space for determining the fuzzy equivalence classes (refer to Section 3.2.4), the FLA proposed by Jensen and Shen [12] utilises a $T$-transitive fuzzy similarity relation found in Radzikowska and Kerre [65] as an alternative for estimating the fuzzy lower and upper approximations:

$\mu_{\underline{R_P}X}(x) = \inf_{y \in U} I(\mu_{R_P}(x, y), \mu_X(y))$, (22)
$\mu_{\overline{R_P}X}(x) = \sup_{y \in U} T(\mu_{R_P}(x, y), \mu_X(y))$, (23)

where $R_P$ is the fuzzy similarity relation induced by a subset of features $P$, and $I$ and $T$ are the fuzzy implicator and $t$-norm, respectively. Given a feature, say $a \in P$, the degree to which objects $x$ and $y$ are similar with respect to $P$ is expressed as

$\mu_{R_P}(x, y) = T_{a \in P}\{\mu_{R_a}(x, y)\}$, (24)

where $\mu_{R_a}(x, y)$ is the degree to which $x$ and $y$ are similar for feature $a$.

Based on (24), several fuzzy similarity (tolerance) relations such as (25)–(27) could be defined:

$\mu_{R_a}(x, y) = 1 - \dfrac{|a(x) - a(y)|}{|a_{\max} - a_{\min}|}$, (25)
$\mu_{R_a}(x, y) = \exp\left(-\dfrac{(a(x) - a(y))^2}{2\sigma_a^2}\right)$, (26)
$\mu_{R_a}(x, y) = \max\left(\min\left(\dfrac{a(y) - a(x) + \sigma_a}{\sigma_a},\ \dfrac{a(x) - a(y) + \sigma_a}{\sigma_a}\right), 0\right)$, (27)

where $\sigma_a^2$ is the variance of feature $a$. Similar to the FRFS (Section 3.2.4), the fuzzy positive region and the dependency degree functions are estimated using the following:

$\mu_{POS_{R_P}(Q)}(x) = \sup_{X \in U/Q} \mu_{\underline{R_P}X}(x)$, (28)
$\gamma'_P(Q) = \dfrac{\sum_{x \in U} \mu_{POS_{R_P}(Q)}(x)}{|U|}$. (29)
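For concreteness, a small sketch of the first similarity relation in (25)–(27), combined across features with the min t-norm as in (24); the three-object dataset is hypothetical:

```python
import numpy as np

def similarity_relation(col):
    """Relation (25): mu_{R_a}(x, y) = 1 - |a(x) - a(y)| / (a_max - a_min)."""
    col = np.asarray(col, dtype=float)
    return 1.0 - np.abs(col[:, None] - col[None, :]) / (col.max() - col.min())

def fuzzy_relation(data):
    """Combine per-feature similarity relations with the min t-norm (24)."""
    return np.minimum.reduce([similarity_relation(data[:, j])
                              for j in range(data.shape[1])])

# Three objects, two made-up normalised features
data = np.array([[0.0, 1.0],
                 [0.5, 1.0],
                 [1.0, 0.0]])
R = fuzzy_relation(data)
print(R[0, 1], R[0, 2])  # 0.5 0.0
```

The resulting matrix is symmetric with ones on the diagonal: every object is fully similar to itself, and objects 1 and 3, which disagree strongly on both features, receive similarity 0.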

(2) Fuzzy Boundary Region (FBR)-Based Feature Selection. Proposed by Jensen and Shen [12], the FBR algorithm is based on the membership degree to the fuzzy boundary region, expressed as the difference between the upper and lower approximations:

$\mu_{BND_{R_P}(X)}(x) = \mu_{\overline{R_P}X}(x) - \mu_{\underline{R_P}X}(x)$, (30)

with the fuzzy negative region for all decision concepts estimated using

$\mu_{NEG_{R_P}(Q)}(x) = 1 - \sup_{X \in U/Q} \mu_{\overline{R_P}X}(x)$. (31)

From (30), the uncertainty for a concept $X$ based on the features in $P$ and the total uncertainty degree for all concepts are estimated using (32) and (33), respectively:

$U_P(X) = \dfrac{\sum_{x \in U} \mu_{BND_{R_P}(X)}(x)}{|U|}$, (32)
$\lambda_P(Q) = \dfrac{\sum_{X \in U/Q} U_P(X)}{|U/Q|}$. (33)

Thus, the FBR algorithm utilises the total uncertainty degree $\lambda_P(Q)$ for all concepts of a feature subset $P$ and decision attribute $Q$ for selecting the optimal features.

(3) Vaguely Quantified Rough Set (VQRS)-Based Feature Selection. The VQRS was proposed by Cornelis and Jensen [72] with the notion that the standard FRS, as well as some of its extensions, is sensitive to noise, and hence the reduction process (specifically, the lower and upper approximations, (22) and (23)) may be highly influenced when there is a change in any single object. Hence, to remedy this shortfall, the VQRS replaces the fuzzy lower (22) and upper (23) approximations with (34) and (35), respectively:

$\mu_{\underline{R_P}^{Q_u}X}(x) = Q_u\left(\dfrac{|R_P x \cap X|}{|R_P x|}\right)$, (34)
$\mu_{\overline{R_P}^{Q_l}X}(x) = Q_l\left(\dfrac{|R_P x \cap X|}{|R_P x|}\right)$, (35)

where $Q_u$ and $Q_l$ are the fuzzy quantifiers for the lower and upper approximations, respectively, and $R_P x$ is the fuzzy set of objects related to $x$ under $R_P$. The performance of the VQRS model is dependent on the values of the $\alpha$ and $\beta$ parameters of the quantifiers, as expressed in

$Q_{(\alpha, \beta)}(x) = \begin{cases} 0, & x \le \alpha, \\ \dfrac{2(x - \alpha)^2}{(\beta - \alpha)^2}, & \alpha \le x \le \dfrac{\alpha + \beta}{2}, \\ 1 - \dfrac{2(x - \beta)^2}{(\beta - \alpha)^2}, & \dfrac{\alpha + \beta}{2} \le x \le \beta, \\ 1, & \beta \le x, \end{cases}$ (36)

where $0 \le \alpha < \beta \le 1$ are parameter values.
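The smooth quadratic quantifier family used by VQRS can be sketched directly; the parameter pair below (a "most"-style quantifier) is illustrative:

```python
def quantifier(x, alpha, beta):
    """Smooth fuzzy quantifier Q_(alpha, beta): 0 below alpha, 1 above
    beta, with a quadratic S-shaped transition in between."""
    if x <= alpha:
        return 0.0
    if x >= beta:
        return 1.0
    if x <= (alpha + beta) / 2.0:
        return 2.0 * ((x - alpha) / (beta - alpha)) ** 2
    return 1.0 - 2.0 * ((x - beta) / (beta - alpha)) ** 2

# Membership of increasing overlap ratios under an illustrative Q_(0.2, 1.0)
for r in (0.2, 0.6, 1.0):
    print(quantifier(r, 0.2, 1.0))  # 0.0, then 0.5, then 1.0
```

Because the quantifier depends only on the overall overlap ratio, flipping one noisy object changes the approximation gradually rather than abruptly, which is the robustness VQRS is after.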

(4) Fuzzy Ordered Weighted Average (FOWA)-Based Feature Selection. Similar to the VQRS, the Ordered Weighted Average (OWA) variant of FRFS was proposed by Cornelis et al. [73] for addressing noisy and outlying samples. In the FOWA computation, the lower and upper approximations are estimated using the aggregation technique of the OWA estimator (39), as indicated in (37) and (38), respectively:

$\mu_{\underline{R_P}X}(x) = OWA_{W_L}\langle I(\mu_{R_P}(x, y), \mu_X(y)) \rangle_{y \in U}$, (37)
$\mu_{\overline{R_P}X}(x) = OWA_{W_U}\langle T(\mu_{R_P}(x, y), \mu_X(y)) \rangle_{y \in U}$, (38)
$OWA_W(v_1, \ldots, v_p) = \sum_{i=1}^{p} w_i c_i$, (39)

where $c_i$ is the $i$-th largest value in $\langle v_1, \ldots, v_p \rangle$ and $W = \langle w_1, \ldots, w_p \rangle$ is the weighting vector such that $w_i \in [0, 1]$ and $\sum_{i=1}^{p} w_i = 1$. For the lower and upper approximations, it is possible to define $W_L$ and $W_U$ as (40) and (41), respectively:

$W_L = \langle w_1^L, \ldots, w_p^L \rangle$ with $w_i^L = \dfrac{2^{i-1}}{2^p - 1}$, (40)
$W_U = \langle w_1^U, \ldots, w_p^U \rangle$ with $w_i^U = \dfrac{2^{p-i}}{2^p - 1}$, (41)

so that $W_L$ emphasises the smallest values (a soft minimum) and $W_U$ the largest values (a soft maximum).
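The OWA aggregation of (39) is simply a weighted sum over the sorted values; with weights biased toward the largest or smallest entries it behaves as a soft maximum or soft minimum. A minimal sketch with illustrative exponential weights (an assumption here, not necessarily the weight vectors of [73]):

```python
import numpy as np

def owa(values, weights):
    """OWA_W(V): weighted sum over values sorted in descending order (39)."""
    return float(np.dot(weights, np.sort(np.asarray(values, float))[::-1]))

def exp_weights(p, descending):
    """Illustrative exponential weight vectors: descending weights
    emphasise the largest values (soft max, upper approximation);
    ascending weights the smallest (soft min, lower approximation)."""
    w = 2.0 ** (np.arange(p, 0, -1) if descending else np.arange(1, p + 1))
    return w / w.sum()

vals = [0.9, 0.3, 0.6, 0.1]
print(round(owa(vals, exp_weights(4, True)), 3))   # 0.687 (soft max)
print(round(owa(vals, exp_weights(4, False)), 3))  # 0.273 (soft min)
```

Unlike a hard sup or inf, every value contributes to the aggregate, so a single noisy membership cannot dominate the approximation.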

(5) Fuzzy Variable Precision Rough Set (FVPRS)-Based Feature Selection. The FVPRS was developed on the concept of addressing the sensitivity of the standard FRS to misclassification (errors or missing values) and perturbation (noise) [70]. For these reasons, the FVPRS was proposed by hybridising the FRS with the Variable Precision Rough Set (VPRS), which is known as the first model for handling class noise in data [75]. Thus, the FVPRS not only addresses data uncertainty but also is less sensitive to misclassification and perturbation. In the FVPRS, the lower and upper approximations are defined in (42) and (43), respectively:

$\mu_{\underline{R_P}_{\alpha}X}(x) = \inf_{\mu_X(y) \le \alpha} I(\mu_{R_P}(x, y), \alpha) \wedge \inf_{\mu_X(y) > \alpha} I(\mu_{R_P}(x, y), \mu_X(y))$, (42)
$\mu_{\overline{R_P}_{\alpha}X}(x) = \sup_{\mu_X(y) \ge 1 - \alpha} T(\mu_{R_P}(x, y), 1 - \alpha) \vee \sup_{\mu_X(y) < 1 - \alpha} T(\mu_{R_P}(x, y), \mu_X(y))$, (43)

where $\alpha$, $I$, and $T$ are the variable precision parameter, implicator, and $t$-norm operators, respectively.

(6) β-Precision Fuzzy Rough Set (BPFRS)-Based Feature Selection. The BPFRS algorithm, as proposed by Salido and Murakami [74], uses the concept of β-precision aggregation as a generalisation of Ziarko’s Variable Precision Rough Set (VPRS) [76] for addressing uncertainty in huge datasets. In its implementation, the β-precision quasi-$t$-norm $T_\beta$ and β-precision quasi-$t$-conorm $S_\beta$ are utilised in defining the β-precision versions of the fuzzy lower and upper approximations of a fuzzy set $X$ in $U$, expressed as (44) and (45), respectively:

$\mu_{\underline{P_\beta}X}(F_i) = T_{\beta,\, x \in U}\big(\max\{1 - \mu_{F_i}(x), \mu_X(x)\}\big)$, (44)
$\mu_{\overline{P_\beta}X}(F_i) = S_{\beta,\, x \in U}\big(\min\{\mu_{F_i}(x), \mu_X(x)\}\big)$, (45)

where $T_\beta$ and $S_\beta$ are the β-precision quasi-$t$-norm and β-precision quasi-$t$-conorm, respectively, such that, given a $t$-norm $T$, a $t$-conorm $S$, and $\beta \in [0, 1]$, $T_\beta$ and $S_\beta$ of order $n$ are mappings $[0, 1]^n \to [0, 1]$ with, for all $(x_1, \ldots, x_n) \in [0, 1]^n$, $T_\beta(x_1, \ldots, x_n) = T(y_1, \ldots, y_{n-m})$ and $S_\beta(x_1, \ldots, x_n) = S(z_1, \ldots, z_{n-p})$. Here, $y_i$ is the $i$-th greatest element of $\{x_1, \ldots, x_n\}$, $z_i$ is the $i$-th smallest element of $\{x_1, \ldots, x_n\}$, $m = \max\{i \in \{0, \ldots, n\} \mid i \le (1 - \beta)\sum_{j} x_j\}$, and $p = \max\{i \in \{0, \ldots, n\} \mid i \le (1 - \beta)\sum_{j}(1 - x_j)\}$.

3.3. Stacked Ensemble

Considering the fact that the primary goal of every decision support system is to always produce reliable and accurate outcomes for every classification task, researchers are constantly faced with the decision of selecting the most accurate classifier for various tasks even though classifiers are task-specific [76, 77]. In practice, the ideal approach would be trial and error, where candidate classifiers are tested on a given problem to ascertain and select the one which produces the most accurate result. While this trial-and-error approach may eventually yield an optimal outcome, it demands a lot of resources such as time. For these reasons, considerable resources have been dedicated to the combination of different classifiers (ensembles or metaclassification) as an alternative for achieving the desired results. Thus, the strategy of the ensemble is to systematically combine the outputs of classifiers such that it yields better results than any of the single classifiers.

Stacking, also known as stacked generalisation, which was proposed by Wolpert [78], can be considered among the most influential of such ensemble schemes [79]. That is, unlike other metaclassification schemes such as bagging and boosting, which generate attributes via the same (homogeneous) learning algorithm and combine their outputs based on a predetermined (nontrainable) scheme, stacking generates attributes from the predictions of multiple (heterogeneous) base learners and subsequently combines these outputs through one metaclassifier [77]. Hence, each learner uses a different approach to obtaining knowledge, biases, and so on, and as such explores the hypothesis space from a diverse perspective. As a result, the classification from the resulting stacked ensemble is known to yield a more accurate and robust outcome than any of the individual base classifiers. Consequently, stacked ensembles provide a naturally suited remedy for many large-scale data analysis and autonomous systems which fuse heterogeneous data from multiple sources [80].

In combining the predictions from the multiple heterogeneous base classifiers, stacking utilises the concept of a metaclassifier (Level-1) to generate the final output by modelling k-fold-like cross-validated predictions (meta instances) from the base classifiers (Level-0), as shown in Figure 5. First, the selected features from FRFS were split into training, validation, and testing sets. Using the training dataset, each base classifier is trained, and predictions are made on the validation set. The predictions from the base classifiers are then used as features for training the metalearner. This implies that the metaclassifier is trained to synergistically combine the outputs of the base classifiers for predicting out-of-sample instances (test set).

However, one major challenge with stacking is the optimal combination of base classifiers and the choice of a single metaclassifier, since the configuration is application-specific [77, 81–83]. The challenge is further complicated when dealing with a large search space [77]. In the literature, the usage of diverse or heterogeneous base classifiers for obtaining the meta instances has been noted to efficiently address this challenge [84]. Hence, as the base learners, this research employs diverse learners, namely, the SVM, the Multilayer Perceptron (MLP), k-Nearest Neighbour (k-NN), C4.5 Decision Tree (C4.5 DT), Logistic Regression (LR), and the LDA. Regarding the metalearner, complex metaclassifiers are rarely used in the literature since they are likely to overfit the predictions from the base classifiers. For this reason, simple models such as Logistic Regression (LR) are utilised owing to their piecewise linear approximations, lower susceptibility to overfitting, and simpler interpretation of the resulting output [85]. The schematic of the proposed stacked ensemble is shown in Figure 6.
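As an illustration, the stacking scheme described above can be sketched with scikit-learn's StackingClassifier. The synthetic dataset, the hyperparameters, and the use of CART (DecisionTreeClassifier) as a stand-in for C4.5 are assumptions for illustration only, not the study's exact configuration:

```python
# Minimal stacking sketch: heterogeneous base learners (Level-0) feed a
# logistic-regression metaclassifier (Level-1) via cross-validated predictions.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

base_learners = [
    ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
    ("mlp", MLPClassifier(max_iter=1000, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("dt", DecisionTreeClassifier(random_state=0)),  # CART stands in for C4.5
    ("lr", LogisticRegression(max_iter=1000)),
    ("lda", LinearDiscriminantAnalysis()),
]
# cv=5 produces the k-fold out-of-fold meta instances for the Level-1 learner
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_tr, y_tr)
test_acc = stack.score(X_te, y_te)
```

Note that the metalearner never sees the raw features here, only the cross-validated predictions of the six base learners, which is what guards against the Level-1 model overfitting to base-learner training error.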

3.3.1. Support Vector Machine (SVM)

The concept of SVM is to construct a hyperplane (decision boundary) in a high-dimensional space, where the best hyperplane is the one that partitions the data into distinct classes with the largest separation between them [25, 26, 28]. The technique is known for its high performance, efficiency, and robustness in both classification and regression tasks; thus, SVM is one of the most preferred classifiers in the field of predictive maintenance.

In SVM computation, the extent of the separation (margin) is determined and maximised via a kernel function (i.e., linear, radial basis, polynomial, or sigmoid). Here, the most frequently used kernel function, the Radial Basis Function (RBF), is adopted for its excellent general performance, wide convergence domain, high resolution power, and small number of parameters [86–88]. The RBF kernel function is expressed as

\[ K(x_i, x_j) = \exp\left(-\gamma \left\| x_i - x_j \right\|^2\right), \tag{46} \]

where, for a given training sample \(x_i\) and response vector \(y_i \in \{-1, +1\}\), −1 and +1 represent samples from the negative and positive classes, respectively, and \(\gamma > 0\) is the kernel width parameter.

The SVM problem is formulated as the minimisation problem

\[ \min_{w,\, b,\, \xi} \; \frac{1}{2}\left\| w \right\|^2 + C \sum_{i=1}^{n} \xi_i, \tag{47} \]

subject to

\[ y_i\left( w^{T}\phi(x_i) + b \right) \geq 1 - \xi_i, \qquad \xi_i \geq 0. \tag{48} \]

Here, \(\phi(\cdot)\) represents a nonlinear mapping function for transforming \(x_i\) to a high-dimensional space, \(w\) and \(b\) are the weight vector and bias, respectively, and \(\xi_i\) are slack variables. Hence, the SVM aims to estimate \(w\) and \(b\) such that the maximum separation (margin) of the classes into distinct groups is achieved. However, in cases where the maximum separation is suboptimal, soft margins, which are represented as the inequalities (49)–(51), are used:

\[ w^{T}\phi(x_i) + b \geq +1 - \xi_i \quad \text{for } y_i = +1, \tag{49} \]
\[ w^{T}\phi(x_i) + b \leq -1 + \xi_i \quad \text{for } y_i = -1, \tag{50} \]
\[ \xi_i \geq 0 \quad \forall i. \tag{51} \]

When an error occurs, \(\xi_i > 1\), so \(\sum_i \xi_i\) is an upper bound on the number of training errors, controlled through the Lagrangian shown in

\[ L = \frac{1}{2}\left\| w \right\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i\left( w^{T}\phi(x_i) + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{n} \mu_i \xi_i, \tag{52} \]

where \(\alpha_i \geq 0\) are the Lagrange multipliers for estimating the positive values of \(\xi_i\).
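As a quick illustration of the RBF kernel computation above (a minimal sketch; the sample points and the value of γ are arbitrary):

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=0.5):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.0])
k_self = rbf_kernel(a, a)  # identical points give the maximum kernel value, 1
k_ab = rbf_kernel(a, b)    # decays with the squared Euclidean distance
```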

3.3.2. Multilayer Perceptron (MLP)

The MLP neural networks are the most frequently used feedforward neural networks due to their simplicity, efficiency, and versatility in various research problems, including predictive maintenance tasks [23, 24, 89]. The MLP is generally organised in three layers: input, hidden, and output. The layers are connected through a summated linear combination of weight and bias vectors, expressed mathematically as

\[ net_j = \sum_{i=1}^{n} w_{ij} x_i + b_j, \tag{53} \]

where the weight \(w_{ij}\), ranging between [−1, 1], connects the \(i\)th node of the input layer and the \(j\)th node of the hidden layer, \(x_i\) is the \(i\)th input feature with \(n\) number of inputs, and \(b_j\) is the bias (threshold) of the \(j\)th hidden node.

The output of each hidden node is then estimated using an activation function. Though there are several forms of activation functions for MLP, this paper utilises the most commonly used sigmoid function as shown in

\[ h_j = f(net_j) = \frac{1}{1 + e^{-net_j}}. \tag{54} \]

Based on the outputs from the hidden nodes \(h_j\), the final outputs are then estimated using

\[ y_k = f\left( \sum_{j=1}^{m} w_{jk} h_j + b_k \right). \tag{55} \]

With this setup, the estimates of the weights and biases are fine-tuned and updated until minimal error is obtained.
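The forward pass described above (weighted sum, sigmoid activation, then the output layer) can be sketched as follows; the layer sizes and the random weights in [−1, 1] are illustrative:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Single hidden layer: net = W1 @ x + b1, h = sigmoid(net), output likewise."""
    h = sigmoid(W1 @ x + b1)      # hidden activations, each in (0, 1)
    return sigmoid(W2 @ h + b2)   # output activations, each in (0, 1)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=3)                 # 3 input features
W1, b1 = rng.uniform(-1, 1, (4, 3)), rng.uniform(-1, 1, 4)  # 4 hidden nodes
W2, b2 = rng.uniform(-1, 1, (2, 4)), rng.uniform(-1, 1, 2)  # 2 output nodes
out = mlp_forward(x, W1, b1, W2, b2)
```

During training, the weights and biases above would be fine-tuned (e.g., by backpropagation) until minimal error is obtained; only the forward computation is shown here.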

3.3.3. k-Nearest Neighbour (k-NN)

The k-NN is a nonparametric supervised learner and perhaps the simplest form of classification, as it is based on the concept of similarity (distance between objects). As such, it is often used in the field of predictive maintenance [90, 91]. In k-NN model computation, the commonly used distance measure is the Euclidean distance, expressed as

\[ d(x, x') = \sqrt{\sum_{i=1}^{n} \left( x_i - x'_i \right)^2}, \tag{56} \]

where \(x\) is the \(n\)-dimensional input vector and \(x'\) is an unknown or test sample.
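A minimal k-NN sketch based on the Euclidean distance above; the toy points and condition labels are illustrative only:

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, labels, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    ranked = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

train = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.2, 0.1)]
labels = ["healthy", "healthy", "faulty", "faulty", "healthy"]
pred = knn_predict(train, labels, query=(0.05, 0.05), k=3)  # near the "healthy" cluster
```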

3.3.4. C4.5 Decision Tree (C4.5 DT)

C4.5 DT is a nonparametric supervised learner consisting of decision and leaf nodes. In C4.5 DTs, the model is developed in a flowchart-like scheme where the search space is broken down into smaller subspaces while an associated decision tree is incrementally developed [92]. C4.5 DT has been employed in various predictive maintenance problems [93, 94].

The computation of C4.5 DT follows the concept of divide-and-conquer using a set of training data \(x_i\) with a corresponding class vector \(y\). The feature space of the training data is recursively partitioned such that instances with the same class output are grouped together [95]. Suppose that \(Q\) denotes the data at node \(m\) comprising \(N_m\) instances; the splitting of each candidate \(\theta = (j, t_m)\), consisting of a feature \(j\) and threshold \(t_m\), is achieved by partitioning the data into \(Q_{left}(\theta)\) and \(Q_{right}(\theta)\) subsets as shown in (57) and (58), respectively:

\[ Q_{left}(\theta) = \left\{ (x, y) \mid x_j \leq t_m \right\}, \tag{57} \]
\[ Q_{right}(\theta) = Q \setminus Q_{left}(\theta). \tag{58} \]

Using (57) and (58), the quality of a candidate split of node \(m\) is then estimated using

\[ G(Q, \theta) = \frac{n_{left}}{N_m} H\left( Q_{left}(\theta) \right) + \frac{n_{right}}{N_m} H\left( Q_{right}(\theta) \right), \tag{59} \]

where \(H(\cdot)\) is the impurity function expressed by either of the three common impurity measures shown in (60) to (62), respectively:

\[ H(Q) = \sum_{k} p_{mk}\left( 1 - p_{mk} \right) \quad \text{(Gini)}, \tag{60} \]
\[ H(Q) = -\sum_{k} p_{mk} \log\left( p_{mk} \right) \quad \text{(entropy)}, \tag{61} \]
\[ H(Q) = 1 - \max_{k}\left( p_{mk} \right) \quad \text{(misclassification)}, \tag{62} \]

where \(p_{mk}\) is the proportion of class \(k\) observations in node \(m\), expressed as

\[ p_{mk} = \frac{1}{N_m} \sum_{x_i \in Q} I\left( y_i = k \right). \tag{63} \]

The objective of the algorithm is to select the parameters \(\theta\) that minimise the impurity as shown in

\[ \theta^{*} = \operatorname*{argmin}_{\theta} \; G(Q, \theta). \tag{64} \]

The subsets \(Q_{left}(\theta^{*})\) and \(Q_{right}(\theta^{*})\) are recursed until the maximum allowable depth is reached or \(N_m = 1\).
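The three impurity measures can be computed directly, as in this minimal sketch; the base-2 logarithm is assumed for the entropy, as is common:

```python
import math

def gini(p):
    """Gini index: sum of p_k * (1 - p_k) over class proportions p."""
    return sum(pk * (1.0 - pk) for pk in p)

def entropy(p):
    """Entropy: -sum of p_k * log2(p_k), with 0 * log 0 taken as 0."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def misclassification(p):
    """Misclassification error: 1 - max p_k."""
    return 1.0 - max(p)

pure = [1.0, 0.0]   # a node holding a single class: zero impurity
even = [0.5, 0.5]   # a maximally mixed two-class node: maximum impurity
```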

3.3.5. Logistic Regression (LR)

LR is a commonly used machine learning technique for classifying responses with binary outcomes. The LR model was originally proposed for modelling population growth, yet its application over the years is found in numerous disciplines due to its simplicity and effectiveness. Despite the simple nature of the classifier, it possesses some desirable characteristics such as high accuracy, interpretability, and less influence from errors in data when optimally trained [96]. Hence, it is commonly used in stacked generalisation, mostly as a metalearner, due to its easy interpretability [97–100].

Given a matrix of inputs \(X\) with a corresponding binary response vector \(y \in \{0, 1\}^n\), the LR model aims to minimise the negative log-likelihood of the data, expressed as

\[ \ell(\beta) = -\sum_{i=1}^{n} \left[ y_i \log p(x_i) + \left( 1 - y_i \right) \log\left( 1 - p(x_i) \right) \right], \tag{65} \]

where \(p(x_i)\) is defined as

\[ p(x_i) = \frac{1}{1 + e^{-x_i^{T}\beta}}. \tag{66} \]

The regression coefficients are expressed as \(\beta = \left( \beta_0, \beta_1, \ldots, \beta_p \right)^{T}\), with the convention that \(x_{i0} = 1\) so that \(\beta_0\) serves as the intercept.
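A sketch of the LR negative log-likelihood under the stated intercept convention; the toy data and coefficient values are illustrative only:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(beta, X, y):
    """l(beta) = -sum[ y_i log p_i + (1 - y_i) log(1 - p_i) ],
    with p_i = sigmoid(x_i . beta) and x_i[0] = 1 for the intercept."""
    nll = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(b * xv for b, xv in zip(beta, x_i)))
        nll -= y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return nll

X = [(1.0, 0.2), (1.0, 1.5), (1.0, -0.7)]  # leading 1.0 is the intercept term
y = [0, 1, 0]
loss = neg_log_likelihood([0.0, 0.0], X, y)  # beta = 0 gives p_i = 0.5 everywhere
```

Training an LR model amounts to searching for the β that minimises this quantity; any coefficient vector that explains the labels better than β = 0 yields a strictly lower loss.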

3.3.6. Linear Discriminant Analysis (LDA)

LDA is among the predominantly used classifiers for predictive maintenance purposes [29, 31, 101]. Its concept exploits statistical features such as the mean and covariance matrix of each class and then utilises mathematical processes and functions for classifying multiple classes. In LDA, the computation starts by calculating the within-class (67) and between-class (68) scatter matrices:

\[ S_w = \sum_{i=1}^{C} P_i \Sigma_i, \tag{67} \]
\[ S_b = \sum_{i=1}^{C} P_i \left( \mu_i - \mu \right)\left( \mu_i - \mu \right)^{T}. \tag{68} \]

However, \(\Sigma_i\) in (67) is the covariance matrix of class \(i\), estimated using

\[ \Sigma_i = \frac{1}{N_i} \sum_{x \in \text{class } i} \left( x - \mu_i \right)\left( x - \mu_i \right)^{T}. \tag{69} \]

Hence, (67) is expanded as shown in

\[ S_w = \sum_{i=1}^{C} \frac{P_i}{N_i} \sum_{x \in \text{class } i} \left( x - \mu_i \right)\left( x - \mu_i \right)^{T}, \tag{70} \]

where \(S_w\) and \(S_b\) are the scatter matrices within and between the classes, respectively, \(P_i\) is the a priori probability of class \(i\), \(\mu_i\) and \(\mu\) are the mean of class \(i\) and the grand mean, respectively, and \(x\) is the input vector.

Using (67) and (68), a discriminatory power value is estimated for deriving a projection matrix \(W\) that maximises the following equation:

\[ J(W) = \frac{\left| W^{T} S_b W \right|}{\left| W^{T} S_w W \right|}. \tag{71} \]

That is, (71) is maximised when the eigenvectors of the matrix \(S_w^{-1} S_b\) are estimated. This results in the linear discriminant function as shown in

\[ y = W^{T} x. \tag{72} \]
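The scatter-matrix computation and the eigen-decomposition that yields the projection can be sketched as follows; the two-class Gaussian toy data are illustrative only:

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class S_w and between-class S_b scatter, weighted by class priors."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    grand_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w, S_b = np.zeros((d, d)), np.zeros((d, d))
    for c, P_i in zip(classes, priors):
        Xc = X[y == c]
        mu_i = Xc.mean(axis=0)
        S_w += P_i * np.cov(Xc, rowvar=False, bias=True)  # P_i * class covariance
        diff = (mu_i - grand_mean).reshape(-1, 1)
        S_b += P_i * (diff @ diff.T)
    return S_w, S_b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
S_w, S_b = scatter_matrices(X, y)
# The projection maximising |W^T S_b W| / |W^T S_w W| comes from the
# eigenvectors of S_w^{-1} S_b, as stated above.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
```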

3.4. Classification Performance Evaluation

For the purpose of reliability, nine evaluation metrics, namely, accuracy, error rate, sensitivity, specificity, precision, F-score, Matthews correlation coefficient (MCC), geometric mean, and area under curve (AUC), were used.

3.4.1. Accuracy

Accuracy is one of the most widely used evaluation metrics for assessing the performance of classification algorithms [102–104]. Classification accuracy is expressed as follows:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{73} \]

where \(TP\) is the number of correct classification counts when there is a fault condition, \(TN\) is the number of correct classification counts when there is no fault condition, \(FN\) is the number of misclassification counts when there is a fault condition, and \(FP\) is the number of misclassification counts when there is no fault condition for a specific classification model.

3.4.2. Error Rate

This metric is one of the main indicators for evaluating classification performance by measuring the errors (misclassifications) incurred by a classifier [104–107]. The error rate is expressed as shown in

\[ \text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN}. \tag{74} \]

3.4.3. Precision

The precision of a classifier measures the exactness of classification after prediction [104, 107, 108]. The metric is expressed as the ratio of true positives to the sum of true positives and false positives, as shown in

\[ \text{Precision} = \frac{TP}{TP + FP}. \tag{75} \]

3.4.4. Sensitivity

Recall, also known as sensitivity, is a measure of a classifier's capacity to determine positive instances. It measures the fraction of positive instances that are correctly classified [104, 108, 109] and is expressed as the ratio of true positives to the sum of true positives and false negatives, as shown in

\[ \text{Sensitivity} = \frac{TP}{TP + FN}. \tag{76} \]

3.4.5. Specificity

Specificity measures the fraction of negative instances that are correctly classified [108, 110]. That is, the metric denotes the test's ability to identify negative results, as expressed in

\[ \text{Specificity} = \frac{TN}{TN + FP}. \tag{77} \]

3.4.6. F-Score

The F-score describes the overall performance of a classification model as the harmonic mean of precision and recall [104, 107, 108]. The metric ranges from zero to one, with high values indicating high classification performance and vice versa. The F-score is given in

\[ F\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{78} \]

3.4.7. Matthews Correlation Coefficient (MCC)

MCC (79) measures the relationship between the observed and the predicted classifications and is generally regarded as a balanced metric for evaluating classifier performance even with varying class sizes [111, 112]. Compared with other classification evaluation metrics, MCC is known to be more informative as it considers the balance ratios of the four confusion matrix categories. The MCC coefficient ranges from −1 to +1, with +1 suggesting a perfect classification while −1 implies a total misclassification:

\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{\left( TP + FP \right)\left( TP + FN \right)\left( TN + FP \right)\left( TN + FN \right)}}. \tag{79} \]

3.4.8. Geometric Mean (GM)

GM (80) is an aggregate of both the sensitivity and specificity evaluation metrics. The metric seeks to maximise the rates of true positive and true negative instances while maintaining a balance between both rates [113, 114], thus making the metric suitable for imbalanced datasets:

\[ \text{GM} = \sqrt{\text{Sensitivity} \times \text{Specificity}}. \tag{80} \]

3.4.9. Area under Curve (AUC)

In evaluating the classification performance of classifiers, metrics such as accuracy, specificity, and precision are prone to changes in the class distribution of the test data; thus, they are not always robust [115]. That is, they may underperform when the ratio of positive to negative instances changes. For this reason, the AUC [116] metric, which is insensitive to changes in class distribution (nonparametric), is also employed. The AUC statistic is expressed as shown in

\[ \text{AUC} = \frac{1}{P \cdot N} \sum_{i=1}^{P} \sum_{j=1}^{N} I\left( s_i^{+} > s_j^{-} \right), \tag{81} \]

where \(P\) and \(N\) are the numbers of positive and negative instances, respectively, \(s_i^{+}\) are the scores predicted by the model for the positive instances, \(s_j^{-}\) are the scores predicted by the model for the negative instances, and \(I(\cdot)\) is an indicator function satisfying \(I(\text{true}) = 1\) and \(I(\text{false}) = 0\).
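The confusion-matrix metrics of this section, together with the pairwise-ranking form of AUC, can be collected into one sketch; the confusion counts and scores below are illustrative only:

```python
import math

def metrics(tp, tn, fp, fn):
    """Binary confusion-matrix metrics as defined in Section 3.4."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "gm": math.sqrt(recall * specificity),
    }

def auc(pos_scores, neg_scores):
    """Rank statistic: fraction of (positive, negative) pairs ranked correctly."""
    wins = sum(1 for sp in pos_scores for sn in neg_scores if sp > sn)
    return wins / (len(pos_scores) * len(neg_scores))

m = metrics(tp=40, tn=45, fp=5, fn=10)
```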

4. Results and Discussion

4.1. Fuzzy Rough Set Feature Selection (FRFS)

The FRFS experiment was carried out in the R statistical environment. The algorithms were executed on an i3 ASUS computer with a 2.3 GHz processor and 16 GB memory. Table 2 shows the number of features selected by the six FRFS methods compared in this study. As observed, between 1 and 27 features were deemed relevant enough to contain the required variability for classifying accumulator conditions. Similarly, 1 to 8 features were selected for the classification of cooler conditions, 1 to 21 features were noted as sufficiently informative for classifying pump conditions, and, concerning the valve conditions, 1 to 5 features were selected. The features selected by the various FRFS methods were noted to contain enough variability for explaining the conditions of the hydraulic system, which subsequently improves classification accuracy while reducing the computational cost, overfitting, training runtime, and uncertainty in modelling. Regarding the number of features selected, the FLA, FBR, FOWA, and FVPRS produced on average 15 features, while VQRS and BPFRS selected the fewest features on average, with 1 and 2, respectively.

Figure 7 shows the respective runtimes recorded during the implementation of the various FRFS methods for feature selection. As observed in Figure 7(a), the highest runtime range of 1884.98 to 18497.54 seconds was recorded by the various FRFS methods in producing the most informative features for classifying accumulator fault conditions, while a minimum range of 1069.64 to 4141.75 seconds was recorded for the valve conditions (Figure 7(d)). The high runtime during the FRFS implementation on the accumulator dataset suggests the high complexity and high degree of uncertainty of the data. This supports the assertion made by prior works that the classification of accumulator conditions is the most challenging among the monitored hydraulic components (accumulator, cooler, pump, and valve) [29, 44]. Similar but lesser complexity and degrees of uncertainty were also seen in the pump (Figure 7(c)) and cooler (Figure 7(b)) datasets, respectively.

4.2. Optimal FRFS Method for Various Fault Types

Based on the varying complexity and uncertainty levels indicated by the datasets of the four monitored hydraulic components (accumulator, cooler, pump, and valve), it is prudent to ensure that the optimal FRFS method for selecting the most informative features for adequate classification is employed. For this reason, the selected features from each FRFS method for the various fault conditions were used as inputs for training the proposed stacked ensemble (ESB(STK)) in order to ascertain their performance.

However, before training, the selected features were split into train and test sets with a ratio of 70 : 30. The train set was used for model development while the remaining test set was used for model validation purposes. Due to the stochastic learning characteristics of the base classifiers used for the stacked ensemble, slightly different outputs may be obtained after each run. As a result, the proposed stacked ensemble (ESB(STK)) was trained 10 times and the final test output is presented as the average over the 10 runs. Finally, to ensure that the optimal hyperparameters for the base classifiers as well as the generaliser were set, the random grid search algorithm [117] was implemented. Tables 3 to 6 show the test performance of the features selected by the various FRFS methods when used as inputs for classifying the accumulator, cooler, pump, and valve conditions.
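This evaluation protocol (70 : 30 split, randomised hyperparameter search, averaging over 10 runs) can be sketched as follows; for brevity a single SVM stands in for the full stacked ensemble, and the dataset and search space are illustrative placeholders, not those of the study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = []
for run in range(10):                           # 10 repeated runs, then averaged
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=run)  # 70 : 30 split
    search = RandomizedSearchCV(
        SVC(kernel="rbf"),
        param_distributions={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
        n_iter=5, cv=3, random_state=run)       # random search over the grid
    search.fit(X_tr, y_tr)                      # tune on the train set only
    scores.append(search.score(X_te, y_te))     # evaluate on the held-out test set

mean_test_acc = float(np.mean(scores))
```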

Table 3 shows the test performance of the various FRFS methods when their corresponding selected features were used as inputs to classify accumulator conditions. As observed, all the competing FRFS methods except BPFRS produced satisfactory classification accuracies greater than 90%. In comparison (Table 3), the VQRS proved superior to the other explored FRFS methods in classifying the accumulator conditions. The small subset of four features (refer to Table 2) selected by the VQRS implies less complexity and runtime during the learning and implementation phases of the proposed stacked ensemble. Accordingly, among the FRFS methods (excluding BPFRS), the VQRS executed with a low runtime of 3989.43 seconds (Figure 7(a)). It can be further deduced from the results (Table 3) that the high performance, low runtime, and few features selected by VQRS suggest that the method produces the most informative features for the adequate classification of accumulator fault conditions. Thus, the VQRS is chosen as the optimal FRFS method for dealing with the imperfections or uncertainty within the accumulator dataset.

Unlike the complex nature of the accumulator dataset, the complexity of classifying the cooler conditions was lower. This was evident in the low runtime range of 1554.47 to 6462.43 seconds during the feature selection phase of the FRFS methods (Figure 7(b)). Likewise, the variability in cooler conditions (class output) could be adequately explained by a low number of features (i.e., from 1 to 8 features out of the full feature pool). The less complex nature is further confirmed in that most of the FRFS methods, except for FLA and VQRS, produced perfect test classification scores with 100% accuracy and AUC, as shown in Table 4. Hence, any of the four FRFS methods (i.e., FBR, FOWA, FVPRS, and BPFRS) could be utilised as a feature selection method on the cooler dataset. However, for industrial applications, a high-performance system with the least computational cost and low runtime is highly preferred when developing frameworks for predictive maintenance purposes. On that note, the BPFRS, having achieved high classification performance and the least runtime (1554.47 seconds in Figure 7(b)), is selected as the optimal method for classifying cooler conditions.

A similar data complexity observed in the accumulator was seen in the pump dataset. As a result, a high runtime ranging from 1484.64 to 14026.72 seconds (Figure 7(c)) was recorded during the implementation of the FRFS methods for feature selection. Although all the FRFS methods yielded satisfactory test classification results (Table 5), the FVPRS achieved a perfect accuracy and an AUC score of 100%. For this reason, the FVPRS is chosen as the optimal FRFS method for addressing the imperfections or uncertainty in the internal pump leakage dataset.

With regards to the classification of the valve conditions based on features selected by the various FRFS methods, all the methods accurately classified the reserved test dataset as shown in Table 6. This implies that the variability in valve conditions could be explained adequately by any of the feature(s) selected by the FRFS methods. However, based on minimum runtime and less complexity in learning, the FVPRS which recorded the least implementation runtime (1069.64 seconds in Figure 7(d)) is chosen to be the optimal method in the case of the valve dataset.

4.3. Comparison of Proposed Stacked Ensemble (ESB(STK)) with Existing Classifiers

In order to ascertain the generalisability of the proposed stacked ensemble (ESB(STK)), a comparative analysis is performed against the six base classifiers (standalone) and three existing ensemble classifiers. The existing ensembles compared with the proposed ESB(STK) are Stochastic Gradient Boosting (SGB) [118], AdaBoost (ADB) [119], and Bagging (BAG) [120]. In all, the proposed ESB(STK) was compared with nine well-established classifiers that are commonly used in numerous fields of autonomous systems such as pattern recognition, system monitoring, and fault identification, among others. Tables 7–10 show the classification test results for the four hydraulic components. It is important to note that the selected optimal FRFS methods for classifying the various conditions of the four monitored hydraulic components were used. Here, the VQRS-selected features were used as input to train the nine classifiers (SVM, MLP, KNN, C4.5 DT, LR, LDA, SGB, ADB, and BAG) for classifying the operational conditions of the accumulator. Similarly, BPFRS-selected features were used as input to classify cooler conditions. For the pump and valve, the FVPRS-selected features were used in the aforementioned classifiers. For the purposes of clarity, the FRFS method that produced the optimal features extracted for each hydraulic component will be indicated alongside the stacked ensemble, ESB(STK).

Table 7 shows the classification performance of the proposed VQRS-ESB(STK) with the competing classifiers on accumulator conditions. As observed, though almost all the classifiers under consideration yielded satisfactory results, the proposed VQRS-ESB(STK) achieved the highest classification results in terms of accuracy (0.9743), AUC (0.9951), least error rate (0.0257), and the remaining performance metrics. This was followed by the SVM (0.9698), KNN (0.9653), and SGB (0.9637), respectively. On average, the proposed VQRS-ESB(STK) gains an improvement in accuracy and AUC of about 11.28% and 4.35%, respectively.

Similar performance (Table 8) was obtained for FVPRS-ESB(STK) for classifying the internal pump conditions. It can be seen that the proposed FVPRS-ESB(STK) produced the best classification results as compared with the other competing classifiers. Among the compared classifiers, SVM, KNN, SGB, and BAG produced relatively similar results as the proposed FVPRS-ESB(STK) in terms of AUC scores (1.0000). This suggests that the SVM, KNN, SGB, and BAG classifiers could be considered alternatives to the proposed ESB(STK) when considering the probability thresholds of the classification results.

Regarding the cooler and valve conditions classification results (Tables 9 and 10), very competitive outputs were recorded for all the standalone classifiers, ensemble classifiers, and the proposed BPFRS-ESB(STK) and FVPRS-ESB(STK), respectively. In comparison, all the considered classifiers accurately classified the conditions of the cooling system of the hydraulic machinery. Similar competitive classification except for LDA and ADB classifiers was witnessed when classifying the valve conditions. These high accuracies and competitive results could be attributed to the less complicated variability in the class outputs of the cooler and valve conditions as discussed earlier in Section 4.1.

Taking into account the results from Tables 7–10, it can be ascertained that none of the standalone classifiers which served as the base learners in the stacked ensemble achieved the best classification across all monitored hydraulic components (accumulator, cooler, pump, and valve). However, the proposed framework of an FRFS method coupled with the stacked ensemble (ESB(STK)) has proven to be generalisable and versatile in the classification of all conditions of the various hydraulic components considered. Based on the generalisability and versatility of the proposed framework, one can opt for the proposed framework as opposed to investing resources into determining which standalone classifier to select for a specific task.

4.4. Impact of Base Classifiers on the Proposed ESB(STK)

In building a stacked ensemble, the selection of appropriate base classifiers is critical for ensuring goodness of fit. For this reason, it is imperative to ascertain how the proposed stacked ensemble (ESB(STK)) is affected by the quality of its base classifiers. This objective is achieved by dropping each base classifier one at a time and then assessing the resulting classification performance on the four hydraulic components (accumulator, pump, cooler, and valve). Using this approach, the base classifiers with a high impact on the overall performance of the proposed ESB(STK) were determined. Each resulting reduced stacked ensemble is referred to here as a modified ESB(STK).
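The drop-one-base-classifier procedure can be sketched as follows; a reduced set of base learners and a toy dataset are used for brevity, so the impact values are illustrative only:

```python
# Leave-one-out ablation: retrain the stack with each base learner dropped
# and record the change in test accuracy relative to the full ensemble.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

base = [("svm", SVC(probability=True, random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("lda", LinearDiscriminantAnalysis())]

def stack_accuracy(estimators):
    stk = StackingClassifier(estimators=estimators,
                             final_estimator=LogisticRegression(max_iter=1000),
                             cv=5)
    return stk.fit(X_tr, y_tr).score(X_te, y_te)

full_acc = stack_accuracy(base)
# Positive impact -> dropping that base learner hurt the ensemble.
impact = {name: full_acc - stack_accuracy([e for e in base if e[0] != name])
          for name, _ in base}
```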

Table 11 presents the impact of base classifiers on the overall performance of the proposed VQRS-ESB(STK) in classifying accumulator conditions. As observed, all the base classifiers contribute to the overall performance of ESB(STK) since none of the modified ESB(STK) achieved the same performance as the proposed VQRS-ESB(STK). This can be confirmed in Table 11 where a maximum deterioration in accuracy (1.55%) is observed when KNN is dropped, followed by MLP (1.47%), LR (1.02%), LDA (0.68%), C4.5 DT (0.46%), and SVM (0.16%), respectively. This suggests the relative contribution made by each of the base classifiers to the overall VQRS-ESB(STK). A similar observation is seen in the case of the pump conditions where the classification performance is reduced when each base learner is dropped (Table 12).

In the classification of valve conditions (Table 13), all the modified ESB(STK) showed no improvement after dropping any of the base classifiers. Similar results were observed for cooler conditions (Table 14); however, a 0.15% deterioration in accuracy was recorded when either MLP or LR was dropped from the proposed BPFRS-ESB(STK).

The summary of the relative impacts of the base classifiers on the proposed ESB(STK) in classifying the conditions of the four monitored hydraulic components is shown in Figure 8. Using the accuracy estimates from Tables 11 to 14, Figure 8 was generated from the relative impact of each base learner (i.e., the difference in accuracy between the proposed FRFS-ESB(STK) and the modified ESB(STK)). A critical observation of Figure 8 confirms that the impact of each base classifier varies among the hydraulic components. The highest impact was seen when classifying the conditions of the accumulator, followed by the internal pump leakage. A lesser impact was observed for the cooler conditions, whilst no impact was seen for the valve classification when any of the base classifiers was removed. Figure 8 thus shows the proposed ESB(STK) to be generalisable and versatile in the classification of all conditions of the various hydraulic components.

Although the proposed FRFS-ESB(STK) framework proved superior in classifying the operational conditions of the hydraulic components, the implementation of FRFS is highly constrained by the available computational resources and the number of input features. Thus, the selection of relevant features using FRFS may be challenging when implemented on high-dimensional datasets, especially when substantial computational resources such as memory and storage are unavailable. For these reasons, future works should explore FRFS alternatives with lower computational demands or autonomous feature selectors requiring no human intervention. Also, regardless of the performance of ESB(STK), the concept of stacking assumes that each base learner contributes equally to the ensemble classification. However, the discussion in Section 4.4 indicates otherwise. Hence, future works should explore stacking with varying weighting schemes. This will ensure that high-performing base learners contribute more while weaker ones contribute less to the ensemble classification.

5. Conclusion

A systematic and synergistic PdM framework based on hybrid multisensor FRFS and a stacked ensemble (ESB(STK)) has been proposed for the efficient classification of fault conditions characterised by uncertainties. The proposed hybrid framework (FRFS-ESB(STK)) improved the classification accuracy with an optimal feature subset size while reducing the computational cost, overfitting, training runtime, and uncertainty in modelling. It was, however, observed that the optimal FRFS method was task-specific; that is, no individual FRFS method produced an optimal number of features for all fault conditions due to their respective degrees of complexity and uncertainty. It was also observed that no single base classifier (standalone) achieved the best classification results across all the monitored hydraulic components, as frequently reported in research works on fault classification. However, the proposed FRFS-ESB(STK) framework proved to be generalisable and versatile in classifying the conditions of all the hydraulic components considered. Based on the results obtained, it was concluded that one can opt for the proposed framework as opposed to investing resources into determining which standalone classifier to select for a specific task, since the impact of the base classifiers varied among the hydraulic components. In summary, the proposed hybrid framework (FRFS-ESB(STK)) advances the field of PdM by effectively and efficiently fusing recorded data from multiple and varying sensors. Thus, the proposed FRFS-ESB(STK) framework has enormous potential for enhancing the reliability of autonomous systems during an unexpected sensor failure. This enhances the robustness of and confidence in fault detection, as well as expanding the sensitivity and specificity of systems to efficiently discriminate between operating conditions.
Despite the high classification rate of the proposed FRFS-ESB(STK), the framework may be limited by the computational resources required for FRFS implementation and the nonvarying weighting scheme of ESB(STK). Hence, future works should explore alternative feature selectors requiring no human intervention and stacked ensembles with varying weighting schemes.

Data Availability

The data used to support the study can be accessed publicly at the University of California Irvine (UCI) machine learning repository via http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems.


This manuscript is an extension of prior studies by Buabeng et al. [45] and a preprint available at Research Square, Buabeng et al. [46] which can currently be accessed via https://www.researchsquare.com/article/rs-600110/v1.pdf?c=1631887023000.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Albert Buabeng and Yao Yevenyo Ziggah contributed to the conceptualization of the research and the development of the methodology. Albert Buabeng performed the analysis and wrote the original draft. Anthony Simons, Nana Kena Frempong, and Yao Yevenyo Ziggah reviewed and suggested comments to improve the standard of the manuscript. All the authors have read and agreed to the published version of the manuscript.


This research is part of a PhD research funded by the Staff Development Policy at the University of Mines and Technology, Tarkwa, Ghana.