Abstract

Prognostics and health management (PHM) is a framework that offers comprehensive yet individualized solutions for managing system health. In recent years, PHM has emerged as an essential approach for achieving competitive advantages in the global market by improving reliability, maintainability, safety, and affordability. Concepts and components of PHM have been developed separately in many areas, such as mechanical engineering, electrical engineering, and statistical science, under varied names. In this paper, we provide a concise review of mainstream methods in the major aspects of the PHM framework, including updated research from both statistical science and engineering, with a focus on data-driven approaches. Real-world examples are provided to illustrate the implementation of PHM in practice.

1. Introduction

To fulfill the increasing demand for functionality and quality, modern systems are often built with overwhelming complexity. These systems often feature rich electronics and intricate interactions among subsystems/components. For example, a typical car consists of about 2,000 functional components, 30,000 parts, and 10 million lines of software code [1].

Additionally, extremely high system reliability is essential since a single failure can result in catastrophic consequences. Despite every effort made in the past, disasters keep occurring with profound implications. In June 2009, the Metro rail crash in Washington D.C. killed nine people and injured dozens more, suspected to be due to sensor circuit “anomalies” under the rail track [2]. The Brazil blackouts in November 2009 affected more than 60 million people and shut down everything from subways to light bulbs [3]. Despite explanations attributing the event to lightning, wind, and rain, it was still believed that “there was obviously some failure, either technical or human.” Other examples include the failure of an LED lighting system in Xiamen, China, two months after installation, although the manufacturer had promised a five-year lifespan for its products, not to mention the infamous sudden acceleration failures of Toyota automobiles, which have significantly damaged the company’s profit and reputation [4].

In view of the high impact and extreme costs usually associated with system failures, methods that can predict and prevent such catastrophes have long been investigated. Applications of the developed methods are common in domains such as electronics-rich systems, aerospace industries, and even public health environments [5, 6]. In general, these methodologies can all be grouped under the framework of prognostics and health management (PHM). Specifically, prognostics is the process of predicting the future reliability of a product by assessing the extent of deviation or degradation of the product from its expected normal operating conditions; health management is the process of measuring, recording, and monitoring in real time the extent of deviation and degradation from normal operating conditions [7, 8]. Different from traditional handbook-based reliability prediction methods (e.g., U.S. Department of Defense Mil-Hdbk-217 and Telcordia SR-332 (formerly [9])), which assume that the constant hazard rate of each component can be tailored by independent “modifiers” to account for various quality, operating, and environmental conditions, PHM methodologies monitor the health state in real time and dynamically update the reliability function (hazard rates) based on in situ measurements and tailored evolution models obtained from historical data. Given the success of existing PHM methodologies, it is not surprising that there is growing interest in studying new PHM techniques and applying PHM to underdeveloped domains.

Nevertheless, increasingly complex modern systems pose new challenges for PHM. One of the most prominent problems is the No Fault Found (NFF) problem (related terminologies include “cannot duplicate,” “re-test OK,” “trouble not identified,” and “intermittent malfunctions”) [10–12], particularly in electronics-rich systems. As the name suggests, it refers to the situation in which no failure/fault can be detected or replicated during laboratory tests even when the failure has been reported in the field [13]. NFF issues not only make prognosis and diagnosis extremely difficult, but can also cause skyrocketing maintenance costs. As reported by Williams et al. [13], NFF failures account for more than 85% of all field failures and 90% of overall maintenance costs in avionics; it is estimated that NFF-related activities cost the U.S. Department of Defense 2~10 billion U.S. dollars per year [14]. Evidently, NFF contributes substantially to operational costs in many different application areas. On top of the maintenance costs, potential safety hazards related to NFF are even more striking. For example, both Toyota and the National Highway Traffic Safety Administration (NHTSA) spent considerable time investigating the root causes of sudden acceleration failures in some car models, a problem that might be linked to 89 deaths in 71 crashes since 2000 according to NHTSA [15]. Unfortunately, no conclusive finding has been reached despite efforts to repeat the failures under a variety of laboratory conditions. Such intermittent faults are also suspected to be the main reason for other catastrophes, such as the Washington Metro crash and the Brazil blackouts.

Intermittent faults or NFF problems pose significant barriers to applying traditional methods for reliability prediction, which are often empirical and population based. From the aforementioned examples, we can see that intermittent faults are often tightly related to the environmental conditions and operation history of a particular individual system. They can hardly be repeated due to the unknown random disturbances involved. Therefore, laboratory testing and assessment can only provide a reference on the “average” characteristics of the whole population and are insufficient to provide accurate modeling and prediction for each individual. To reduce the maintenance cost and eliminate safety hazards caused by NFF, the paradigm of PHM needs to shift from empirical to data supported and from population based to individual based.

Alongside these challenges, the fast development of information and sensing technology has enabled the collection of many in situ measurements during operations and provided the capability of real-time data management and processing for each individual unit. These advancements provide a great opportunity to develop sophisticated models with increasing prognostic accuracy for individual items. For instance, many different types of data over the whole life cycle of a product can be easily retrieved, especially in critical applications. These data may include production process information, quality records, operation logs, and sensor measurements. Moreover, unlike the manually entered data used in the past, which were slow, costly, and error-prone to collect, most current records are automatic, accurate, and timely thanks to technological advancements. The use of Radio Frequency Identification (RFID) technology, for example, is common in supply chain distribution networks, healthcare, and even military applications, providing reliable and timely tracking or surveillance of products/components. Advanced sensor technologies also enable abundant measurements at both macro and micro scales, such as vibration, frequency response, magnetic fields, and current/voltage, to name a few.

In response to these emerging challenges as well as opportunities, this paper reviews recent advancements of PHM methodologies, with a focus on data-driven approaches, together with their applications in practice, and identifies research problems that may lead to further improvement of PHM in both theory and practice. Before we move on to the next section, we would like to use an example from real practice to better illustrate the ideas in PHM.

A Motivating Example. Bearings are found in most mechanical systems with rotational components. They provide necessary support as well as constraining the moving parts to the desired motion mode. It is hard to overemphasize the importance of keeping bearings under normal working conditions in engineering applications. The breakdown of a single bearing may cause the failure of an entire system. For example, on August 30th, 2010, a Qantas Boeing 747 aircraft departing from San Francisco International Airport encountered an accidental engine shutdown, which was later confirmed to have been caused by a fractured turbine blade and a failed bearing [16].

As a bearing tends to exhibit larger vibration as it degrades, its health condition can be assessed via the vibration signal collected by sensors. Such a signal is often referred to as a degradation signal in the PHM literature. When the amplitude of a bearing’s vibration exceeds a certain threshold, the bearing can be considered no longer suitable for further operation. Figure 1 demonstrates the degradation paths of three different bearings, where the x-axis is the working time of the bearings (in minutes), the y-axis is the average amplitude of the vibration at different harmonic frequencies, and the horizontal line is the vibration threshold considered an indicator of bearing failure [17, 18].

Unlike conventional reliability analysis, which mostly provides population-based assessments, individualized prediction results are possible by taking advantage of the degradation signal. Based on the vibration data collected up to the current time, we can build models to predict the future evolution path of the vibration signal and consequently predict the remaining useful life (RUL) of an in-service bearing by defining its failure as the first time the vibration signal hits the threshold. Unfortunately, many factors in the degradation make this task very challenging. Figure 1 demonstrates several important features of the degradation signals. For example, the degradation of bearings exhibits two distinctive phases. At the initial stage, the vibration signals emitted are small and stable. However, after a change point (often when a crack appears), the magnitude of vibration increases dramatically and features large variability. Despite similar shapes of the degradation paths, the location of the change point, the increasing rate of the vibration magnitude, and so forth vary from one bearing to another. We will return to this example in Section 3 for more technical details.

The rest of the paper is organized as follows. Section 2 reviews advancements in PHM. Section 3 uses three examples to illustrate the procedures and strategies of PHM. Section 4 concludes the paper with summary comments and future work that may drive PHM further in academia and industry.

2. Overview of Data-Driven PHM Approaches

In general, typical workflows in a PHM system can be conceptually illustrated, as shown in Figure 2. Three major tasks can be identified in the flowchart: fault diagnostics, prognostics, and condition-based maintenance. The first task is to diagnose and identify the root causes of system failures. The root causes identified can provide useful information for prognostic models as well as feedback for system design improvement. The second task takes the processed data and existing system models or failure mode analysis as inputs and employs the developed library of prognosis algorithms to online update degradation models and predict failure times of the system. The third task makes use of the prognosis results (e.g., the distribution of remaining useful life) and considers the cost versus benefits for different maintenance actions to determine when and how the preventive maintenance will be conducted to achieve minimal operating costs and risks. All of these three tasks need to be executed dynamically and in real time.

Other than these three major tasks, there are also some other important components listed in Figure 2. They are often prepared offline, and only timely updating may be needed during system operations. For example, signal processing/feature extraction is the procedure of preprocessing the signals using rules or methods developed according to engineering knowledge, expert experience, or statistical findings from historical data. These procedures serve to eliminate noise, reduce data dimensions (complexity), and transform the data into a proper space for further analysis. Similarly, prognosis and diagnosis algorithms can also be developed offline to cater to the special characteristics of the signals and system properties. Upon the arrival of new sensing signals, appropriate algorithms can be selected to compute the distribution of RUL, determine maintenance actions, or find root causes of abnormalities.

Due to its undeniable importance, recent years have seen vigorous development in different aspects of PHM. The reviews of statistical data-driven approaches by Si et al. [19] and Sikorska et al. [20] have covered most of the models used in RUL estimation with a statistical orientation, whereas our work covers a wider range of models in PHM methodologies, from diagnosis to prognosis, together with their motivations. The subsequent sections discuss the research progress and open issues of these tasks/components in PHM to provide an overview for further advancement.

2.1. Signal Processing and Feature Extraction

In the current data-rich environment, huge amounts of data are often automatically collected within a short time period. In contrast to the situation decades ago, when very limited data were available, this overwhelming amount of data poses new challenges in data management, analysis, and interpretation. Consequently, data preprocessing and feature extraction procedures have become standard in many complex systems to improve data quality, reduce data redundancy, and boost the efficiency of analysis. Due to its importance, many researchers have investigated this problem in the literature, as summarized in several review papers in different application areas (e.g., [21–23]).

Instead of giving a comprehensive review of the different techniques in the literature, in this section we list some of the commonly used methods in the context of PHM. These techniques can be roughly classified into statistical methods and engineering knowledge based methods. In the first category, data can be transformed to optimize certain predetermined criteria without input from domain knowledge. For example, principal component analysis (PCA) and independent component analysis (ICA) have been widely used to reduce data dimensions. Similar to analysis of variance (ANOVA), the distance evaluation technique (DET) [24] is also popular, and variants of it have been applied, such as the two-stage feature selection and weighting technique (TFSWT) via the Euclidean distance evaluation technique (EDET) [25], a modified version of ANOVA that takes into consideration both the difference between the variances in each group and the difference between the maximum and minimum group means. However, certain useful information cannot be retained when dealing with highly nonlinear data, as reported in [26]. Other techniques include mutual information (MI) based methods [27], the self-organizing map (SOM) [28], and density based methods [29]. These techniques work well on nonlinear data and hence are employed broadly in many applications.
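As an illustration of the first category, the following is a minimal sketch (with synthetic data and an assumed 95% explained-variance target) of using PCA to reduce a set of correlated monitoring features to a few components before diagnosis or prognosis.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic example: 200 monitoring snapshots, 30 correlated raw features
# generated from 3 underlying health factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 30))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 30))

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
scores = pca.fit_transform(X)

print("reduced dimension:", scores.shape[1])
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
```

The retained scores, rather than the 30 raw features, would then feed the diagnosis or prognosis models described below.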

Methods in the second category, on the other hand, utilize domain knowledge in the process of feature extraction. Following the procedure of condition-based maintenance, the data can be summarized into three types: value type (e.g., temperature, pressure, humidity), waveform type (e.g., vibration data), and multidimensional type (e.g., image data such as X-ray images) [30].

Waveform data analysis is among the most common methods in the diagnostics of mechanical systems, due to the popularity of waveform data collected from sensors, particularly in the vibration signal analysis of rotating elements [31, 32]. Different kinds of techniques and algorithms have been developed in this field. They can be categorized into time-domain analysis, frequency-domain analysis, and the combination of both. These methods often create features that have clear physical meanings or interpretations. For example, Table 1 summarizes ten commonly used time-domain features [25]. Instead of using the raw waveform data, these ten summary statistics provide extracted information about the signals. These ten features can be applied generally across many applications. However, if the mechanism of how the abnormalities influence the measured signal is known, features extracted based on domain knowledge may be more effective. For example, Lei and Zuo [25] summarized 11 statistical features specifically developed for gear damage detection. Another time-domain analysis approach is time synchronous averaging (TSA), popularly used in fault detection of rotating equipment [33, 34]. The idea is to average the raw signal over a number of revolutions in order to remove/reduce noise. Time series models are naturally applied here as well [35], in an attempt to extract features based on parametric models. For example, the coefficients of a fitted autoregressive moving average (ARMA) model can be indicative of the health condition [36].
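To make the time-domain features concrete, below is a small illustrative sketch that computes several commonly used statistics (RMS, peak, kurtosis, crest factor, shape factor) from a simulated vibration record; the exact feature set and naming in Table 1 and [25] may differ slightly.

```python
import numpy as np

def time_domain_features(x):
    """Compute a few common time-domain condition monitoring features."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))
    std = x.std()
    skewness = np.mean((x - mean) ** 3) / std ** 3
    kurtosis = np.mean((x - mean) ** 4) / std ** 4
    crest_factor = peak / rms
    shape_factor = rms / np.mean(np.abs(x))
    return dict(mean=mean, rms=rms, peak=peak, skewness=skewness,
                kurtosis=kurtosis, crest_factor=crest_factor,
                shape_factor=shape_factor)

# Example: a noisy sinusoid standing in for one vibration record.
t = np.linspace(0, 1, 2048)
signal = np.sin(2 * np.pi * 50 * t) + 0.2 * np.random.randn(t.size)
print(time_domain_features(signal))
```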

Meanwhile, it is believed that some faults will show certain characteristics in the frequency domain. The Fourier transform is the most common form of further signal processing; it decomposes a time waveform into its constituent frequencies. The fast Fourier transform (FFT) is usually used to generate the frequency spectrum from time series signals. A high vibration level at a particular frequency may be the signature of a particular fault type. Besides the FFT spectrum, other methods such as cepstrum [37], high-order spectra [38], and holospectrum analysis [39] have also been developed for fault diagnostics in the frequency domain.
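The following minimal sketch illustrates frequency-domain feature extraction with the FFT on a simulated vibration signal; the sampling rate, tone frequencies, and fault band are assumed values for illustration only.

```python
import numpy as np

fs = 10_000                     # sampling frequency in Hz (assumed)
t = np.arange(0, 1.0, 1 / fs)
# Simulated vibration: a 120 Hz shaft-related tone plus a weak 1850 Hz fault tone.
x = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 1850 * t)
x += 0.2 * np.random.randn(t.size)

spectrum = np.abs(np.fft.rfft(x)) / t.size
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# A simple spectral feature: energy in a band around the suspected fault frequency.
band = (freqs > 1800) & (freqs < 1900)
print("peak frequency (Hz):", freqs[np.argmax(spectrum)])
print("energy in 1.8-1.9 kHz fault band:", spectrum[band].sum())
```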

One limitation of frequency-domain analysis is its inability to handle nonstationary waveform signals, which are commonly observed during machine faults. A combination of both time and frequency domains, time-frequency analysis, has been developed to solve this problem [40, 41]. A typical method is the short-time Fourier transform (STFT) [42], which divides the whole waveform signal into segments with a short-time window and then applies the Fourier transform to each segment. The wavelet transform is another popular method with a similar idea. Wavelet analysis has been successfully applied to feature extraction and fault diagnostics in various applications (e.g., [43–47]). A review of the application of wavelet analysis in machine fault diagnosis and fault feature extraction can be found in Peng and Chu [48]. Other methods of time-frequency analysis include the spectrogram [49], the Wigner-Ville distribution [50–52], and the Choi-Williams distribution [53].

2.2. Fault Diagnostics and Classification

Fault diagnostics is designed to efficiently and accurately identify the root cause of the faults. Effective diagnosis can not only reduce downtime and repair cost, but also provide useful information for prognostics to improve its accuracy. Fault detection is defined to be the task of determining if a system is experiencing problems. Fault diagnostics, then, is the task of locating the source of a fault once it is detected. Because of its importance, researchers from different fields have investigated the issue of fault diagnostics extensively, such as in manufacturing processes [54, 55], discrete event systems [56, 57], and communication systems and networks [58, 59]. We do not attempt to give a comprehensive review but focus on the approaches that are commonly used in PHM practices.

In general, methodologies in fault diagnostics can be classified into two categories: model based approaches and model free approaches. In the model based category, some forms of underlying models linking failure modes and observations are proposed. These models are often derived according to first principles and physical mechanisms. Based on the model structure and parameters, observations can be used to infer the root causes or the failure modes using different algorithms. In contrast, model free methods often do not assume the knowledge of underlying processes. Although in many cases implicit or statistical/surrogate models are used in the fault diagnostics, we use the name to emphasize that approaches in this category are purely data-driven without additional assumptions on the systems’ operation mechanisms.

In fault diagnostics, we would like to know the exact time when a fault appears, its location, and its severity. Therefore, diagnostics consists of three aspects: anomaly detection, in which we first identify any potential performance deviation from normal operation; fault localization, which localizes the problem to a specific component or subsystem; and fault classification, which discriminates between known and unknown faults and identifies the type of the fault if it is previously known [60]. Many popular methods from machine learning and artificial intelligence are applied in this context, such as support vector machines [61, 62], k-nearest neighbors [25, 63], and decision trees [64, 65]. Among them, the artificial neural network has been preferred by many engineers and widely applied to fault diagnostics of various engineering systems [28, 66–69]. Unsupervised machine learning algorithms such as fuzzy c-means and self-organizing maps have been applied when no response variable is available [70–72]. It is worth mentioning that the sensitivity to a given fault is often a function of operating conditions and the nature of the anomaly. Therefore, the environmental conditions need careful consideration. For example, self-organizing maps have been used for regionalization of the system operating conditions [60]. Excellent references for the fundamentals of these methods can be found in Bishop [73], Hastie et al. [74], and Kotsiantis et al. [75].
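As a simple illustration of data-driven fault classification, the sketch below trains a k-nearest neighbors classifier and a support vector machine on a synthetic feature matrix and scores them with cross-validation; the feature values and class structure are fabricated for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic feature matrix: rows are inspection records, columns are extracted
# features (e.g., RMS, kurtosis, band energies); labels are fault classes.
rng = np.random.default_rng(1)
healthy = rng.normal(loc=0.0, scale=1.0, size=(60, 5))
faulty = rng.normal(loc=1.5, scale=1.2, size=(60, 5))
X = np.vstack([healthy, faulty])
y = np.array([0] * 60 + [1] * 60)

for name, clf in [("k-NN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM", SVC(kernel="rbf", C=1.0))]:
    model = make_pipeline(StandardScaler(), clf)   # scale features before fitting
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```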

2.3. Data-Driven Prognostics Method

As mentioned in the introduction, prognostic algorithms predict the future reliability of a product by considering the current and past health information collected. The health information observed through continual inspection is often referred to as condition monitoring (CM) data. CM data may be directly or indirectly related to the system health status and hence can be viewed as system health indicators. Examples of CM data are the amount of tire wear, chemical concentration, size of a fatigue crack, power output of an amplifier, and the light intensity of an LED. As a system degrades inevitably through usage, its health status deteriorates and is manifested through the observed CM data (e.g., the light intensity decreases as the LED degrades). Hence, CM data are normally viewed as the system degradation signal. Failure is often defined as the degradation reaching a predetermined threshold set by experts. Thus, by modeling the evolution of the degradation and calculating the time it first hits the failure threshold, we are able to predict the system remaining useful life (RUL). Due to randomness in the evolution paths of the degradation, the calculated RUL takes the form of a probability distribution. Two excellent comprehensive review papers on RUL research are Si et al. [19] and Sikorska et al. [20]. The main difference between our paper and Si et al. [19] can be understood through the illustration in Figure 2. Our paper describes the entire process of the data-driven approach to PHM, addresses the three PHM objectives (fault diagnostics, prognostics, and condition-based maintenance), and discusses additional prognostic approaches developed in other areas as well as the relationships among different approaches. On the other hand, Si et al. [19] focus mainly on various modeling approaches for prognostics and do not address the tasks before the three PHM objectives. Sikorska et al. [20] focus on evaluating the various prognostic approaches from the industry point of view without details on methodologies.

As data-driven prognostics is a core ingredient of PHM, we summarize its major categories in this section.

2.3.1. Independent Increment Process Based Model

Generally speaking, the stochastic process models (Table 2) consist of two basic components: a stochastic process $\{X(t); t \in T\}$ with initial value $X(0) = x_0$, where $T$ is the time space and $\Omega$ is the state space of the process, and a boundary set $B$, where $B \subset \Omega$. Taking $x_0$ outside the boundary set $B$, the first hitting time (FHT) is the random variable $S$ defined as $S = \inf\{t : X(t) \in B\}$. In most cases, $B$ is simplified as a threshold $D$ (or $d$) and the FHT is the first time when $X(t)$ reaches $D$.

In the stochastic process model, it is supposed that the degradation signal has stationary independent increments, which means that for any time $t$ and interval $\Delta t$, the increment $\Delta X(t) = X(t + \Delta t) - X(t)$ only depends on $\Delta t$ and some other parameters denoted as $\theta$. Usually, $\Delta X(t)$ follows a distribution that possesses the property of additivity. For example, if $\Delta X(t)$ follows a normal distribution $N(\mu \Delta t, \sigma^2 \Delta t)$, then $X(t)$ will follow a normal distribution and $\{X(t)\}$ is a Wiener process. Other typical choices of the distribution are the Gamma distribution and the inverse Gaussian distribution, in which cases $\{X(t)\}$ will correspondingly be called a Gamma process or an inverse Gaussian process.

Wiener Process. A Wiener process can be represented as $X(t) = \mu t + \sigma B(t)$, where $\mu$ is a drift parameter, $\sigma$ is a diffusion coefficient, and $B(t)$ is the standard Brownian motion. The probability density function (PDF) of the first hitting time of a threshold $D$ is the inverse Gaussian distribution $IG(D/\mu, D^2/\sigma^2)$. The process varies bidirectionally over time with Gaussian noise. It only uses the information contained in the current degradation status. Please refer to [76–81] for more information. Recently, there has been new research under this framework that uses the history information given by the entire sequence of observations. These models update the parameters recursively so that the prognostics is history dependent [82, 83].
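A minimal simulation sketch of the Wiener-process FHT is given below; the drift, diffusion, and threshold values are assumed, and the simulated mean first hitting time is compared with the theoretical inverse Gaussian mean $D/\mu$ and variance $D\sigma^2/\mu^3$.

```python
import numpy as np

mu, sigma, D = 0.5, 0.2, 10.0      # drift, diffusion, failure threshold (assumed)
dt, n_paths = 0.01, 2000

rng = np.random.default_rng(2)
hit_times = np.empty(n_paths)
for i in range(n_paths):
    x, t = 0.0, 0.0
    # Euler simulation of X(t) = mu*t + sigma*B(t) until it first reaches D.
    while x < D:
        x += mu * dt + sigma * np.sqrt(dt) * rng.normal()
        t += dt
    hit_times[i] = t

# Theoretical FHT is inverse Gaussian with mean D/mu and shape D^2/sigma^2.
print("simulated mean FHT  :", hit_times.mean())
print("theoretical mean    :", D / mu)
print("theoretical variance:", D * sigma ** 2 / mu ** 3)
```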

Gamma Process. One disadvantage of the Wiener process is that it is not monotone, because of the embedded Brownian motion. For modeling monotonically increasing/decreasing degradation signals, the Gamma process is a better choice. Here, the increment $\Delta X(t)$ for a given time interval $\Delta t$ has a Gamma distribution with shape parameter $\alpha \Delta t$ and scale parameter $\beta$. A Gamma process has monotonic sample paths and can be viewed as the limit of a compound Poisson process whose rate goes to infinity while the jump size tends to zero in proportion. The distribution of the first hitting time of a threshold $D$ follows from the identity $P(S \le t) = P(X(t) \ge D)$. Details of modeling degradation with the Gamma process are given by Singpurwalla [84], Lawless and Crowder [85], and Ye et al. [86], and maintenance related issues are considered by van Noortwijk [87].
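The sketch below simulates Gamma-process degradation paths and estimates the first hitting time of a threshold empirically; the shape, scale, and threshold values are assumed for illustration.

```python
import numpy as np

alpha, beta, D = 2.0, 0.5, 15.0     # shape rate per unit time, scale, threshold (assumed)
dt, horizon = 0.1, 100.0
n_steps = int(horizon / dt)
n_paths = 2000

rng = np.random.default_rng(3)
# Increments over each interval dt follow Gamma(alpha*dt, scale=beta),
# so every sample path is nondecreasing.
increments = rng.gamma(alpha * dt, beta, size=(n_paths, n_steps))
paths = increments.cumsum(axis=1)

# Empirical first hitting time of the threshold D (ignoring paths that never hit).
hit_idx = (paths >= D).argmax(axis=1)
hit_times = (hit_idx + 1) * dt
hit_mask = paths[:, -1] >= D
print("mean FHT (simulated)              :", hit_times[hit_mask].mean())
print("rough crossing time D/(alpha*beta):", D / (alpha * beta))
```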

Inverse Gaussian Process. An inverse Gaussian process with mean function $\Lambda(t)$ and scale parameter $\eta$ has the following properties: the increment $X(t + \Delta t) - X(t)$ has an inverse Gaussian distribution $IG\big(\Lambda(t+\Delta t) - \Lambda(t),\ \eta\,[\Lambda(t+\Delta t) - \Lambda(t)]^2\big)$. Like the Gamma process, the inverse Gaussian process also has monotone paths, and its failure time distribution can be approximated by a Birnbaum-Saunders type distribution, which has excellent properties for further computation. The IG process is relatively new and has not been widely applied in degradation modeling, even though it is more flexible in incorporating random effects and covariates. It was introduced by Wang and Xu [88] to incorporate random effects. The random drift model, random volatility model, random drift-volatility model, and the incorporation of covariates are thoroughly studied by Ye and Chen [89].

2.3.2. Markovian Process-Based Models

Another set of methods is built on memoryless Markov processes. Although Markov processes also belong to the class of stochastic processes, these methods differ from the previously mentioned models in that they assume a finite state space for the degradation and focus on the transition probabilities among those states. The methods in this category have the following major variations.

Markov Chain Model. In general, it is assumed that the degradation process evolves on a finite state space $\Phi = \{0, 1, \ldots, N\}$, with 0 corresponding to the perfect healthy state and $N$ representing the failed state of the monitored system. The RUL at time instant $t$ can be defined as $\mathrm{RUL}(t) = \inf\{h : X(t + h) = N \mid X(t) \ne N\}$. The probability transition matrix and the number of states can be estimated from historical data. By dividing the health status into discrete states such as “Good,” “OK,” “Minor defects,” “Maintenance required,” and “Unserviceable,” the method can provide meaningful results that are easier for field engineers to understand.
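For a discrete-state health model, the expected RUL from each transient state can be computed directly from the transition matrix using the fundamental matrix of an absorbing Markov chain, as in the sketch below; the four states and transition probabilities are assumed values.

```python
import numpy as np

# Assumed 4-state health model: 0 = Good, 1 = Minor defects,
# 2 = Maintenance required, 3 = Unserviceable (absorbing failed state).
P = np.array([[0.90, 0.08, 0.02, 0.00],
              [0.00, 0.85, 0.12, 0.03],
              [0.00, 0.00, 0.80, 0.20],
              [0.00, 0.00, 0.00, 1.00]])

# Expected number of steps to absorption from each transient state:
# N = (I - Q)^{-1}, expected RUL = row sums of N, where Q is the transient block.
Q = P[:3, :3]
N = np.linalg.inv(np.eye(3) - Q)
expected_rul = N.sum(axis=1)
for state, rul in zip(["Good", "Minor defects", "Maintenance required"], expected_rul):
    print(f"state '{state}': expected RUL = {rul:.1f} inspection intervals")
```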

Semi-Markov Processes. A semi-Markov process extends the Markov chain model by including the random time that the process resides in each state. Although the Markov property is generally lost by this extension, the model remains of great practical value. In a semi-Markov model, the first hitting time represents the time that the process resides in the initial and subsequent states before it first enters one of the states that define the set $B$.

Hidden Markov Model (HMM). An HMM consists of two stochastic processes: a hidden Markov chain $\{X(t)\}$, which is unobservable and represents the real state of the degradation, and an observable process $\{Y(t)\}$, which is the observed signal from monitoring. Similar to the Markovian based models, it is assumed that the degradation process evolves according to a Markov chain on a finite state space $\Phi = \{0, 1, \ldots, N\}$. Generally, a conditional probability measure $P(Y(t) \mid X(t))$ is used to link $Y(t)$ and $X(t)$. As such, the RUL at time instant $t$ can be defined as $\mathrm{RUL}(t) = \inf\{h : X(t + h) = N \mid Y(0), \ldots, Y(t)\}$. The model is preferred when only indirect observations are available [90].

2.3.3. Filtering-Based Models

Similar to the HMM, the Kalman filtering model does not use the CM data directly as the true degradation signal. It assumes that the true state of the degradation is unobservable but related to the CM data. The Kalman filtering model considers the unobserved condition $x_t$ and the observed CM data $y_t$, such that $x_t = A x_{t-1} + w_t$ and $y_t = C x_t + v_t$, where $w_t$ and $v_t$ are Gaussian noises and $A$ and $C$ are the parameters of the state space model. The Kalman filtering model takes advantage of all historical data, unlike many methods that only depend on the last CM status. However, the linearity assumption and the Gaussian noise assumption limit its applications. Efforts have been made to overcome these problems (e.g., [91, 92]).
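Below is a minimal scalar Kalman filter sketch for tracking an unobserved degradation state from noisy CM data; the state-space parameters and noise variances are assumed values.

```python
import numpy as np

# Scalar linear-Gaussian state-space model (assumed parameters):
# x_t = a*x_{t-1} + w_t,  y_t = c*x_t + v_t
a, c = 1.02, 1.0
q, r = 0.05 ** 2, 0.5 ** 2            # process / measurement noise variances

rng = np.random.default_rng(4)
T = 100
x = np.empty(T); y = np.empty(T)
x[0] = 1.0
y[0] = c * x[0] + rng.normal(0, np.sqrt(r))
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(0, np.sqrt(q))
    y[t] = c * x[t] + rng.normal(0, np.sqrt(r))

# Kalman filter recursion: predict, then update with the new CM observation.
m, p = 0.0, 1.0                        # prior mean / variance of the hidden state
est = np.empty(T)
for t in range(T):
    m_pred, p_pred = a * m, a * a * p + q          # predict
    k = p_pred * c / (c * c * p_pred + r)          # Kalman gain
    m = m_pred + k * (y[t] - c * m_pred)           # update
    p = (1 - k * c) * p_pred
    est[t] = m

print("RMSE of filtered degradation estimate:", np.sqrt(np.mean((est - x) ** 2)))
```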

2.3.4. Regression Based Model

Methods in this category mostly involve building a parametric evolution path (linear or nonlinear) of the CM data with random effects. Most existing methods in RUL estimation assume that products of the same type or from the same batch have exactly the same failure characteristics probabilistically. While the population behavior can provide some reference, it cannot accurately reflect the health evolution of each individual item, since individual products often experience different usage patterns, distinct environments, or even different quality due to process variations. Consequently, it is crucial to adapt to the health evolution of each individual product rather than perceived group averages for better reliability prediction. In recent years, some methods have been proposed to incorporate the population information with observations from individual items to obtain better RUL estimates.

Meeker and Escobar [93] give an example of linear degradation with a log-normal rate: $X(t) = \beta_0 + \beta_1 t$, where $\beta_0$ is fixed, $\beta_1 \sim \mathrm{LogNormal}(\mu, \sigma^2)$, and $D$ is the predetermined threshold; thus the failure time is $T = (D - \beta_0)/\beta_1$ and its distribution will be $F_T(t) = \Phi\big(\{\ln t - [\ln(D - \beta_0) - \mu]\}/\sigma\big)$.

Lu and Meeker [94] proposed several random coefficients models to describe individual health degradation by considering both the population trend and individual unit characteristics through fixed and random effects, respectively: $y_{ij} = \eta(t_{ij}; \phi, \theta_i) + \epsilon_{ij}$, where $y_{ij}$ is the degradation signal of item $i$ at time $t_{ij}$, $\phi$ is the vector of fixed-effect parameters, and $\theta_i$ is the vector of random-effect parameters for item $i$. Based on (3), the distribution of the failure time, defined as $T_i = \inf\{t : \eta(t; \phi, \theta_i) \ge D\}$, where $D$ is the predetermined threshold, can be computed analytically or numerically. Different examples were illustrated for different degradation models and distributions of random effects in the paper. K. Yang and G. Yang [95] extended the idea and utilized both the lifetime data of failed devices and the degradation information from unfailed ones to improve the model estimation. Along this line, with applications in a variety of fields, other representative works include Yang and Jeang [96], Tseng et al. [97], and Goode et al. [98]. Although in these works the individual-to-individual variation has been considered using random effects, the data from individual items are only used to assess the variability among items in the population and to fit the random-effects distribution $p(\theta)$. The prediction of the failure time is still population-wise, although it considers the variability within the population. In a certain sense, the prediction interval is inflated to cover different degradation paths.

Gebraeel and his colleagues [17, 18, 99] instead developed a Bayesian framework to model the degradation signals and predict the residual life distribution. Different from previous works, the residual life prediction is “customized” based on the data from each individual. For example, the degradation signal can be modeled by $S(t) = \theta + \beta t + \epsilon(t)$, where $\theta$ and $\beta$ are random variables following certain distributions and $\epsilon(t)$ is the measurement error (independent normal or Brownian motion). From historical data, the joint distribution of $\theta$ and $\beta$, denoted by $\pi(\theta, \beta)$, can be estimated as prior knowledge regarding the population characteristics of the degradation. For each new operating individual, the vibration signals are collected and used to update the degradation model using the Bayesian method: $p(\theta, \beta \mid S(t_1), \ldots, S(t_k)) \propto p(S(t_1), \ldots, S(t_k) \mid \theta, \beta)\, \pi(\theta, \beta)$, where the left hand side of (5) represents the posterior distribution of the parameters of the degradation model (4) given the observations up to time $t_k$; the first term on the right hand side corresponds to the likelihood function of the observed data implied by (4) with fixed $\theta$ and $\beta$; and the last term is the (estimated) prior distribution of the model parameters within the population. In other words, the degradation model for each individual is self-updating when new observations become available. It is expected that the predicted failure time based on the posterior degradation model will be more accurate.
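A minimal sketch of this kind of Bayesian updating is given below, assuming a linear degradation model with a Gaussian prior on the intercept and slope and known measurement noise; all numerical values (prior, noise level, threshold) are assumptions for illustration, and the failure-time distribution is obtained by Monte Carlo sampling from the posterior.

```python
import numpy as np

# Bayesian updating for y(t) = theta + beta*t + noise with a Gaussian prior
# on (theta, beta); every numerical value here is assumed for illustration.
sigma = 0.05                                   # measurement noise std
mu0 = np.array([0.2, 0.01])                    # prior mean of (theta, beta)
S0 = np.diag([0.05 ** 2, 0.005 ** 2])          # prior covariance
D = 1.0                                        # failure threshold

rng = np.random.default_rng(5)
t_obs = np.arange(1, 31, dtype=float)          # 30 in-service observations
y_obs = 0.25 + 0.02 * t_obs + rng.normal(0, sigma, t_obs.size)

# Conjugate Gaussian posterior of (theta, beta) given the observed path.
Phi = np.column_stack([np.ones_like(t_obs), t_obs])
Sn = np.linalg.inv(np.linalg.inv(S0) + Phi.T @ Phi / sigma ** 2)
mun = Sn @ (np.linalg.inv(S0) @ mu0 + Phi.T @ y_obs / sigma ** 2)

# Predicted failure time: when theta + beta*t first exceeds D (Monte Carlo).
draws = rng.multivariate_normal(mun, Sn, size=10_000)
theta_d = draws[:, 0]
beta_d = np.maximum(draws[:, 1], 1e-6)         # guard against nonpositive slopes
fail_times = (D - theta_d) / beta_d
print("posterior mean of (theta, beta):", np.round(mun, 4))
print("median predicted failure time  :", np.median(fail_times))
print("90% prediction interval        :", np.percentile(fail_times, [5, 95]))
```

Repeating the update as each new observation arrives reproduces the self-updating behavior described above.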

Due to its simplicity and natural integration, Bayesian framework has continuously been investigated in the literature to provide more accurate degradation modeling and failure prognostics. For example, Xu and Zhou [100] studied the modeling and prognostics using general nonlinear degradation paths, where the Bayesian based model estimation and Monte Carlo based failure time prediction were presented. Chen and Tsui [101] extended this to a two-phase model, which allows for a change point of the linear degradation. This model captures the deterioration of the bearing indicated by the vibration signal. Other references on this topic are Gebraeel and Lawley [102] and Si et al. [82, 83].

2.3.5. Proportional Hazard Model

The proportional hazard model [103] has been extensively studied in various areas. A proportional hazard model with time-dependent variable(s) is able to incorporate both event data and CM data, which can be particularly useful in cases of uncertain failure thresholds or hard failures [104–106]. The model assumes the following form for the hazard rate: $h(t; \mathbf{z}(t)) = h_0(t) \exp(\boldsymbol{\gamma}^{\top} \mathbf{z}(t))$, where $h_0(t)$ is the baseline hazard rate, $\boldsymbol{\gamma}$ is a vector of coefficients, and $\mathbf{z}(t)$ contains the time-dependent variables. $h_0(t)$ can be either parametric (e.g., Weibull) or nonparametric, and the model parameters can be estimated by the maximum likelihood method. In this model, the condition monitoring data are viewed as time-dependent covariates in $\mathbf{z}(t)$. The system failure distribution can be calculated based on (6).
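The sketch below evaluates a proportional hazard model numerically, using an assumed Weibull baseline hazard and a single assumed time-dependent covariate, and obtains the survival curve by integrating the hazard.

```python
import numpy as np

# Assumed Weibull baseline and covariate coefficient, for illustration only.
shape, scale = 1.8, 500.0
gamma = 0.8

def baseline_hazard(t):
    return (shape / scale) * (t / scale) ** (shape - 1)

def covariate(t):                               # e.g., a slowly rising CM signal
    return 0.002 * t

t_grid = np.linspace(1e-6, 1000, 5000)
hazard = baseline_hazard(t_grid) * np.exp(gamma * covariate(t_grid))

# Survival function from the cumulative hazard (trapezoidal integration).
cum_hazard = np.concatenate([[0.0], np.cumsum(np.diff(t_grid) *
                             (hazard[:-1] + hazard[1:]) / 2)])
survival = np.exp(-cum_hazard)

# Median life: first time the survival probability drops below 0.5.
print("median life (time units):", t_grid[np.argmax(survival < 0.5)])
```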

2.3.6. Threshold Regression Model

The parent process and boundary set of the FHT model will both generally have parameters that depend on covariates varying across individuals, typically through regression link functions such as $g(\theta) = \mathbf{z}\boldsymbol{\beta}$, where $\mathbf{z}$ is the covariate vector and $\boldsymbol{\beta}$ the regression coefficients [107]. Cox proportional hazards regression is, for most purposes, a special case of threshold regression [108]. Any family of proportional hazard functions can be generated by varying the time scales or boundaries of a TR model, subject only to mild regularity conditions. There is a connection between the shape of the hazard function (HF) and the type of failure mode (cause of failure). For example, an increasing HF corresponds to aging/wear-out, while a decreasing HF generally suggests a mixture of defective or other weak units leading to infant mortality.

2.4. Condition Based Maintenance

Maintenance is defined as a set of activities or tasks used to restore an item to a state in which it can perform its designated functions. Maintenance strategies can be broadly classified into corrective maintenance and preventive maintenance strategies [109, 110]. In corrective maintenance, maintenance activities are carried out only after a failure happens, so it should only be used for noncritical systems. Preventive maintenance, on the other hand, tries to prevent failures from happening by using either predetermined maintenance, such as time-based maintenance, or condition-based maintenance (CBM). An example of predetermined maintenance is the commonly suggested practice of changing engine oil every 3,000 miles or three months (whichever comes first), regardless of the actual oil condition. In recent decades, some companies such as GM have developed oil life monitoring systems that allow car owners to change oil only when necessary (e.g., [111]). Such systems are an excellent example of CBM implementation. In recent years, CBM has become one of the most actively discussed maintenance techniques in the literature and falls well within the framework of PHM.

CBM is a maintenance program that makes maintenance decisions based on the information collected about the underlying system, allowing maintenance activities to be performed only when necessary. The dominating objective of CBM in literature is to minimize the cost for maintenance activities. It is worth pointing out that, in some literature, the term “condition based maintenance” has a broader definition that also involves the preceding steps of data manipulation, diagnosis, and prognosis (e.g., [30]). In this section, we restrict our discussion to maintenance decision-making in CBM.

The underpinning assumption of the PHM framework is that systems are subject to stochastic deteriorations. The natural choice of maintenance strategy for stochastic deteriorating systems is called “control-limit policy,” or “failure limit policy” [110, 112], where maintenance activities are conducted when the system deterioration reaches a certain level. Under such policy, prognostic results on system deterioration can be used for maintenance decision-making. The control-limit policy has been shown to be the optimal replacement rule for systems with increasing deteriorations when considering the average long-run cost per unit time [113]. Existing work on CBM can be classified in several ways depending on the nature of the system and the assumptions they make, which are whether the system health condition is completely observable or partially observable; whether the condition monitoring is continuous or intermittent; whether the maintenance program deals with single component or multiple components. Note that these are not mutually exclusive and a single work usually falls into multiple categories. Below we discuss these topics in more detail.
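To illustrate how a control-limit policy can be evaluated, the following sketch simulates a Gamma-process degradation inspected periodically and compares the long-run cost rate for several candidate control limits; the degradation parameters, inspection cost, and replacement costs are all assumed values.

```python
import numpy as np

# A minimal simulation of a control-limit CBM policy under periodic inspection.
rng = np.random.default_rng(6)
alpha, beta = 1.0, 0.5          # gamma-process degradation per inspection interval
D = 20.0                        # failure threshold
c_insp, c_prev, c_corr = 1.0, 20.0, 100.0   # inspection / preventive / corrective costs

def cost_rate(limit, n_cycles=2000):
    """Average cost per inspection interval over many replacement cycles."""
    total_cost, total_time = 0.0, 0.0
    for _ in range(n_cycles):
        x, t = 0.0, 0
        while True:
            x += rng.gamma(alpha, beta)      # degradation accrued this interval
            t += 1
            total_cost += c_insp
            if x >= D:                       # failure found at inspection
                total_cost += c_corr
                break
            if x >= limit:                   # preventive replacement triggered
                total_cost += c_prev
                break
        total_time += t
    return total_cost / total_time

for limit in [10, 14, 16, 18, 19.5]:
    print(f"control limit {limit:5.1f}: cost rate = {cost_rate(limit):.2f}")
```

Comparing the printed cost rates mimics how an optimal control limit is chosen under a cost-based criterion.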

In condition monitoring, the system health condition can be either completely or partially observed/identified. The system health information obtained in the former case is called direct information, while that in the latter case is called indirect information. While this distinction is critical for the degradation modeling and prognosis reviewed earlier in this paper, it has no major impact on maintenance decision-making. For this reason, we do not discuss this issue again here; interested readers are referred to Jardine et al. [30], which summarizes many works in this regard.

Depending on the budget constraints and/or technologies used, condition monitoring can be either continuous or intermittent (periodic or aperiodic); the latter is also known as interval inspection. The case of CBM with intermittent condition monitoring has been studied extensively in the literature, primarily due to its wide implementation in practice. The important decision variables are the control limit/critical level and the inspection interval; optimal critical levels and inspection intervals are found based on criteria that are mostly cost based [114–117]. In some works, critical levels are assumed to be predetermined by expert knowledge and only optimal inspection strategies are studied [118–120]. To provide more refined maintenance policies that minimize cost, some researchers consider multiple control limits. For example, Castanier et al. [121] used different thresholds for inspection scheduling, partial repair, preventive replacement, and restarting for repairable systems. With the development of sensing and information technology, continuous condition monitoring has become available at reasonable costs in many applications [122]. Compared with interval inspection, the research in this area is relatively new but has become increasingly popular. The fundamental difference of CBM with continuous monitoring is that the real-time system information allows maintenance decisions to be made at any time, and hence there is a greater chance of optimizing the set criteria [123–128].

While the majority of the existing work deals only with a single component, some researchers have extended CBM to maintenance decision-making for multiple components in a system. The rationale for developing multicomponent maintenance policies is that there are economic dependencies among multiple components [129, 130]. High fixed maintenance costs, such as sending a maintenance team to a remote wind farm, can be mitigated by replacing/repairing multiple components simultaneously [123, 126, 131, 132].

3. Illustrative Case Studies

In this section, we use three examples to illustrate the implementation of PHM.

3.1. Fault Diagnosis on Gear Crack Development

Gearboxes are one of the most commonly used parts in machinery. Diagnosis of gear faults is crucial for preventing system malfunction. In this example, we demonstrate fault diagnosis and classification for identifying different development stages of cracks on gears in gearboxes [25, 133]. A gearbox test rig is shown in Figure 3, where gear #3 is the tested gear with potential cracks. Three different types of gears are tested: 0% crack level, 25% crack level, and 50% crack level, as shown in Figure 4. The vibration signal is measured using accelerometers under various working conditions: 3 levels of load from the magnetic brake (no load, half the maximum load, and maximum load) and 4 levels of motor speed (1200 rpm, 1400 rpm, 1600 rpm, and 1800 rpm). Three data samples are obtained under each combination of the two factors. Hence we obtain 36 data samples for each crack level. The data are then used for gear crack detection and crack level classification.

All ten time-domain features listed in Table 1 are calculated as potential candidate features. Six features, namely peak, mean, root mean square (RMS), skewness, kurtosis, and shape factor, are selected by ANOVA and TFSWT. After feature selection, three methods are applied to classify the three levels of gear cracks, namely, the multinomial logit model (MLM), the cumulative link model (CLM), and weighted k-nearest neighbors (WKNN). Interested readers are referred to Lei and Zuo [25] and Hai et al. [133] for technical details. To assess the performance of the methods, a leave-one-out cross-validation approach is used; the classification accuracies for MLM, CLM, and WKNN are 98.1%, 94.4%, and nearly 100%, respectively. The test results demonstrate that these methods can accurately identify the crack development of gears, which is very beneficial for early warning of potential gearbox malfunctions.

3.2. Predicting RUL of Rotational Bearings

In this section, we return to the motivating example in Section 1, which aims to predict the RUL of bearings. As the purpose of PHM is mainly to provide individualized prediction results, it is necessary to adapt the model to the specific characteristics of each bearing, revealed through past observations. A natural choice is to use a Bayesian framework to integrate the prior information from other bearings with the observations of the in-service unit. By selecting conjugate prior distributions, model updating can be done efficiently: $p(\theta \mid y_{1:k}) \propto p(y_{1:k} \mid \theta)\, \pi(\theta)$, where $\theta$ denotes the model parameters, $y_{1:k}$ are the observed vibration magnitudes up to time $t_k$, $p(y_{1:k} \mid \theta)$ is the likelihood function given the parameters, $\pi(\theta)$ is the prior distribution carrying information from historical bearing samples, and $p(\theta \mid y_{1:k})$ is the posterior distribution integrating the prior information and the current observations. As new observations are collected, model updating can be done repeatedly, and correspondingly the predicted failure time will also be updated: $p(y_{t'} \mid y_{1:k}) = \int p(y_{t'} \mid \theta)\, p(\theta \mid y_{1:k})\, d\theta$, where $y_{t'}$ is the prediction of the vibration magnitude at some future time $t'$. As $k$ increases, the prediction becomes more and more accurate, with smaller variance, as demonstrated in Figure 5.

In this example, 25 bearings are tested and Figure 6 shows the prediction intervals of the failure times. The x-axis is the index of the bearings used in the experiments. The circles with the same x-axis value represent the 0.05, 0.5, and 0.95 quantiles of the failure time, and the cross shows the true failure time. The results show that the prediction based on the above algorithm is acceptably accurate. In certain cases, the prediction interval is very tight, providing very informative warnings of potential failures.

3.3. Predicting RUL of Lithium-Ion Batteries Using Particle Filter

Lithium-ion batteries are widely used in consumer electronics as their sole power sources, and they are critical to the functioning of these devices. As a battery ages, its capacity degrades, and the capacity is widely used as an indicator of the battery’s health. As a common rule, a battery is considered incapable of functioning as intended when its capacity drops to 80% of its initial value. In this example, a particle filtering (PF) based prognostic algorithm is used to predict the RUL of lithium-ion batteries based on accelerated testing data of six batteries [134].

In the experiment, batteries were tested with full charging and discharging cycles, under the constant-current/constant-voltage mode. The discharge rate was set to 1C, which meant the battery would be fully discharged in one hour. The experiment was conducted under room temperature, and discharge capacity was calculated based on integrating current over time for each cycle. Figure 7 shows the capacity degradation process of one testing battery.

As capacity is used as the default health indicator of batteries, no feature selection is needed, and we proceed directly to degradation modeling and prognosis. The degradation curve is assumed to follow a parametric model $C_k = g(k; \boldsymbol{\eta})$, for example a double-exponential form $C_k = a e^{b k} + c e^{d k}$, where $C_k$ is the battery capacity at the $k$th cycle and the components of $\boldsymbol{\eta}$ (here $a$, $b$, $c$, $d$) are the model parameters. For accurate estimation of $C_k$ and dynamic updating of the model parameters for better tracking, a PF approach is used. In the PF, the state-space model for tracking has a process function $f$ and a measurement function $g$: $\boldsymbol{\eta}_k = f(\boldsymbol{\eta}_{k-1}) + \mathbf{w}_k$ and $y_k = g(k; \boldsymbol{\eta}_k) + v_k$, where $y_k$ is the observed capacity, $\boldsymbol{\eta}_k$ is the collection of all model parameters estimated at the $k$th step, and $\mathbf{w}_k$ and $v_k$ are two i.i.d. noise sequences. In (10), $f$ and $g$ actually define the conditional distributions $p(\boldsymbol{\eta}_k \mid \boldsymbol{\eta}_{k-1})$ and $p(y_k \mid \boldsymbol{\eta}_k)$, respectively. The recursive Bayesian filtering is then carried out via $p(\boldsymbol{\eta}_k \mid y_{1:k}) \propto p(y_k \mid \boldsymbol{\eta}_k) \int p(\boldsymbol{\eta}_k \mid \boldsymbol{\eta}_{k-1})\, p(\boldsymbol{\eta}_{k-1} \mid y_{1:k-1})\, d\boldsymbol{\eta}_{k-1}$. Equation (11) provides a recursive way to update the distribution of $\boldsymbol{\eta}_k$ with newly observed values. For prognosis, the battery capacity $l$ steps ahead, $C_{k+l}$, can be estimated by projecting the model to all its possible future paths based on $p(\boldsymbol{\eta}_k \mid y_{1:k})$. Finally, the RUL distribution is calculated through $P(\mathrm{RUL}_k \le l) = P(C_{k+l} \le 0.8\, C_0)$, where $C_0$ is the initial capacity of the battery.
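A minimal bootstrap particle filter sketch in this spirit is given below. For brevity it tracks a single-exponential capacity model rather than the full model above, and all numerical values (true parameters, noise levels, particle counts) are assumptions for illustration.

```python
import numpy as np

# Bootstrap particle filter for a simplified capacity model C_k = a * exp(b * k).
rng = np.random.default_rng(7)

# Simulated "true" capacity data (Ah) over charge-discharge cycles.
a_true, b_true, noise = 1.1, -0.0005, 0.01
cycles = np.arange(1, 301)
capacity = a_true * np.exp(b_true * cycles) + rng.normal(0, noise, cycles.size)

n_p = 2000
particles = np.column_stack([rng.normal(1.1, 0.05, n_p),        # a
                             rng.normal(-0.0005, 0.0002, n_p)]) # b
weights = np.full(n_p, 1.0 / n_p)

for k, y in zip(cycles, capacity):
    # Process step: small random walk on the parameters.
    particles += rng.normal(0, [1e-4, 2e-6], size=particles.shape)
    # Measurement update: weight by the Gaussian likelihood of observing y.
    pred = particles[:, 0] * np.exp(particles[:, 1] * k)
    weights *= np.exp(-0.5 * ((y - pred) / noise) ** 2)
    weights /= weights.sum()
    # Resample when the effective sample size degenerates.
    if 1.0 / np.sum(weights ** 2) < n_p / 2:
        idx = rng.choice(n_p, n_p, p=weights)
        particles, weights = particles[idx], np.full(n_p, 1.0 / n_p)

# RUL prediction at the last tracked cycle: project each particle forward
# until capacity drops below 80% of the initial value.
threshold = 0.8 * capacity[0]
k_future = np.arange(cycles[-1], cycles[-1] + 2000)
proj = particles[:, [0]] * np.exp(particles[:, [1]] * k_future)   # (n_p, horizon)
rul = (proj < threshold).argmax(axis=1)          # cycles beyond the last tracked one
print("median RUL (cycles):", np.median(rul[rul > 0]))
```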

For demonstration, four batteries are used. Three of them are used for initializing the model parameters and the last one is used for testing. The predictions are made at 1/3, 2/3, and 4/5 of the battery’s life by treating the data of subsequent cycles as unknown to the algorithm. The results are shown in Figure 8. It can be seen that the algorithm can track the observed capacity sequence very well. As expected, the prediction results are better and the RUL PDF is narrower at the later stage of the battery’s life.

4. Conclusions

PHM is a framework that offers a complete set of tools for managing system health with individualized solutions. In this paper, we have reviewed methodologies in all major aspects of the PHM framework, namely, signal processing and feature extraction, fault diagnosis and classification, fault prognosis, and condition based maintenance. As can be seen, PHM involves many subareas and hence a huge body of literature. These areas are at very different stages of development. While areas such as signal processing and feature extraction have long been studied, system failure prognosis based on condition monitoring is still in its infancy. Some subareas had already been extensively studied long before the concept of PHM emerged. Excellent surveys have already been conducted in some subareas, for example, in failure prognosis [19] and condition-based maintenance [30]. Therefore, instead of focusing on an extensive literature review of all subareas, we have taken a holistic view to summarize the mainstream methods in them, the role of each area in PHM, and their relationships.

Data-driven methodologies in PHM are closely related with those in some other major research directions, such as statistical quality control, reliability engineering, and design of experiments. It is worthwhile to briefly discuss their relations with PHM.

4.1. Statistical Quality Control

Statistical quality control is an area that has been extensively studied for many decades. The main objective is to detect abnormalities or changes in a process. It is generally applied to a large number of homogeneous units and focuses on identifying the abnormal ones which may be traced back to process faults. PHM, on the other hand, focuses more on how faults happen and how to predict future faults so that optimal maintenance policy can be made, rather than fault detection. Furthermore, research in PHM focuses more on individual behaviors along time instead of cross-sectional analysis on the population characteristics.

4.2. Reliability Engineering

The research in PHM is closely related with those in reliability engineering, such as failure prediction and maintenance. Many methods in PHM stem from those originally developed in reliability engineering. However, they have different focuses of interests. Traditional reliability engineering focuses on the modeling and prediction of the entire product population, without much emphasis on variability of the individuals and their respective working conditions. Therefore, reliability engineering is most valuable for manufacturer’s product design and warranty policy making where population characteristics are crucial, while PHM is most valuable for end users who care more about the specific units they have on hand.

4.3. Design of Experiments

Compared with PHM, which emphasizes online monitoring and dynamic updating, design of experiments (DOE) is an offline methodology, and the two are implemented at different stages. DOE is applied mostly during the system planning and design phase, rather than the operating phase where PHM is applied. Tools in DOE are mainly used to analyze relationships between factors and system response(s), which can be very useful for variable selection, enhancing system robustness, and design optimization.

As promising as PHM is, its application in the real world is still scarce at the current stage. Compared with traditional fault diagnosis and maintenance programs, PHM has higher initial costs and higher requirements for field workers. Limited research in prognosis and CBM is also a major hurdle to its wide application. To achieve cost-effective, robust, and easy-to-implement solutions so that PHM can be applied to more real-world applications, there are many challenges as well as research opportunities. These include, but are not limited to, the development of robust yet low-cost sensing technologies for online monitoring; the development of more computationally efficient techniques for dealing with high-volume data; the development of specialized signal processing and feature extraction/selection techniques optimized for condition monitoring and failure prognosis; the development of more accurate prognostic methods that can deal with multiple CM signals and multiple system failure modes; and the development of versatile CBM strategies that are capable of handling complex situations such as multicomponent systems, possibly with different maintenance levels and multiple optimization criteria.

Although there are still many hurdles to clear before PHM can be widely implemented in real-world engineering applications, its promising future has increasingly attracted researchers and engineers from related fields. However, most of their research work is scattered across the respective areas, without much collaboration under the holistic framework of PHM. As suggested by Lee et al. [135], the authors believe an integrated platform of diagnostics, prognostics, and maintenance will be the future trend of PHM.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The work of Y. Hai, K. L. Tsui, and Q. Zhou was supported by the Hong Kong Research Grant Council General Research Fund Project no. 11216014 and the Natural Science Foundation of China Project no. 11471275. It was also partially supported by the NSFC under Grant nos. 71231001 and 71420107023, by a China Postdoctoral Science Foundation funded project under Grant 2013M530531, by the Fundamental Research Funds for the Central Universities of China under Grant nos. FRF-MP-13-009A and FRF-TP-13-026A, and by the MOE PhD Supervisor Fund, 20120006110025.