#### Abstract

This paper addresses the predictive maintenance (PM) problem of a single-equipment system. The equipment is assumed to deteriorate through a sequence of quality states as it operates, resulting in multiple yield levels that serve as the observed system states. We model the equipment deterioration as a discrete-state, continuous-time semi-Markov decision process (SMDP) and solve the SMDP problem in a reinforcement learning (RL) framework using a strategy-based method. The goal is to maximize the system average reward rate (SARR) and to generate the optimal maintenance strategy for each observed state. Further, the PM time can be produced by a simulation method. To demonstrate the advantage of the proposed method, we introduce the standard sequential preventive maintenance algorithm with unequal time intervals as a baseline and compare the two methods on the objective of SARR; the results show that the proposed method outperforms the sequential preventive maintenance algorithm. Finally, a sensitivity analysis of the main parameters with respect to the PM time is given.

#### 1. Introduction

In real production systems, equipment deterioration with use, age, and other causes is almost universal. If maintenance is not performed, failure or severe malfunction will eventually occur. Operating equipment in a deteriorated state often brings higher production cost and lower product quality, so an effective maintenance policy is essential in industrial practice. Periodic or age-based preventive maintenance often leads to under-maintenance or over-maintenance; over-maintenance causes unnecessary interference to production, decreasing production efficiency and increasing production cost. Condition-based maintenance decides whether a maintenance action should be performed according to the current system state [1]. The more valuable issue, however, is to determine the future maintenance time given the current system state, which is what we call PM in this paper.

Compared with condition-based maintenance, there is little theoretical and practical research on PM in the strict sense [2]. Some literature classifies condition-based maintenance as PM, but the truly "predictive" aspect of such decisions, namely anticipating the future state of the equipment, is not reflected. Few true PM methods can schedule an optimal future maintenance time by considering the deteriorating equipment condition. Existing methods mainly classify equipment states into two types, operational or failed, and the goal of PM is then only to predict the residual life [3–8]. For example, Sikorska et al. review a large body of literature on prediction models, which are mainly used to predict residual equipment life [9]. Jan et al. evaluate the current state and predict the residual life of industrial equipment with a hidden semi-Markov model [10]. Schwendemann et al. predict the residual life of bearings in grinding equipment while taking a more global view of the optimization problem, including costs and time [11].

Moreover, in real industrial systems, such as semiconductor production and precision instruments, the deteriorating equipment states are closely related to the quality levels of the products [2]. Based on extensive industrial practice, General Motors researchers have pointed out the important potential of correlating operation management, including maintenance decisions, with product quality to improve the performance of manufacturing systems [12]. Before the equipment breaks down, it can still operate in a deteriorating quality state, but the probability of producing unqualified products increases [13]. For a long time, maintenance and quality have been treated as two relatively independent research fields, and scholars and practitioners have done a great deal of work in each; the correlation between equipment maintenance and product quality, however, is still a relatively new field. Existing literature and industrial practice usually assume Bernoulli or persistent quality problems [13, 14], whereas multiple yield quality problems are more realistic and general and therefore deserve deeper research. Multiple yield quality problems refer to product quality problems that occur independently but with stagewise probability levels; the stagewise levels arise because the equipment gradually deteriorates through multiple quality states. For multiple yield quality problems, a balance between production and maintenance is needed, and there is no simple, direct maintenance decision. In addition, related research on equipment maintenance often assumes that the production time and the maintenance time are unit times, and strong assumptions are also made about the equipment deterioration mode [15]. Maintenance decisions based on these assumptions lack a realistic basis.

Therefore, we claim that it is of great significance to make maintenance decisions that take quality inspection data into account, which keeps costs down and meets the needs of industrial production management. Studies that jointly consider production, maintenance, and quality are relatively scarce, and no effective solution method has been found in the existing literature. We attempt to solve the equipment maintenance problem in production practice. Since the deteriorating equipment states cannot be observed directly, the large amount of real-time quality inspection information is used as implicit information. A discrete-state, continuous-time SMDP with multiple yield stages is introduced to describe the equipment deterioration process; notably, the production and maintenance times are random variables following general distributions, which reflects realistic conditions. A strategy iteration-based RL method is put forward to guarantee the optimal strategy solution of the model. Furthermore, the future maintenance time corresponding to each observed state can be produced by a simulation method based on the fixed maintenance strategy, and the influence of the main technical parameters on the optimization goal of the system is analyzed. Finally, the advantages of the proposed RL method for solving such dynamic-environment problems are revealed by comparison with the sequential preventive maintenance algorithm with unequal time intervals.

#### 2. Problem Description

This paper investigates deteriorating equipment that has multiple discrete states. Assume that the equipment condition can be directly reflected by condition monitoring measures such as the yield levels. A single type of product is produced, and each processed product is immediately inspected and identified as qualified or unqualified. The inspection time and inspection cost are assumed to be zero. Due to faults of the inspection equipment, the proficiency of the inspection workers, and other causes, there are certain inspection errors in product quality inspection. The inspection errors are mainly divided into two types [16]:

(i) Type I error: a false detection, with probability *e*_{1} and cost *C*_{e1}. The parameter *C*_{e1} includes the production cost per unit product and other related costs.

(ii) Type II error: a missed detection, with probability *e*_{2} and cost *C*_{e2}. The parameter *C*_{e2} includes the production cost per unit product and other possible costs, such as those arising from quality and safety issues, which can be far beyond the production cost.

In addition, under accurate inspection, producing a qualified product yields a profit, and producing an unqualified product incurs a cost *R*_{d}.
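To make the reward structure concrete, the following sketch computes the immediate reward of producing and inspecting one product. This is a minimal illustration, not the paper's implementation: all numeric values are hypothetical, and `R_p` is a placeholder name for the qualified-product profit, whose symbol is not given above.

```python
import random

def inspect_reward(qualified, e1=0.05, e2=0.05,
                   R_p=50.0, R_d=20.0, C_e1=30.0, C_e2=100.0):
    """Immediate reward of producing and inspecting one product.

    `qualified` is the true (unobserved) product quality. R_p is a
    hypothetical symbol for the qualified-product profit; the error
    costs follow the Type I / Type II description in the text.
    """
    if qualified:
        # Type I error: falsely reject a qualified product.
        return -C_e1 if random.random() < e1 else R_p
    # Type II error: miss an unqualified product.
    return -C_e2 if random.random() < e2 else -R_d
```

Setting an error probability to 0 or 1 makes the outcome deterministic, which is convenient for checking the cost bookkeeping.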

#### 3. System Model

The sequential decision-making problem under uncertain conditions can be solved by analyzing a Markov process, and a large number of studies on this issue can be found in stochastic dynamic programming and related literature [17–22]. In many of these studies, however, Markov chains cannot capture basic characteristics of the probability structure, such as a general distribution of the sojourn times in each quality state. Such problems are then often described as SMDPs, since an SMDP represents a more realistic situation and is better suited to model the deteriorating process of the equipment.

We employ a discrete-state, continuous-time SMDP model to represent the deteriorating process of the single-equipment system, as shown in Figure 1. Since the yield level *y*_{kl} cannot be obtained directly, the inspection information *s* = (*k*, *p*, *b*) is used as the observed system state, where *k* is the number of subcycles in the current production-maintenance cycle, *p* is the number of products produced since the equipment was last maintained or repaired, and *b* is the number of unqualified products among them. The action space is denoted as *A*(*s*) = {0, 1, 2}, where *a* = 0 means keeping the equipment operating and producing new products; *a* = 1 means stopping the equipment and performing an imperfect (minor) maintenance action (MM in Figure 1); and *a* = 2 is the major repair performed in the event of a failure or random breakdown of the equipment (MR in Figure 1). In the deteriorating process of the equipment, the decision points for maintenance actions are the time points at which new products are produced and inspected. Performing the MM action restores the yield level of the equipment to a certain intermediate state (e.g., *y*_{21}), after which the (*k* + 1)'th subcycle is initiated. The subcycles continue until a certain yield level limit is reached or a stochastic malfunction occurs. At that point, the major repair is triggered to restore the yield level of the equipment to the best state (e.g., *y*_{11}), and then another renewal cycle is initiated.
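The observed state *s* = (*k*, *p*, *b*) and the three actions can be represented, for illustration, as follows. This is a bookkeeping sketch only, assuming (as in Figure 1) that MM starts subcycle *k* + 1 and MR renews the system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObsState:
    k: int  # subcycle index within the current production-maintenance cycle
    p: int  # products produced since the last maintenance or repair
    b: int  # unqualified products observed among them

KEEP, MM, MR = 0, 1, 2  # action space A(s) = {0, 1, 2}

def after_inspection(s: ObsState, defective: bool) -> ObsState:
    """Bookkeeping after producing and inspecting one product (a = 0)."""
    return ObsState(s.k, s.p + 1, s.b + int(defective))

def after_maintenance(s: ObsState, action: int) -> ObsState:
    """MM starts subcycle k + 1; MR renews the system (first subcycle)."""
    if action == MM:
        return ObsState(s.k + 1, 0, 0)
    if action == MR:
        return ObsState(1, 0, 0)
    return s  # KEEP: the state changes only via after_inspection
```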

In general, the equipment in a production system deteriorates as its condition worsens, which leads to shorter sojourn times in each quality state. Therefore, this paper assumes that the sojourn time *λ*_{kl} under each yield level *y*_{kl} follows a gamma distribution Γ(*α*_{kl}, *β*), and that an increase of *l* decreases *λ*_{kl}; that is, *α*_{k,l+1} = *b*_{s}*α*_{kl} (0 < *b*_{s} < 1). Meanwhile, the stochastic malfunction time interval under the *k*'th subcycle is also assumed to follow a gamma distribution, whose shape parameter likewise decreases so that the random failure time interval gradually shortens, as presented by the following equation:
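As a sketch of this assumption, the snippet below samples the sojourn times of one subcycle from gamma distributions whose shape parameter shrinks by the factor *b*_{s} at each yield level. The parameter values are illustrative, and *β* is treated here as a scale parameter.

```python
import random

def sojourn_times(alpha_11=8.0, beta=1.0, b_s=0.8, L=4):
    """Sample the sojourn time lambda_{k,l} in each of the L yield levels
    of one subcycle: lambda_{k,l} ~ Gamma(alpha_{k,l}, beta), with
    alpha_{k,l+1} = b_s * alpha_{k,l} (0 < b_s < 1), so the expected
    sojourn time shrinks as the yield level deteriorates.
    All values are illustrative, not the paper's."""
    alpha, times = alpha_11, []
    for _ in range(L):
        times.append(random.gammavariate(alpha, beta))  # beta as scale
        alpha *= b_s
    return times
```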

#### 4. Policy Iteration-Based PM Method

Model-free RL comprises two families of algorithms: value iteration-based and strategy iteration-based. For SMDP problems, however, the value iteration-based RL algorithm is not suitable, mainly because it cannot guarantee an optimal solution for average-reward SMDP problems [23]. In contrast, the strategy iteration-based RL algorithm can obtain accurate and satisfactory results. Therefore, this paper adopts an average-reward, strategy iteration-based RL method to solve our problem, and the optimal maintenance strategy that maximizes the SARR is derived.

##### 4.1. Q-P Learning Algorithm

The RL technique approaches the optimal strategy of the SMDP model through strategy iteration and learns the mapping from environment states to actions by trial and error, so as to maximize the cumulative SARR obtained from the environment [23]; namely,

The Q-P learning algorithm can accurately solve SMDP problems based on average cumulative rewards. In each decision cycle, the current state *s* transitions to state *s*′ under decision *a*, and the updating expression is as follows [23]:

where *r*(*s*, *a*, *s*′) is the total immediate reward when state *s* transitions to state *s*′ under action *a*_{j} (*j* = 1, 2); *t*(*s*, *a*, *s*′) is the corresponding transition interval time; and *ρ* is the reward rate, which can be obtained by the following equation [24]:
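A minimal sketch of this update, assuming the common average-reward SMDP form Q(s, a) ← (1 − α)Q(s, a) + α(r − ρt + max_{a′} Q(s′, a′)) used in Q-P-style learning; the paper's exact equation is referenced but not spelled out above, so this form is an assumption:

```python
def update_q(Q, s, a, r, t, s_next, actions, rho, alpha):
    """One Q-value update for the average-reward SMDP:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r - rho * t + max_a' Q(s',a')).
    r and t are the reward and sojourn time of the observed transition,
    and rho is the current estimate of the average reward rate."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = (1 - alpha) * old + alpha * (r - rho * t + best_next)
    return Q[(s, a)]
```

Note how the sojourn time *t* scales the reward-rate term: long transitions are penalized by ρ·t, which is what distinguishes the SMDP update from ordinary discrete-time Q-learning.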

*α* is defined as the learning rate, and it decreases according to the following rule [23]:

where *n*_{max} is a large positive integer and *α*_{0} is the initial value of *α*, set here to *α*_{0} = 0.1. It should be noted that the value of *α*_{0} has a certain influence on the final convergence of the RL algorithm; see [25] for details. The visit factor represents the number of visits. In addition, the immediate rewards *r*(*s*, *a*, *s*′) caused by state transitions are as follows:

(i) *r*(*s*, *a*, *s*′) = the profit of producing a qualified product

(ii) *r*(*s*, *a*, *s*′) = −*C*_{e1}: the loss of a Type I error

(iii) *r*(*s*, *a*, *s*′) = −*R*_{d}: the production cost per unit

(iv) *r*(*s*, *a*, *s*′) = −*C*_{e2}: the loss of a Type II error

(v) *r*(*s*, *a*, *s*′) = −*C*_{R}: the loss of a major repair

(vi) *r*(*s*, *a*, *s*′) = −*C*_{M}: the loss of a minor maintenance
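The exact decay rule is given by the referenced equation; a common visit-count-based choice that starts at *α*_{0} and tends to zero as *n* grows is sketched below. The specific functional form is an assumption, not the paper's rule.

```python
def learning_rate(n, alpha_0=0.1, n_max=10**6):
    """Visit-count-based decay of the learning rate: starts at alpha_0
    and decays toward zero as the visit count n grows. The form
    alpha_0 * n_max / (n_max + n) is one common choice and is an
    assumption here, not the paper's exact rule."""
    return alpha_0 * n_max / (n_max + n)
```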

The current strategy of the Q-P learning algorithm is *P*, and the value *Q* is updated with the value *P*. The processes of strategy evaluation and strategy improvement are executed repeatedly until the optimal maintenance strategy is obtained. The algorithm includes three essential steps: exploration, strategy evaluation, and strategy improvement; the detailed process is depicted in Figure 2.

Step 1: Initialization

(i) Initialize the maintenance strategy *P*(*s*, *a*) with random values; initialize the maximum number of strategy-improvement updates *E*_{max} and the maximum number of strategy-evaluation updates *N*_{max}; initialize the learning-rate parameters and the exploration-rate parameters; set the outer-loop counter *E* = 1.

(ii) According to the known maintenance strategy *P*(*s*, *a*), calculate the average reward rate *ρ*; initialize the state-action value of the strategy-evaluation process *Q*(*s*, *a*) = 0; set the current strategy-update counter *N* = 1 and the visit counters.

Step 2: Strategy Evaluation

(i) Initialize the current state *s* = (1, 0, 0), the average failure interval *T*_{f}, and the cumulative state-transition time *T*_{c}.

(ii) Choose the greedy action *a* with probability 1 − *p*_{n}; otherwise, select a random action *a* with probability *p*_{n}.

(iii) Simulate the decision action *a* in state *s*; the observation state is transformed to state *s*′. If *a* = 0, a new observation state is obtained, the transition time *t*(*s*, *a*, *s*′) and the reward *r*(*s*, *a*, *s*′) between states *s* and *s*′ are produced directly, the action value *Q* is updated by equation (3), and the state is updated. If the major-repair condition is met, jump to Step 2 (iv); otherwise, jump to Step 2 (v). If *a* = 1, the imperfect maintenance is performed, the new observation state and immediate reward are obtained, the action value *Q* is updated, *k* = *k* + 1, and the program jumps to Step 2 (v).

(iv) When the major repair is performed, the corresponding immediate reward and state-transition time are obtained, and the action value *Q* is updated. If *N* > *N*_{max}, jump to Step 3 (i); otherwise, jump to Step 2 (ii).

(v) Update the visit factors, the learning rate, and the exploration rate *p*_{n}; then jump to Step 2 (ii).

Step 3: Strategy Improvement

(i) Let *P* = *Q* and *E* = *E* + 1; if *E* = *E*_{max}, stop the learning process; otherwise, jump to Step 1 (ii) and continue learning.

(ii) According to the action value *P*, calculate the optimal strategy *π*^{∗} by using the following equation:
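The evaluation and improvement steps above can be sketched as a compact loop. This is an illustrative skeleton, not the authors' implementation: `simulate(s, a)` is a user-supplied environment returning the next state, reward, and sojourn time, and all constants are placeholders.

```python
import random

def qp_learning(simulate, states, actions, E_max=5, N_max=200,
                alpha_0=0.1, p_explore=0.2):
    """Skeleton of the Q-P strategy-iteration loop (Steps 1-3)."""
    P = {(s, a): random.random() for s in states for a in actions}
    for _ in range(E_max):                     # Step 3: improvement loop
        pi = {s: max(actions, key=lambda a: P[(s, a)]) for s in states}
        Q = {(s, a): 0.0 for s in states for a in actions}
        rho, total_r, total_t = 0.0, 0.0, 0.0  # average reward rate estimate
        s, n = states[0], 1
        while n <= N_max:                      # Step 2: strategy evaluation
            if random.random() > p_explore:    # greedy with prob. 1 - p_n
                a = pi[s]
            else:                              # explore with prob. p_n
                a = random.choice(actions)
            s_next, r, t = simulate(s, a)
            alpha = alpha_0 / n                # decaying learning rate
            best = max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r - rho * t + best - Q[(s, a)])
            total_r, total_t = total_r + r, total_t + t
            rho = total_r / total_t            # refresh the rate estimate
            s, n = s_next, n + 1
        P = Q                                  # improvement: P <- Q
    return {s: max(actions, key=lambda a: P[(s, a)]) for s in states}
```

The skeleton returns the greedy policy induced by the final action values; the stopping rules and visit counters of the full algorithm are omitted for brevity.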

##### 4.2. Optimal PM Time

In Section 4.1, the optimal maintenance strategy *π*^{∗} of the deteriorating equipment is obtained by the proposed method. In this section, the optimal maintenance strategy *π*^{∗} and the equipment deterioration model are used to estimate the future maintenance time corresponding to each observed state *s*_{i}. First, a one-dimensional vector *V*_{d} of unqualified-product counts and a one-dimensional vector *V*_{t} of production times are defined; they record, respectively, the accumulated number of unqualified products *b* and the production time *t* per unit product. During the process from production to maintenance, the failure interval is the sum of the sojourn times in the different quality states of the same deterioration mode. The initial action is *a* = 0, and a new observation state is produced after the equipment goes through production and quality inspection. Based on the maintenance policy *π*^{∗}, the action of each new state is obtained until the equipment performs a maintenance action. The vector *V*_{d} records the states from production to maintenance, and from the vector *V*_{t} the maintenance time point of each state *s* can be calculated directly, which is used as an effective PM time. Because the state-transition process in the simulation is random, the same state can be recorded many times, and the average value is taken as the PM time for that observed state.

The detailed process for obtaining the PM time is shown in Figure 3. First, the parameters related to the PM time are initialized, and then the production process of the equipment is simulated according to the known maintenance strategy *π*^{∗}; the quality states and the production process are random in the simulation. Applying the maintenance policy to the model of Figure 3 produces the PM time corresponding to each observed state *s*_{i}, and the mean value is taken as the final estimate.
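The averaging step can be sketched as follows, assuming a user-supplied `simulate_episode()` that replays one production-to-maintenance cycle under the fixed policy *π*^{∗} and returns, for each visited state, the remaining time until maintenance (the information carried by *V*_{d} and *V*_{t}):

```python
from collections import defaultdict

def estimate_pm_time(simulate_episode, n_runs=1000):
    """Monte Carlo estimate of the PM time for each observed state.
    simulate_episode() replays one production-to-maintenance cycle under
    the fixed policy pi* and yields (state, time_until_maintenance)
    pairs; the PM time of a state is its average over all visits."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_runs):
        for state, t_remaining in simulate_episode():
            sums[state] += t_remaining
            counts[state] += 1
    return {s: sums[s] / counts[s] for s in sums}
```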

#### 5. Simulation Study

The maintenance action is imperfect; that is, after maintenance, the quality state and the yield level of the equipment improve, but the equipment is not restored to an as-new state. To what extent, then, is the equipment restored after maintenance? This section explains this through the change of the yield level before and after maintenance. Following the ideas of Zhu et al. [26], for two consecutive deteriorating subcycles, the yield function relationship is as follows:

*t* represents the time since the equipment was last maintained or repaired; *b*_{k} is the degradation factor in equation (8), a value between 0 and 1; *a*_{k} is defined as an age degradation factor, also a value between 0 and 1; and *D*_{k} represents the time interval of the *k*'th subcycle. The discrete yield levels can be determined by equation (9), where *L* is the number of prespecified yield levels in each subcycle *k*.
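As an illustration of equations (8) and (9), the sketch below uses one plausible instantiation: an exponentially decaying yield within a subcycle, with the value factor *b*_{k} and the age factor *a*_{k} applied at each minor maintenance, and a uniform discretization into *L* levels. The specific functional forms and numeric values are assumptions, not the paper's.

```python
import math

def yield_level(t, k, k_y=0.05, a_k=0.3, b_k=0.9, D_k=20.0):
    """Illustrative continuous yield during subcycle k. Within a subcycle
    the yield decays as exp(-k_y * t); each minor maintenance multiplies
    the yield by b_k and leaves a residual virtual age a_k * D_k, in the
    spirit of hybrid imperfect-maintenance models [26]."""
    virtual_age = t + (k - 1) * a_k * D_k  # age not removed by MM
    return (b_k ** (k - 1)) * math.exp(-k_y * virtual_age)

def discretize(y, L=4):
    """Map a continuous yield in (0, 1] to one of L discrete levels
    (an illustrative reading of equation (9)); level 1 is the best."""
    return min(L, int(L * (1.0 - y)) + 1)
```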

##### 5.1. Numerical Experiments

According to the problem description and the model of the deteriorating equipment in this paper, the relevant parameters are given in Table 1. Other parameters are set as follows: the maximum number of strategy-improvement updates *E*_{max} = 15; the maximum number of strategy-evaluation updates *N*_{max} = 10000; the visit factor is the number of visits to a certain state, which changes over time. The yield level is a discretization of the equipment states; from a fuzzy point of view, it can be divided into four levels: excellent, good, medium, and poor. Each state corresponds to a certain time interval between failures. If the discretization of the equipment states is too fine, the simulated state will jump frequently and cannot reflect continuous production under a given condition. We assume a critical yield level, and *T*_{c} ≥ *T*_{f}, or reaching that critical level, is the condition for completing a single strategy evaluation. Due to the randomness of quality inspection, the worst critical condition of the equipment is designed to be 0.6 in order to ensure a correct jump in the simulation process; in the actual simulation, this condition is triggered only occasionally.

The method proposed in this paper is adopted for learning, and the learning results are shown in Figure 4, compared with the sequential preventive maintenance algorithm [27]. As can be seen from the figure, the SARRs of the strategies learned by both methods converge well, and in terms of SARR the proposed method is clearly better than the sequential preventive maintenance algorithm. This arises in part from the fact that the sequential preventive maintenance algorithm does not couple the maintenance policy with production so as to maximize the total SARR.

##### 5.2. Sensitivity Analysis of the Parameters

###### 5.2.1. Impact of Decrease Factor of Sojourn Time

The sojourn time *λ*_{kl} of each state is related to the decrease factor of sojourn time *b*_{s}: the smaller *b*_{s} is, the greater the change of *λ*_{kl}, and correspondingly the PM time also changes. As shown in Figure 5, the PM time increases as *b*_{s} decreases. The reason is that when *b*_{s} is smaller, the equipment must be maintained over a considerable period of time to produce qualified products with high probability, and the long-run expected SARR increases. For example, when *b*_{s} decreases from 1 to 0.6, the expected SARR changes from 20.6 to 30.

###### 5.2.2. Impact of Quality Detection Error

Figure 6 shows that the PM time for each observed state declines slightly as the probability of a Type II error *e*_{2} increases from 0 to 0.1. The reason is that the increase of *e*_{2} reduces the long-run expected SARR, which decreases from 31.8 to 30.6. At the same time, the PM time is not sensitive to the change of *e*_{2}; this is because the cost of a Type II error, *C*_{e2} = 100, is comparatively small. Similarly, the PM time declines slightly as *e*_{1} increases, because *C*_{e1} is comparatively small and a growing *e*_{1} reduces the long-run expected SARR.

###### 5.2.3. Impact of the Cost or Profit

*(1) Impact of the Cost C*_{f}*.* The parameter *C*_{f} refers to the cost of wrongly identifying a qualified product as unqualified. From Figure 7, we can see that the PM time decreases as the cost *C*_{f} increases; this is because an increase in *C*_{f} leads to a decrease in the long-run expected SARR. Meanwhile, Figure 7 shows that the PM time is rather insensitive to changes in *C*_{f}, which is caused by the very small false detection probability *p*_{f} assumed in this paper.

*(2) Impact of the Cost C*_{n}*.* The parameter *C*_{n} is the cost of wrongly identifying an unqualified product as qualified. As shown in Figure 8, the PM time decreases as *C*_{n} increases; this is because the long-run expected SARR decreases as the cost *C*_{n} increases. Figure 8 also shows that the PM time is not sensitive to changes in *C*_{n}, which is caused by the very small missed detection probability *p*_{n} assumed in this paper.

###### 5.2.4. Impact of Initial Quality Deterioration Rate *k*_{y}

The coefficient *k*_{y} describes the initial deterioration rate of the equipment, as shown in Figure 9. The PM time is not sensitive to changes in *k*_{y}; this is because varying the coefficient within a certain range makes essentially no difference to the SARR.

#### 6. Conclusion

In this paper, we propose a PM method for a single piece of deteriorating equipment with multiple yield quality problems. It is assumed that the yield stage is coupled with the equipment quality state and that a stochastic breakdown can occur in addition to quality failure. Moreover, the equipment cannot return to normal operating condition without repair. Two decision actions, MM and MR, are available in each observation state: the preventive maintenance action MM can be performed in a deteriorating quality state, while MR is forced in a failure state. A discrete-state, continuous-time SMDP model is proposed to represent the deterioration process of the equipment, and the Q-P method in the RL framework is used to solve it. Given product quality inspection data with certain detection errors, the optimal maintenance strategy for each observed state is produced with the goal of maximizing the long-run expected SARR. The PM time can then be obtained by a simulation method.

The simulation examples show that the proposed method can solve the PM problems of equipment in a dynamic environment. The experimental results also show that the proposed method outperforms the standard sequential preventive maintenance method with unequal time intervals. The change of the maintenance action rule is further shown: it is not monotonic in the number of maintenance actions and the unqualified rate. It can also be observed that the PM time depends on the observed state; it decreases as the total number of products produced increases, and, for a given total number of products, it also decreases monotonically as the number of unqualified products increases. Moreover, an increase in the number of maintenance actions also decreases the PM time. In addition, the influences of the main parameters on the optimization goal are investigated.

#### Data Availability

The calculation data used to support the findings of this study are included within the article.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work was supported by the Natural Science Foundation of Liaoning Province under Grant 20180550746 and the National Science Foundation of China under Grant 61901283.