Model-based methods can be used to detect anomalies in industrial robots, but they require a high level of expertise and are therefore difficult to implement. The lack of sufficient data on the anomalous operation of industrial robots limits data-driven anomaly detection methods. This study proposes the Sliding Window One-Dimensional Convolutional Autoencoder (SW1DCAE), an unsupervised vibration anomaly detection algorithm for industrial robots that acts directly on the original vibration signal and improves detection accuracy. First, the convolutional neural network and the autoencoder model are integrated to construct a one-dimensional convolutional autoencoder model. Second, the sliding window algorithm is used for data augmentation, and the dropout technique is introduced to improve the generalization ability of the model. Finally, the reconstruction error of the input sample is calculated and compared with the error threshold to determine whether the operation state of the industrial robot is normal. This study discusses the effect of different convolution kernel widths, sliding window sizes, dropout ratios, and other parameters on model performance. Validation with vibration signals collected from an industrial robot test bench shows that this unsupervised anomaly detection algorithm achieves good accuracy and F1 score.

1. Introduction

Industrial robots are the product of the fusion of multiple disciplines. They can replace humans in high-precision, high-intensity, high-risk, and highly repetitive work [1]. Industrial robots are booming and becoming an essential part of the automation industry. Statistics from the International Federation of Robotics show that the number of industrial robot applications has increased year by year and exceeded 40,000 units in 2018 [2]. As industrial robots develop and the number of applications grows, reliance on them is rising, and a breakdown during operation can cause loss of life and property. Therefore, it is necessary to detect the abnormal operation of industrial robots.

In recent years, researchers worldwide have carried out extensive research on anomaly detection for industrial robots. In general, anomaly detection methods fall into two main categories. The first is model-based anomaly detection. Chen et al. [3] proposed an algorithm that successfully detected minor collisions between two rotating planar manipulators and abnormal vibrations by combining a dynamics model and a friction model without the use of additional sensors. Dang et al. [4] proposed a fault detection algorithm and health assessment method for a robot operating system, which successfully detected faults and obtained the health status of industrial robots by considering internal and output measurement disturbances. The second category is data-driven anomaly detection. Vallachira et al. [5] proposed a combination of four preprocessing techniques to improve the performance of fault diagnosis for industrial robot gearboxes. Cheng et al. [6] proposed an unsupervised fault detection framework based on GMM, which utilizes the current signal to effectively detect the abnormal operation of industrial robots. Long et al. [7] combined a sparse autoencoder (SAE) and a support vector machine (SVM) to construct a fault diagnosis model for multijoint industrial robots by learning a pose dataset with multiple fault information. Jaber and Bicker [8] adopted a time-frequency signal analysis method based on the discrete wavelet transform to extract the most significant fault-related features and used an artificial neural network to classify the faults of robot joint gears. Eski et al. [9] proposed a fault detection method based on vibration and noise signals, using radial basis neural networks (RBNN) to process the signals and achieve fault detection in industrial robots.

Model-based anomaly detection needs an accurate mathematical model, which requires researchers to have a solid mathematical foundation and sufficient professional knowledge, in addition to spending considerable time and effort adjusting parameters and optimizing the model. By contrast, data-driven anomaly detection techniques do not require much prior knowledge and detect anomalies entirely from the collected signals and datasets.

In the field of data-driven fault diagnosis, there are two main types of methods: supervised and unsupervised. Among the supervised approaches, Wen et al. [10] proposed a LeNet-5-based CNN that successfully performed fault diagnosis for bearings, centrifugal pumps, and hydraulic pumps by extracting features from the time-frequency map of the signal. Jing et al. [11] proposed a CNN model that can learn features and detect gearbox faults directly from the frequency data of the vibration signal. Zhang et al. [12] proposed a TICNN model for bearing fault diagnosis, which can directly process raw vibration signals. Among unsupervised or semisupervised algorithms, Sohaib et al. [13] proposed a deep neural network based on a sparse stacked autoencoder (SSAE) to successfully detect bearing faults. Chen et al. [14] proposed a stacked variational autoencoder (VSAE) that adaptively extracts features from angle signals to detect diesel engine failures. The above research shows that data-driven fault diagnosis is effective. However, the classical supervised models, such as convolutional neural networks, rely on a large amount of labeled data, whereas failure data for mechanical equipment are often scarce and costly to obtain. Unsupervised or semisupervised models, such as autoencoders, usually use fully connected layers, which are less computationally efficient than convolutional networks.

In this study, a sliding window one-dimensional convolutional autoencoder (SW1DCAE)-based approach for unsupervised vibration anomaly detection in industrial robots is proposed. The main contributions are as follows:
(1) The SW1DCAE model uses convolutional, pooling, upsampling, and deconvolution layers instead of the fully connected layers of conventional autoencoders, allowing the raw sensor signal to be used directly as input to the model.
(2) The sliding window algorithm is used for data augmentation and for shaping samples of a specific length, and the dropout algorithm is introduced to prevent the model from overfitting and enhance its generalization ability. The effects of different convolution kernel widths and sliding window lengths on the performance of the model are discussed, and reasonable parameter settings are selected to improve the accuracy and F1 score of the model.
(3) SW1DCAE is an unsupervised detection algorithm that does not require labeled data and can be applied online without data accumulation. It has strong practical value and can be widely used in vibration anomaly detection of industrial robots and in other fields.

2. Theoretical Background

2.1. Autoencoder

The autoencoder is a classic unsupervised learning model that extracts the hidden features of input samples and reconstructs the samples from them. It usually consists of two parts, an encoder and a decoder, as shown in Figure 1.

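Suppose the input signal is defined as $X = \{x_1, x_2, \ldots, x_n\}$.

The encoding process is

$$H = f(W_1 X + b_1)$$

The decoding process is

$$\hat{X} = f(W_2 H + b_2)$$

where $X$ is the input vector, $H$ is the hidden feature, $\hat{X}$ is the output vector, $W_1$ and $W_2$ are weight matrices, $b_1$ and $b_2$ are the biases, and $f$ is the activation function.

ReLU is the activation function, and the formula is

$$f(x) = \max(0, x)$$

The loss function is the mean square error function, and the formula is expressed as

$$L(X, \hat{X}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$$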
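As a rough illustration of the encoding, decoding, and loss steps of this subsection, the following minimal sketch runs one forward pass of a fully connected autoencoder in plain Python. The layer sizes, the random weights, and the helper names (`affine`, `autoencode`, `random_matrix`) are assumptions made for the example, not values from this study, and no training is performed.

```python
import random

def relu(v):
    """Element-wise ReLU: max(0, x)."""
    return [max(0.0, x) for x in v]

def affine(W, v, b):
    """Compute W v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def autoencode(x, W1, b1, W2, b2):
    """Encode x into hidden features h, then decode h into a reconstruction."""
    h = relu(affine(W1, x, b1))       # encoding step
    x_hat = relu(affine(W2, h, b2))   # decoding step
    return h, x_hat

def mse(x, x_hat):
    """Mean squared error between a sample and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def random_matrix(rows, cols, rng):
    """Small random weights; illustrative initialisation only."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_in, n_hidden = 6, 3                 # hypothetical sizes
W1, b1 = random_matrix(n_hidden, n_in, rng), [0.0] * n_hidden
W2, b2 = random_matrix(n_in, n_hidden, rng), [0.0] * n_in
x = [rng.random() for _ in range(n_in)]
h, x_hat = autoencode(x, W1, b1, W2, b2)
print(len(h), len(x_hat), mse(x, x_hat))
```

With untrained weights the reconstruction error is arbitrary; training would adjust the weights and biases to reduce it.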

2.2. Convolutional Neural Network

Among supervised algorithms, the convolutional neural network is a very typical learning model. Convolutional neural networks usually consist of convolutional layers, pooling layers, and fully connected layers, as shown in Figure 2.

The convolution layer consists of a convolution kernel, which performs the convolution calculation. Its output equation is as follows:

$$x_m^{n} = \sum_{i} x_i^{n-1} * k_{im}^{n} + b_m^{n}$$

where $x_m^{n}$ is the convolution output of the $m$th channel of the $n$th layer, $x_i^{n-1}$ is the feature vector of the $i$th channel of the $(n-1)$th layer, $k_{im}^{n}$ is the convolution kernel, and $b_m^{n}$ is the offset.

The pooling layer is usually the maximum pooling layer, which can reduce the dimension of the feature surface and the training cost. The output formula is

$$p = \max\{a_1, a_2, \ldots, a_n\}$$

where $n$ is the width of the pooling area and $a_n$ is the $n$th data of the pooling area.

The fully connected layer is set after convolution and pooling. The output formula is

$$y = f(\omega p + b)$$

where $\omega$ is the weight, $b$ is the bias parameter, $f$ is the activation function, and $p$ is the input.
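To make the convolution and pooling operations concrete, the sketch below applies a single one-dimensional kernel followed by ReLU and max pooling to a short signal. The signal values and the two-tap kernel are hypothetical, and the convolution is implemented as cross-correlation without kernel flipping, which is how deep-learning libraries commonly implement it.

```python
def conv1d_valid(x, kernel, bias=0.0):
    """1-D 'valid' convolution (cross-correlation form) of x with one kernel."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]

def max_pool1d(x, width=2, stride=2):
    """Keep the largest value in each pooling window."""
    return [max(x[i:i + width]) for i in range(0, len(x) - width + 1, stride)]

signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 1.0, 4.0]   # hypothetical input
kernel = [1.0, -1.0]                                  # hypothetical difference kernel
features = [max(0.0, v) for v in conv1d_valid(signal, kernel)]  # ReLU
pooled = max_pool1d(features)
print(features)  # [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
print(pooled)    # [0.0, 2.0, 1.0]
```

The seven feature values shrink to three after pooling; the last value is dropped because it does not fill a complete window.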

2.3. Sliding Window Algorithm

Sliding window algorithms can be used for data augmentation, producing samples of a suitable length so that the model can perform at its best. The window slides over the time series in steps of a fixed length, producing samples with overlapping portions. For a time series $T = \{x_1, x_2, \ldots, x_n\}$, take the sliding window length to be $w$. A time series of length $w$ can be obtained by the sliding window algorithm, denoted as $S_1 = \{x_1, x_2, \ldots, x_w\}$. Then, set the starting point of the window at the second point of the time series $T$ to obtain a time series of length $w$, denoted as $S_2 = \{x_2, x_3, \ldots, x_{w+1}\}$, and so on. By intercepting overlapping time-series segments and shaping the original data into samples of appropriate length, the sliding window algorithm enables the network model to better capture the serial correlation of the data. The sliding window principle is shown in Figure 3.
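The windowing described above can be sketched in a few lines. The function name `sliding_windows` and the `step` parameter are illustrative choices; a step of 1 matches the description in which the second window starts at the second point of the series.

```python
def sliding_windows(series, window, step=1):
    """Cut a series into overlapping segments of fixed length."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, step)]

samples = sliding_windows([1, 2, 3, 4, 5], window=3)
print(samples)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

A series of length n yields n - w + 1 samples of length w at step 1, which is how the method turns one long recording into many training samples.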

3. Methods

3.1. Dropout

This study introduces the node random dropout technique (Dropout) proposed by Srivastava et al. [15] to prevent overfitting and improve the generalization ability of the model. The principle is shown in Figure 4. The former is an ordinary neural network, and the latter is a neural network with dropout added.

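Without dropout, the output of the neural network is

$$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\left(z_i^{(l+1)}\right)$$

When dropout is introduced, the output of the neural network is

$$r_j^{(l)} \sim \mathrm{Bernoulli}(p), \qquad \tilde{y}^{(l)} = r^{(l)} \odot y^{(l)}, \qquad z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\left(z_i^{(l+1)}\right)$$

where $r_j^{(l)}$ follows a Bernoulli distribution with parameter $p$.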

The dropout technique randomly selects which neuron nodes are kept; the dropped neurons no longer participate in the calculation, which weakens the co-adaptation among neurons and improves the generalization ability.
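A minimal sketch of the masking step is given below. It assumes `keep_prob` is the probability of keeping a node, so a dropout ratio of 0.2 corresponds to `keep_prob = 0.8`. The optional rescaling by `1 / keep_prob` is the "inverted dropout" variant that many libraries use; it is included here as an assumption and is not stated in this study.

```python
import random

def dropout(y, keep_prob, rng, rescale=True):
    """Apply a Bernoulli(keep_prob) mask to the activations y."""
    mask = [1 if rng.random() < keep_prob else 0 for _ in y]
    scale = 1.0 / keep_prob if rescale else 1.0   # inverted-dropout scaling (assumed)
    return [v * r * scale for v, r in zip(y, mask)], mask

rng = random.Random(0)
activations = [1.0] * 1000
dropped, mask = dropout(activations, keep_prob=0.8, rng=rng)
print(sum(mask) / len(mask))  # fraction of kept nodes, close to 0.8
```

At test time the mask is not applied; with the rescaled variant the activations can then be used unchanged.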

3.2. Construction of the Model

The one-dimensional convolutional autoencoder takes the autoencoder as the basic architecture and the convolution kernel as the basic computing unit. It integrates the convolutional neural network and the autoencoder model and extracts the hidden features of input samples through convolution, pooling, and deconvolution. The network structure is shown in Figure 5.

The SW1DCAE model is divided into three working phases: data processing, encoding, and decoding.

In the data processing stage, the sliding window algorithm cuts the long time series into several short time series to generate training samples, so that the network can better capture the serial correlation of the data. Suppose the time step (the length of the window) is $T$ and the dimension of the observation data in the window is $d_x$; the shape of each input sample is then $(T, d_x)$.

The encoder consists of a one-dimensional convolutional layer, a dropout layer, and a one-dimensional pooling layer. The convolutional and pooling layers encode the data to find the implied features. For the $k$th convolutional kernel, the output is

$$h_k = \mathrm{ReLU}(x * W_k + b_k)$$

where $x$ is the input sample, $W_k$ and $b_k$ are the kernel weights and bias, $*$ denotes convolution, and the ReLU activation function is

$$\mathrm{ReLU}(x) = \max(0, x)$$

The dropout layer helps prevent overfitting. The pooling layer halves the dimensionality of the data, and its output is

$$p_j = \max\{h_{(j-1)s+1}, \ldots, h_{(j-1)s+k_p}\}$$

where the pooling step size, $s$, is set to 2 and the pooling window size, $k_p$, is set to 2.

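The decoder consists of an upsampling layer and a deconvolution layer. The output of the upsampling layer is

$$u_i = \begin{cases} p_j, & i = l_j \\ 0, & \text{otherwise} \end{cases}$$

where $l_j$ is the location of the largest value recorded by the pooling layer.

The deconvolution layer is the reverse calculation of the convolution layer, and the output of the $i$th channel is

$$y_i = \sum_{j} u_j \circledast \tilde{W}_{ji} + c_i$$

where $\tilde{W}_{ji}$ is the deconvolution kernel, $c_i$ is the bias, and $\circledast$ represents the deconvolution calculation. The training objective of this model is to minimize the reconstruction error between the input and output. The mean squared error (MSE) function is chosen as the loss function and expressed as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{t=1}^{n} (x_t - \hat{x}_t)^2$$

where $x_t$ and $\hat{x}_t$ are the $t$th points of the input sample and of its reconstruction, and $n$ is the sample length.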
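The interplay between pooling and upsampling can be sketched as follows: the pooling step records where each maximum came from, and the upsampling step writes each pooled value back to that location and fills the remaining positions with zeros. This is one plausible reading of the description above; the function names and the example values are hypothetical.

```python
def max_pool_with_locations(h, width=2, stride=2):
    """Max pooling that also records the position of each maximum."""
    pooled, locations = [], []
    for start in range(0, len(h) - width + 1, stride):
        window = h[start:start + width]
        offset = max(range(width), key=lambda j: window[j])
        pooled.append(window[offset])
        locations.append(start + offset)
    return pooled, locations

def upsample_from_locations(pooled, locations, length):
    """Place pooled values back at their recorded positions; zeros elsewhere."""
    out = [0.0] * length
    for value, loc in zip(pooled, locations):
        out[loc] = value
    return out

h = [0.1, 0.9, 0.4, 0.3, 0.0, 0.7]               # hypothetical feature map
pooled, locations = max_pool_with_locations(h)
restored = upsample_from_locations(pooled, locations, len(h))
print(pooled, locations)  # [0.9, 0.4, 0.7] [1, 2, 5]
print(restored)           # [0.0, 0.9, 0.4, 0.0, 0.0, 0.7]
```

With a window size and step of 2, pooling halves the length and upsampling restores it, so the decoder output can match the input length.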

The magnitude of the reconstruction error (MSE) is evaluated to determine whether an anomaly has occurred in the current input. In this study, the boxplot method is chosen as the criterion for determining anomalies. The boxplot method is based on quartiles and the interquartile range and is more robust than the 3σ criterion because it does not assume a particular distribution of the data. This is shown in Figure 6.

When the boxplot method is used to determine abnormality, data points lying beyond the upper or lower edge are treated as abnormal points, and the edges serve as the judgment threshold. The formula for calculating the upper and lower edges is

$$\mathrm{Upper} = Q_3 + 1.5 \times \mathrm{IQR}, \qquad \mathrm{Lower} = Q_1 - 1.5 \times \mathrm{IQR}, \qquad \mathrm{IQR} = Q_3 - Q_1$$

where $Q_3$ is the upper quartile of the reconstruction error distribution produced by the SW1DCAE model in the normal state of the robot, $Q_1$ is the lower quartile of the same distribution, and IQR is the interquartile range.

The abnormal judgment is based on

$$\text{state} = \begin{cases} \text{abnormal}, & \mathrm{MSE} > \mathrm{Upper} \\ \text{normal}, & \text{otherwise} \end{cases}$$

When the loss generated by the sample input into the SW1DCAE model exceeds the threshold, it is determined to be abnormal.
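The threshold rule can be sketched with the standard library as follows. The factor 1.5 is the conventional boxplot whisker factor and is an assumption here, as is the choice of the "inclusive" quartile method; the error values are placeholders rather than measurements from this study.

```python
import statistics

def boxplot_edges(errors, k=1.5):
    """Lower and upper boxplot edges of a sample of reconstruction errors."""
    q1, _, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_abnormal(error, upper):
    """A sample is abnormal when its reconstruction error exceeds the upper edge."""
    return error > upper

normal_errors = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # placeholder normal-state errors
lower, upper = boxplot_edges(normal_errors)
print(lower, upper)                            # -3.0 13.0
print(is_abnormal(20.0, upper), is_abnormal(5.0, upper))  # True False
```

Because the reconstruction error is nonnegative, only the upper edge matters in practice; the lower edge is computed for completeness.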

3.3. Model Training and Optimization

The SW1DCAE model learns the implicit features by reconstructing the input signal, and the loss function of the network model is

$$L(X, \hat{X}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$$

where $X$ is the input signal and $\hat{X}$ is the output signal.

The model parameters are optimized by the gradient descent method, and the calculation method for the gradient of the convolutional layer iswhere x is the input and C is the output, representing the convolution operation, and flip represents the transpose of the matrix.

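For max-pooling layers, the gradient is calculated as

$$\frac{\partial L}{\partial x_i} = \begin{cases} \dfrac{\partial L}{\partial p}, & x_i = \max(x) \\ 0, & \text{otherwise} \end{cases}$$

where $x$ is the input and $p$ is the output.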

For upsampling layers, the gradient is calculated aswhere is the ith input, is the ith output, and k is the upsampling step size.

The gradient of the deconvolution layer is calculated as

Among them, is the jth deconvolution kernel, x is the input signal, D is the output, m is the deconvolution step size, and t is the length of x of the input signal.

The training and testing process for the SW1DCAE model is shown in Algorithm 1.

Step 1: unsupervised training
(0) Apply sliding window processing to the training set
(1) Build the SW1DCAE model; set the learning rate, number of training epochs, and other hyperparameters
(2) Randomly initialize the weights and biases of the network model
(3) Input the training set
(4) Repeat for N training epochs
(5)  Encode: convolution, pooling
(6)  Decode: upsampling, deconvolution
(7)  Calculate the reconstruction error
(8)  Gradient descent: update the parameters of each layer
(9) End loop
(10) Save the model structure and parameters
Output: reconstructed data $\hat{X}$, approximately equal to the input data X.
Step 2: anomaly test
(0) Apply sliding window processing to the test set
(1) Input the test set into the trained model saved in Step 1
(2) Set the loss threshold; a test sample whose reconstruction error exceeds the threshold is judged as abnormal
Output: anomaly decision for each test sample.
3.4. Vibration Anomaly Detection Process Based on the SW1DCAE Model

The vibration anomaly detection process based on the SW1DCAE model is shown in Figure 7. The process is divided into two parts: offline training and online detection. In the offline training part, the model is trained with the training data and the trained model is saved. In the online detection part, the trained model is loaded and used directly to judge whether a test sample is abnormal. The anomaly detector maintains the incoming data stream in a time window of fixed length. For each window of data fed to the anomaly detector, the reconstruction error is calculated and compared with the threshold to determine whether an anomaly exists.
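The online part of this process can be sketched as a small streaming detector that buffers the most recent samples, reconstructs the window, and compares the window's MSE with a threshold. The class name `StreamingDetector`, the window length of 3, the threshold of 1.0, and above all the stand-in `zero_model` are assumptions for illustration; in the actual method the trained SW1DCAE model would supply the reconstruction and the boxplot rule would supply the threshold.

```python
from collections import deque

def window_mse(window, reconstruction):
    """Mean squared reconstruction error over one window."""
    return sum((a - b) ** 2 for a, b in zip(window, reconstruction)) / len(window)

class StreamingDetector:
    """Keeps a fixed-length window over a data stream and flags anomalies."""

    def __init__(self, reconstruct, window, threshold):
        self.reconstruct = reconstruct       # callable: window -> reconstruction
        self.buffer = deque(maxlen=window)   # the oldest sample drops out automatically
        self.threshold = threshold

    def update(self, value):
        """Feed one sample; return None until the window is full, else True/False."""
        self.buffer.append(value)
        if len(self.buffer) < self.buffer.maxlen:
            return None
        window = list(self.buffer)
        return window_mse(window, self.reconstruct(window)) > self.threshold

def zero_model(window):
    """Stand-in for a trained model: always 'reconstructs' a flat zero signal."""
    return [0.0] * len(window)

detector = StreamingDetector(zero_model, window=3, threshold=1.0)
decisions = [detector.update(v) for v in [0.1, 0.0, -0.1, 5.0]]
print(decisions)  # [None, None, False, True]
```

The detector returns no decision until the first window is full, after which each new sample produces one decision for the most recent window.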

4. Experimentation and Analysis

4.1. Experimental Setup

To demonstrate the validity of the SW1DCAE model, vibration anomaly detection was performed on vibration signal data collected from the APE sv45 six-axis industrial robot experimental rig. The experimental rig is shown in Figure 8 and contains the six-axis robot body, actuator, controller, and accelerometer.

The vibration signals were collected by accelerometers fixed to the robot axes at a sampling frequency of 150 Hz. The robot was operated in point-to-point (PTP) mode between set points, and more than 110,000 vibration signal points were collected during the experiment. The signals collected in this phase are normal signals, which vary periodically with the fixed running cycle of the robot, and they can be used to train the network model. To simulate anomalous conditions, abnormal vibrations were induced by manually hitting the robot. The criterion for identifying abnormal vibration signals is statistical: signals exceeding the upper or lower limit of the box plot of the normal signal amplitude are identified as abnormal vibration amplitudes, and such signals are used as abnormal samples to construct the test set. Finally, the acceleration of the third axis was selected to build a training set with 5836 samples and a test set with 5836 samples. All samples in the training set are normal signal samples, and the test set contains 2500 abnormal signal samples. Before training, the data were converted to standard scores, i.e., a mean of 0 and a variance of 1. Some of the original time-domain signals are shown in Figure 9. Before online detection, the data stream must be cleaned to handle the common problem of missing data. Missing values are filled using nearest completion: the last available value is used in place of the current missing value, which adds almost no overhead and preserves real-time detection.
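The two preprocessing steps mentioned above, standardization and nearest completion of missing values, can be sketched as follows. Representing a missing sample as `None` or NaN and using an `initial` value when the stream starts with a gap are assumptions of this example.

```python
import statistics

def standardize(values):
    """Convert values to standard scores (zero mean, unit variance)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant signal")
    return [(v - mean) / std for v in values]

def fill_missing(stream, initial=0.0):
    """Replace each missing sample (None or NaN) with the last valid sample."""
    last = initial
    for value in stream:
        if value is None or value != value:   # NaN is the only value unequal to itself
            yield last
        else:
            last = value
            yield value

raw = [1.0, None, float("nan"), 2.0]
cleaned = list(fill_missing(raw))
print(cleaned)                          # [1.0, 1.0, 1.0, 2.0]
scores = standardize([1.0, 2.0, 3.0, 4.0])
print(statistics.fmean(scores), statistics.pstdev(scores))
```

Because `fill_missing` is a generator that only remembers the previous value, it can run on a live stream without buffering. In an online setting, the mean and standard deviation would typically be taken from the training data rather than recomputed on incoming samples.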

4.2. Experimental Part

During the experiment, the size of the sliding window was set to 50, the dropout ratio was set to 0.2, the batch size was set to 128, the Adam optimizer was selected, and the initial learning rate was 0.001. The structure of each layer in the network is shown in Table 1.

For the encoder part, a wider convolution kernel was chosen in the first convolution layer for better feature extraction. A window size of 2 and a step size of 2 were chosen in the pooling layer so that the dimensionality of the data could be reduced while retaining the important features. The parameter settings for the decoder part, the upsampling layer, and the deconvolution layer are the same as those corresponding to the encoder part.

The training loss, computed with the mean squared error (MSE) function, is shown in Figure 10. The losses on the training set and validation set decrease gradually as the number of iterations increases, indicating that the model trains well.

To test how well the SW1DCAE model reconstructs samples, the vibration signal of the industrial robot in its normal operating state was used as input, and the reconstruction is shown in Figure 11.

As can be seen from the figure, when the vibration signal of the robot in the normal state is fed into the SW1DCAE model, the model reconstructs the input signal and correctly characterizes the normal-state vibration.

To test the response of the SW1DCAE model to abnormal signals, the experiment simulates vibration anomalies by impacting the robot and then inputs the signal containing the abnormal vibration amplitude into the model. The original and reconstructed signals are shown in Figure 12.

As can be seen from the figure, the reconstructed signal deviates substantially from the original input signal under the impact, which indicates that an anomaly may have occurred in the robot.

A comparison of the sample reconstruction error calculated by the model with the threshold is shown in Figure 13. The red dotted line represents the threshold, and the reconstruction error exceeding the threshold is judged as abnormal.

The model calculates the reconstruction error of the test sample and compares it with the abnormality judgment threshold. The sample whose reconstruction error exceeds the threshold can be judged as abnormal.

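Because the model calculation takes a certain amount of time, signal reconstruction and detection are delayed by about 9 ms. The model is evaluated using precision, recall, and F1 score:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where $TP$ is the number of abnormal samples correctly judged as abnormal, $FP$ is the number of normal samples judged as abnormal, and $FN$ is the number of abnormal samples judged as normal.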
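These metrics follow the usual definitions and can be computed directly from the true and predicted labels. In the sketch below, label 1 denotes an abnormal sample, which is an assumption of the example; the label lists are made up and are not results from this study.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 score with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0]   # made-up ground truth
y_pred = [1, 1, 0, 1, 0]   # made-up predictions
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In this toy case there are two true positives, one false positive, and one false negative, so all three metrics equal 2/3.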

To further demonstrate the superiority of the SW1DCAE model, it was compared experimentally with other detectors, as shown in Table 2.
(1) SW1DCAE: model proposed in the text
(2) AE: unsupervised learning of robotic vibration signals to detect anomalies using a classically structured autoencoder
(3) KNN: a method for anomaly detection using the distance between different eigenvalues
(4) SW1DCAE (binary cross-entropy loss function): binary cross-entropy is used as the loss function of this model for abnormal detection of vibration signals

As can be seen from the table, the model proposed in this study outperforms other models in terms of accuracy, recall, and F1 score. The model can extract features directly from the original signal and reconstruct the signal, combining convolutional neural networks, autoencoder, and sliding windows effectively, thus showing good performance in robot vibration anomaly detection.

5. Parameter Sensitivity Analysis

In the model of this study, there are some important parameters, such as the length of the sliding window, the width of the convolution kernel, and the dropout ratio. In this section, we analyze these parameters.

The length of the sliding window determines whether the model can capture data features effectively; on the other hand, it also determines the computational cost and thus directly affects the performance of the model. Figure 14 compares the F1 score and detection time of the model for sliding window lengths between 30 and 150, with the first convolution kernel width fixed at 32. The F1 score of the model is approximately 0.97 when the sliding window length is 50. As the sliding window length increases, the detection time also increases, which means a higher computational cost, while the F1 score tends to decrease. Therefore, the sliding window should not be too long, and a sliding window length of 50 is chosen for this model.

The input to the model is a one-dimensional vibration signal, and because the convolution kernel of SW1DCAE has a filtering effect on it, the effect of the convolution kernel width on model performance needs to be discussed. Figure 15 shows the model training loss for different convolution kernel widths at a sliding window length of 50. Wider convolution kernels can extract features from one-dimensional signals effectively, but an overly wide kernel increases computational complexity and introduces redundant parameters, which in turn degrade the extracted features. The loss drops fastest and reaches its lowest value when the convolution kernel width is 32.

Different convolution kernel widths have different extraction effects on the vibration signal. As can be seen from the figure, the F1 score of the model tends to increase as the width of the convolution kernel increases, but too wide a convolution kernel does not continue to improve the performance of the model but instead takes up a lot of resources and the detection time becomes longer. Therefore, in this study, 32 is chosen as the width of the convolution kernel for the first layer of convolution.

Dropout can suppress network overfitting, improve the generalization ability of the network, and make the neuron representations more independent. The dropout ratio affects the performance of the model and generally takes a value between 0.2 and 0.5 [15]. Different ratios were chosen for comparison experiments, and the results are shown in Figure 16.

The curve in the figure represents the F1 score at different ratios, and the bars represent the loss of the model at different ratios. The experimental results show that when the dropout ratio is 0.2, the F1 score of the model is the highest and the loss is the smallest, so the dropout ratio is chosen to be 0.2.

Batch size is the number of samples used in each training step, and it affects both the degree of optimization and the training speed of the model [16]. Figure 17 compares different batch sizes using the F1 score and detection time as evaluation metrics. Taking the computational cost into account, the F1 score of the model is highest when the batch size is 128.

6. Conclusion

This study proposes an unsupervised detection method for vibration anomalies in industrial robots based on SW1DCAE. The method uses a sliding window to directly shape the raw time-series signal to better identify normal and abnormal patterns in the data. In addition, the convolutional structure is introduced into the autoencoder model, which improves the learning ability of the model. Finally, experiments verify the superiority and feasibility of the method. The main conclusions are as follows:
(1) A one-dimensional convolution kernel is introduced as a computing unit in the autoencoder to effectively extract the spatiotemporal features of the data and improve the performance of identifying normal and abnormal patterns of the data. A sliding window algorithm is introduced to shape the original data into samples of suitable length to better capture the sequence correlation of the data.
(2) The method does not require much a priori knowledge of industrial robots and does not require the construction of mathematical or dynamical models, and the required vibration signals are common and easy to measure.
(3) Compared with other algorithms, SW1DCAE does not require data annotation and can be applied online without data accumulation. It has more practical value and can be widely used in vibration anomaly detection of industrial robots and in other fields.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This research was supported by the National Natural Science Foundation of China (No.: 52105182) and the Major Science and Technology Project of Henan Province (Project Name: research and industrialization of key technologies of high-end bearings for major equipment).